hzi-bifo / mdasim

MDAsim 2: a Multiple Displacement Amplification Simulator (continued development from: https://sourceforge.net/projects/mdasim/)
GNU General Public License v3.0
1 stars 1 forks source link

inferring correct positions on original MDAsim input sequence #15

Closed dlaehnemann closed 6 years ago

dlaehnemann commented 6 years ago

As you were saying, @VSack, that the positions currently reported by MDAsim seem very inconsistent, we should opt for simply substituting the current position = <LPOF> with something like original_position = <OP>, as you suggested in an ePySCOPE discussion. Great that you already started on this!

VSack commented 6 years ago

We now have a header

>R1 | length = 100450 |fragment = 0 | position = 100450 | ref = 1 | strand = + 

So each header starts with the unique read number and has the length at second position (as discussed in #14). The fragment and the position fields hold the same information as in 1.2, however I've added refwhich points to the starting position on the original reference strand (1-based) as well as strand which states if this is the positive strand or the reverse one. The ref-value counts from 1 to length both for positive and negative strands.

However there remains one issue: I've written a script that checks if the sequences (amplicon and original reference) actually match each other and it performs well for the most part. However some sequences seem to have negative values for ref (like 4 out of 372 in my current test data). I'm not sure yet where this is coming from.

dlaehnemann commented 6 years ago

Seems like someone is back from vacation! ;)

Thanks already for closing #14, one little question there: the read number (R1 above) and the fragment number (fragment = 0) seem to be redundant information, i.e. the same value just off by one because one is the 0-based index counter in the source code and the other the 1-based name that is output. If so, I would probably remove the fragment number from the header in one of the commits for this issue. Or am I missing something?

Otherwise, thanks for testing the new output in this detail -- it's a shame this code didn't come with unit tests, and I think introducing some now would be a bit too much overhead...

As for your negative values, some quick ideas: How far negative are they? If just slightly below zero, they might be attributable to off-by-one errors? I also had a quick look at the code area that you modified for this, but have to admit I couldn't immediately follow the logic there. Maybe you could walk me through your understanding of the variables and the code in a videocall?

VSack commented 6 years ago

Concerning the fragment, according to the ReadMe, the value is not the same counter as the read ID but references the read ID of the fragment that this amplicon was created from. 0 is the input fasta sequence.

As for the negative values, think -1000 and even less. So no off-by-one-errors, unfortunately.

VSack commented 6 years ago

So I did some more analysis on the positions that MDAsim is putting out vs. the positions it should put out.

As you can also see in the latest commit for this issue, the script tries to map the amplicon onto the original sequence at different positions, indicated through a list of tuples corrections = [(<offset>, <reverse_and_complement_string)].

For a list of 343 amplicons, I'm getting the following counts for different tuples:

Even though there's a pattern showing here, I'm not sure yet where these offsets are coming from. They'd need to be sorted by whether

I'm certain that it has something to do with my workaround to get the positions where a new fragment starts and thus the original position has not been copied (we talked about that in our video call), but finding the actual causes would be the next step.

VSack commented 6 years ago

Follow-up

For the 343 amplicons above, we got the groups

corrections = [(-1, False), (0, True), (1, False), (((-1)*len(am_seq))+1, False), ((len(am_seq)), True)]

with the counts [63, 99, 18, 88, 75] in total. Further grouping gave me:

What's odd is the fact that for the last two positions in those counts, positive and negative seems to be switched. For the 75 in the first list here, the read list says they are on the positive strand, but the script needs to reverse/complement in order to find a match for them. Same way around for the 88 in the second list: according to the read list, it's the negative strand, but fits if treated like a positive strand.

That's befuddling as is.

But wait.

There is more.

If I switch the last two positions in the corrections-list like:

corrections = [(-1, False), (0, True), (1, False), ((len(am_seq)), True), (((-1)*len(am_seq))+1, False)]

I get different counts (the script stops when it finds a match for an amplicon and does not try further corrections):

So all of a sudden we also have amplicons that ARE the last on their fragment, that NEED to be treated like a negative strand, and that are off by their length. We didn't have these 10 amplicons in the previous grouping because they matched elsewhere.

Which leads us back to the question, what's the ground truth (if we have multiple matching possibilities) and how all of this does link with the MDAsim algorithm...

VSack commented 6 years ago

Finally, this issue was solved all of a sudden. Here's the twist:

  1. Whenever a new fragment is created, the first bases need to get the right original from the reference
  2. When there has been no value set by any chance, e.g. if the pos.original == 0, we infer the value from the preceding base
  3. In the output file, all positions are now 0-based
  4. Fragments that have strand = - need to be reverted and complemented when you want to align them to the original sequence, but can then be aligned starting from the position of ref

When you run the current version of the positionTesting.py-script, you'll hopefully get an output like this:

[(0, False), (0, True), (1, False), (<len>, False), (<len>, True)]
SUCCESSFUL matches per correction: [176, 170, 0, 0, 0]
on + strand [176, 0, 0, 0, 0]
on - strand [0, 170, 0, 0, 0]

which means that we FINALLY have what we wanted and we can now go on, either with commenting the code further or passing these positions on to following steps of ePySCOPE.

---EDIT--- Will do a little output clean-up in a minute, but I wanted to commit the working state before it's gone.

VSack commented 6 years ago

The lines of interest were https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L759 https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L798

https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L1029-L1045 https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L1070-L1086