inferring correct positions on original MDAsim input sequence

dlaehnemann commented 6 years ago

As you were saying, @VSack, that the positions currently reported by MDAsim seem very inconsistent, we should opt for simply substituting the current position = <LPOF> with something like original_position = <OP>, as you suggested in an ePySCOPE discussion. Great that you already started on this!

VSack commented 6 years ago

We now have a header

>R1 | length = 100450 |fragment = 0 | position = 100450 | ref = 1 | strand = +

So each header starts with the unique read number and has the length at second position (as discussed in #14). The fragment and the position fields hold the same information as in 1.2, however I've added refwhich points to the starting position on the original reference strand (1-based) as well as strand which states if this is the positive strand or the reverse one. The ref-value counts from 1 to length both for positive and negative strands.

However there remains one issue: I've written a script that checks if the sequences (amplicon and original reference) actually match each other and it performs well for the most part. However some sequences seem to have negative values for ref (like 4 out of 372 in my current test data). I'm not sure yet where this is coming from.

dlaehnemann commented 6 years ago

Seems like someone is back from vacation! ;)

Thanks already for closing #14, one little question there: the read number (R1 above) and the fragment number (fragment = 0) seem to be redundant information, i.e. the same value just off by one because one is the 0-based index counter in the source code and the other the 1-based name that is output. If so, I would probably remove the fragment number from the header in one of the commits for this issue. Or am I missing something?

Otherwise, thanks for testing the new output in this detail -- it's a shame this code didn't come with unit tests, and I think introducing some now would be a bit too much overhead...

As for your negative values, some quick ideas: How far negative are they? If just slightly below zero, they might be attributable to off-by-one errors? I also had a quick look at the code area that you modified for this, but have to admit I couldn't immediately follow the logic there. Maybe you could walk me through your understanding of the variables and the code in a videocall?

VSack commented 6 years ago

Concerning the fragment, according to the ReadMe, the value is not the same counter as the read ID but references the read ID of the fragment that this amplicon was created from. 0 is the input fasta sequence.

As for the negative values, think -1000 and even less. So no off-by-one-errors, unfortunately.

VSack commented 6 years ago

So I did some more analysis on the positions that MDAsim is putting out vs. the positions it should put out.

As you can also see in the latest commit for this issue, the script tries to map the amplicon onto the original sequence at different positions, indicated through a list of tuples corrections = [(<offset>, <reverse_and_complement_string)].

For a list of 343 amplicons, I'm getting the following counts for different tuples:

63: (-1, False) These are the positive-strand-sequences that can be aligned as they are. The -1 is caused by the 1-based-positions for positive strands caused by MDAsim
99: (0, True) These are the negative-strand-sequences that need to be reversed and complemented but fit besides that
18: (1, False) Positive strand sequences that are basically off by 2
88: (-len(am_seq)+1, False) are positive strand sequences that are off by the length of the amplicon sequence
75: (len(am_seq), True) need to be reverse and complement and are off by the length of the sequence, but in the other direction

Even though there's a pattern showing here, I'm not sure yet where these offsets are coming from. They'd need to be sorted by whether

it's a positive or negative strand
it's the last read on a fragment
maybe also if it was occupied or not (you can never be sure)

I'm certain that it has something to do with my workaround to get the positions where a new fragment starts and thus the original position has not been copied (we talked about that in our video call), but finding the actual causes would be the next step.

VSack commented 6 years ago

Follow-up

For the 343 amplicons above, we got the groups

corrections = [(-1, False), (0, True), (1, False), (((-1)*len(am_seq))+1, False), ((len(am_seq)), True)]

with the counts [63, 99, 18, 88, 75] in total. Further grouping gave me:

[0, 0, 18, 0, 75] for amplicons being the last on their fragmentand being the positive strand according to the read list
[0, 30, 0, 88, 0] for amplicons being the last on their fragment and being the negative strand according to the read list
[63, 69, 0, 0, 0] for amplicons not being the last on their fragment

What's odd is the fact that for the last two positions in those counts, positive and negative seems to be switched. For the 75 in the first list here, the read list says they are on the positive strand, but the script needs to reverse/complement in order to find a match for them. Same way around for the 88 in the second list: according to the read list, it's the negative strand, but fits if treated like a positive strand.

That's befuddling as is.

But wait.

There is more.

If I switch the last two positions in the corrections-list like:

corrections = [(-1, False), (0, True), (1, False), ((len(am_seq)), True), (((-1)*len(am_seq))+1, False)]

I get different counts (the script stops when it finds a match for an amplicon and does not try further corrections):

[63, 99, 18, 85, 78] total
[0, 0, 18, 75, 0] for amplicons being the last on their fragmentand being the positive strand according to the read list
[0, 30, 0, 10, 78]for amplicons being the last on their fragment and being the negative strand according to the read list
[63, 69, 0, 0, 0] for amplicons not being the last on their fragment

So all of a sudden we also have amplicons that ARE the last on their fragment, that NEED to be treated like a negative strand, and that are off by their length. We didn't have these 10 amplicons in the previous grouping because they matched elsewhere.

Which leads us back to the question, what's the ground truth (if we have multiple matching possibilities) and how all of this does link with the MDAsim algorithm...

VSack commented 6 years ago

Finally, this issue was solved all of a sudden. Here's the twist:

Whenever a new fragment is created, the first bases need to get the right original from the reference
When there has been no value set by any chance, e.g. if the pos.original == 0, we infer the value from the preceding base
In the output file, all positions are now 0-based
Fragments that have strand = - need to be reverted and complemented when you want to align them to the original sequence, but can then be aligned starting from the position of ref

When you run the current version of the positionTesting.py-script, you'll hopefully get an output like this:

[(0, False), (0, True), (1, False), (<len>, False), (<len>, True)]
SUCCESSFUL matches per correction: [176, 170, 0, 0, 0]
on + strand [176, 0, 0, 0, 0]
on - strand [0, 170, 0, 0, 0]

which means that we FINALLY have what we wanted and we can now go on, either with commenting the code further or passing these positions on to following steps of ePySCOPE.

---EDIT--- Will do a little output clean-up in a minute, but I wanted to commit the working state before it's gone.

VSack commented 6 years ago

The lines of interest were https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L759 https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L798

https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L1029-L1045 https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L1070-L1086

hzi-bifo / mdasim

inferring correct positions on original MDAsim input sequence #15

Follow-up

There is more.