Closed dlaehnemann closed 6 years ago
We now have a header
>R1 | length = 100450 |fragment = 0 | position = 100450 | ref = 1 | strand = +
So each header starts with the unique read
number and has the length
at second position (as discussed in #14). The fragment
and the position
fields hold the same information as in 1.2, however I've added ref
which points to the starting position on the original reference strand (1-based) as well as strand
which states if this is the positive strand or the reverse one.
The ref
-value counts from 1 to length
both for positive and negative strands.
However there remains one issue: I've written a script that checks if the sequences (amplicon and original reference) actually match each other and it performs well for the most part. However some sequences seem to have negative values for ref
(like 4 out of 372 in my current test data). I'm not sure yet where this is coming from.
Seems like someone is back from vacation! ;)
Thanks already for closing #14, one little question there: the read number (R1
above) and the fragment number (fragment = 0
) seem to be redundant information, i.e. the same value just off by one because one is the 0-based index counter in the source code and the other the 1-based name that is output. If so, I would probably remove the fragment number from the header in one of the commits for this issue. Or am I missing something?
Otherwise, thanks for testing the new output in this detail -- it's a shame this code didn't come with unit tests, and I think introducing some now would be a bit too much overhead...
As for your negative values, some quick ideas: How far negative are they? If just slightly below zero, they might be attributable to off-by-one errors? I also had a quick look at the code area that you modified for this, but have to admit I couldn't immediately follow the logic there. Maybe you could walk me through your understanding of the variables and the code in a videocall?
Concerning the fragment, according to the ReadMe, the value is not the same counter as the read ID but references the read ID of the fragment that this amplicon was created from. 0
is the input fasta sequence.
As for the negative values, think -1000
and even less. So no off-by-one-errors, unfortunately.
So I did some more analysis on the positions that MDAsim is putting out vs. the positions it should put out.
As you can also see in the latest commit for this issue, the script tries to map the amplicon onto the original sequence at different positions, indicated through a list of tuples corrections = [(<offset>, <reverse_and_complement_string)]
.
For a list of 343 amplicons, I'm getting the following counts for different tuples:
(-1, False)
These are the positive-strand-sequences that can be aligned as they are. The -1
is caused by the 1-based
-positions for positive strands caused by MDAsim(0, True)
These are the negative-strand-sequences that need to be reversed and complemented but fit besides that(1, False)
Positive strand sequences that are basically off by 2(-len(am_seq)+1, False)
are positive strand sequences that are off by the length of the amplicon sequence(len(am_seq), True)
need to be reverse and complement and are off by the length of the sequence, but in the other directionEven though there's a pattern showing here, I'm not sure yet where these offsets are coming from. They'd need to be sorted by whether
I'm certain that it has something to do with my workaround to get the positions where a new fragment starts and thus the original position has not been copied (we talked about that in our video call), but finding the actual causes would be the next step.
For the 343 amplicons above, we got the groups
corrections = [(-1, False), (0, True), (1, False), (((-1)*len(am_seq))+1, False), ((len(am_seq)), True)]
with the counts [63, 99, 18, 88, 75]
in total. Further grouping gave me:
[0, 0, 18, 0, 75]
for amplicons being the last on their fragmentand being the positive strand according to the read list[0, 30, 0, 88, 0]
for amplicons being the last on their fragment and being the negative strand according to the read list[63, 69, 0, 0, 0]
for amplicons not being the last on their fragmentWhat's odd is the fact that for the last two positions in those counts, positive and negative seems to be switched. For the 75
in the first list here, the read list says they are on the positive strand, but the script needs to reverse/complement in order to find a match for them.
Same way around for the 88
in the second list: according to the read list, it's the negative strand, but fits if treated like a positive strand.
That's befuddling as is.
But wait.
If I switch the last two positions in the corrections
-list like:
corrections = [(-1, False), (0, True), (1, False), ((len(am_seq)), True), (((-1)*len(am_seq))+1, False)]
I get different counts (the script stops when it finds a match for an amplicon and does not try further corrections):
[63, 99, 18, 85, 78]
total [0, 0, 18, 75, 0]
for amplicons being the last on their fragmentand being the positive strand according to the read list[0, 30, 0, 10, 78]
for amplicons being the last on their fragment and being the negative strand according to the read list[63, 69, 0, 0, 0]
for amplicons not being the last on their fragmentSo all of a sudden we also have amplicons that ARE the last on their fragment, that NEED to be treated like a negative strand, and that are off by their length. We didn't have these 10
amplicons in the previous grouping because they matched elsewhere.
Which leads us back to the question, what's the ground truth (if we have multiple matching possibilities) and how all of this does link with the MDAsim algorithm...
Finally, this issue was solved all of a sudden. Here's the twist:
pos.original == 0
, we infer the value from the preceding basestrand = -
need to be reverted and complemented when you want to align them to the original sequence, but can then be aligned starting from the position of ref
When you run the current version of the positionTesting.py
-script, you'll hopefully get an output like this:
[(0, False), (0, True), (1, False), (<len>, False), (<len>, True)]
SUCCESSFUL matches per correction: [176, 170, 0, 0, 0]
on + strand [176, 0, 0, 0, 0]
on - strand [0, 170, 0, 0, 0]
which means that we FINALLY have what we wanted and we can now go on, either with commenting the code further or passing these positions on to following steps of ePySCOPE.
---EDIT--- Will do a little output clean-up in a minute, but I wanted to commit the working state before it's gone.
The lines of interest were https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L759 https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L798
https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L1029-L1045 https://github.com/hzi-bifo/mdasim/blob/0fba2bf5212e13098e6799e6ce3e57a6a07374ba/src/mdasim.C#L1070-L1086
As you were saying, @VSack, that the positions currently reported by MDAsim seem very inconsistent, we should opt for simply substituting the current
position = <LPOF>
with something likeoriginal_position = <OP>
, as you suggested in an ePySCOPE discussion. Great that you already started on this!