Fragments that cannot be aligned with reference

VSack commented 6 years ago

I re-ran the test-script a few times. There are still fragments that cannot be aligned with the reference sequence. Since we changed nothing but the header information but left the sequences untouched, I assume this bug (?) or issue (?) or feature (?) has been present the whole time.

dlaehnemann commented 6 years ago

Thank you for checking and for raising this. Seeing from your reference to the issue in epyscope, this seems to be something really rare. So our output should be usable, but this is nevertheless troubling, as it might indicate some deeper bug. So I think my suggestion here would be:

Investigate this a bit further, while your memory of it is still fresh, and document this here in more detail (what commands you ran, what output you got). If this indicates a quick fix, great.
If not, we will leave this open as an issue here, for now, and continue with epyscope development. But we will then have a clearer documentation here, so coming back to it will be easier.

One general thought: I guess optimally at some point, someone should write unit tests for mdasim, although I don't currently see that we have the time in the short term -- but long-term this could also save a lot of time...

VSack commented 6 years ago

So I took the positionTesting.py-script and I ran a couple of tests on the output of the ePySCOPE-pipeline. Here are my findings so far:

All unaligned fragments are on the negative strand.
For all these fragments, their position and their length would cause them to go beyond the scope of the reference sequence
Error rates are highly variable. I got ratios from 2:3961 to 120:4230 out of 6 test runs on different datasets only.

Next I'll test if there are fragments, where the position and length would cause the fragment to go beyond the end of the reference sequence, but that can be aligned anyway (this shouldn't happen but if I learned something here, it's that we cannot test things that shouldn't happen enough...). If not, we still don't know where the error comes from, but we can catch it in MDAsim and throw away the faulty sequences.

VSack commented 6 years ago

It seems that those fragments where len(fragment) + pos(fragment) > len(ref sequence) is true, are the same that are unaligned. That means, if I'm not completely wrong, that we can test this at runtime and skip the faulty sequences. In this case however, the coverage requested by the user might not be reached in the output.

dlaehnemann commented 6 years ago

Thanks for checking these things in so much detail and reporting this here. This looks like a consistent picture, now: For some reason, current mdasim can copy beyond the actual reference sequence. As these are only rarer fragments, I wouldn't worry about them too much and filter them (externally) for now.

If we go about fixing this later, here is one consideration:

What is copied beyond the end of the reference sequence?
- Is it the start of the reference sequence? This might then be a feature, not a bug: Maybe the authors considered genomes to be circular, which is true for bacterial genomes on which MDA was originally done on. In this case we would need to turn this into an option whether the input is a circular DNA fragment or a linear one.

dlaehnemann commented 6 years ago

As you stated over at epyscope, @Vsack, we will actually have to solve this bug after all to get correct logging output.

As you stated, the next step should be to check whether the overflowing fragment part aligns anywhere, most probably to the start. If that is the case, we can infer that the original authors assumed circular genomes and should introduce an option to choose between circular and linear starting DNA. If not, I would just go for a fix assuming linear DNA and make sure no fragments beyond the starting DNA are generated.

VSack commented 6 years ago

I dove a little deeper into this and found a pattern. I have no idea where in the code this stems from, but it's something. If I do the following steps, the fragments in question can be aligned to the reference:

https://github.com/hzi-bifo/mdasim/blob/a95bd2b76c59ff003a31ce0d84fca5421cd7d770/examples/testing/positionTesting.py#L218-L233

So it seems that, when the copying process reaches the end of a reference sequence, the fragment that is generated at that moment is folded and copied backwards on itself or something like that. I'm sure a biologist could explain this a little more detailed.

The split fragments however do NOT align with the beginning of the reference - so no circles.

dlaehnemann commented 6 years ago

As discussed, this once again gives a clearer picture of the bug at hand. So I guess we should go bug-hunting in the code, now. As we are doing the error-logging on the fly, I think this will have to be fixed somewhere in the nucleotide-copying code, even though this might be more difficult than in the fragment cutting code at the end...

VSack commented 6 years ago

As we discussed last week, the fragments seem to be folded and copied backwards on itself. However, looking at the code, it seems like the sequence is some kind of continuous, all-sticking-together data structure which is only split at the very end during the output generation, based on whether or not it is 'occupied', e.g. still attached to its reverse sequence.

Thus, I have the feeling that preventing the software from "folding" the fragments (I still haven't found that piece of code) would cause some problems:

finding the source
maybe changing the data structure
coverage specified by the user might not be met if we terminate the copying process at the end of a fragment

Here's a bug fix that I came up with today: Split the "folded" fragments during output generation when the strand of the base changes!

Pros & Cons: + no unaligned fragments any more (I tested this locally and it works!) + no further code fumbling + coverage remains the same - a few "short" fragments, a few more fragments - we still don't really understand why this is happening or how \~ I'm not sure about the scientific accuracy, considering that we get more fragments, e.g. fragments that are not started by attaching a primer.

This would be a little bit of a quick'n'dirty solution but it gives a cleaner output than what we get now and we could continue with ePySCOPE.

dlaehnemann commented 6 years ago

Thanks for further investigating this.

To me, the problem sounds like some side-effect of copying beyond the end of the starting sequence in the data-structure, rather than something intended. So I am not sure, whether maybe this might even create extra coverage. But I guess we don't need to worry about +/- 1 in target coverage, anyways. What we want to make sure of, is that we have consistent output.

So while I agree that your solution is quick'n'dirty and will yield (very few) fragments that are not really warranted biologically, I also agree that it is probably good enough. And who knows, maybe we'll reimplement the whole thing in a different language (e.g. Rust;) at some point -- then we can make sure everything is correct and tested from the ground up. But for now let's try your current idea!

We will just have to make sure, that the output is then consistent, i.e.:

We need to get the starting position and the length of both the resulting fragments from such a "folded" fragment right. Not sure if that is easily caught and calculated in the splitting process?
We need to get the positions of introduced errors in the part beyond the starting sequence (i.e. beyond the "folding" point) right. As the errors are added to the log during the original copying (and not during the splitting), this probably requires an extra check for being beyond the original sequence length right there, and then projecting positions correctly before writing them? Am I explaining this understandably?

VSack commented 6 years ago

Consistency is ensured from all I see: The sequences might be attatched to each other for some reason (so positive strand - negative strand are connected (I think), that is why we get these weird fragments), BUT the base objects are still correct, which means, they yield the correct position relative to the reference strand and the direction of the strand they are on.

Reminder: Bases include a https://github.com/hzi-bifo/mdasim/blob/a95bd2b76c59ff003a31ce0d84fca5421cd7d770/include/mdasim.h#L65 that holds this information.

My workaround is pretty simple and I will upload it when I'm back in the office... Here, the software generates the fragment by pushing base by base into the read until a base is "occupied": https://github.com/hzi-bifo/mdasim/blob/a95bd2b76c59ff003a31ce0d84fca5421cd7d770/src/mdasim.C#L1014-L1024

In line 1020, I added a check if this current base and the following base are on the same strand. If so, the next base is added to the fragment, if not, the occupied-value changes and the fragment is completed.

There's also a similar code snippet in the same method dealing with the occupied part of the fragment, that has to be changed in the same manner.

The positionTesting.py-script checks whether the parameters of the fragment (position, strand) allow a direct alignment, which works.

The correct length of the fragment is ensured by https://github.com/hzi-bifo/mdasim/blob/a95bd2b76c59ff003a31ce0d84fca5421cd7d770/src/mdasim.C#L1049 So position, length and strand should be correct.

hzi-bifo / mdasim

Fragments that cannot be aligned with reference #20

Reminder: Bases include a https://github.com/hzi-bifo/mdasim/blob/a95bd2b76c59ff003a31ce0d84fca5421cd7d770/include/mdasim.h#L65 that holds this information.

There's also a similar code snippet in the same method dealing with the occupied part of the fragment, that has to be changed in the same manner.