pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License
204 stars 25 forks source link

Unnecessary modification of data in neighborIterator? #17

Closed lrvdijk closed 4 years ago

lrvdijk commented 4 years ago

In NeighborIterator.tcc, here: https://github.com/pmelsted/bifrost/blob/master/src/NeighborIterator.tcc#L37 the UnitigMap object is modified at each iteration (if it exists in the graph). It's unclear to me, however, why this modification is done? You'd expect the data from find should be correct right?

The reason I'm asking is because the following test case now fails: https://github.com/broadinstitute/pyfrost/blob/master/tests/test_mccortex.cpp#L63

If you iterate over the successors of a unitig, and compare the UnitigMap object obtained through the iterator to a UnitigMap object from a known successor of the same unitig obtained earlier using find, all == comparisons will return false because the len attribute is different.

If I comment out line 37 and 38 in NeighborIterator.tcc my test case succeeds.

Is there a reason why the len and dist attributes get modified?

Thanks!

GuillaumeHolley commented 4 years ago

Hey Lucas,

Thank you for your message. The modification is necessary and your test case failing is actually normal.

In your test case, when you use find(), the UnitigMap object returned is the mapping of a single k-mer, your query, on a unitig of the graph. If you call UnitigMap::mappedSequenceToString() on that object, it is going to print a single k-mer (conditioned to UnitigMap::isEmpty == false) because UnitigMap::len is 1. Now, when you use getSuccessors(), the UnitigMap object returned is always a mapping of the entire successor's unitig on... itself. It makes sense because you want to get the entire sequence of the successor and not the first or last k-mer of that unitig. Hence, if you call UnitigMap::mappedSequenceToString() on it, it is going to print the complete unitig of the successor in the right "direction" (forward or reverse-complement) because the graph is bi-directed. Your test case is failing because you are comparing the mapping of a single k-mer to the mapping of an entire unitig. Yet, if you compare both UnitigMap::referenceUnitigToString(), you will see that both mappings are to the same unitig.

The code located here https://github.com/pmelsted/bifrost/blob/master/src/NeighborIterator.tcc#L37 shows that internally, when we query for the successors or predecessors of a unitig, we actually query for k-mers in the graph (each unitig has only 4 possible successors and predecessors). And because the UnitigMap objects returned are mapping of a single k-mer, we must modify them such that the mapping matches the entire unitig.

I hope this is a little bit more clear but let me know if it is not and I will try to explain this a little bit better ;)

Guillaume

lrvdijk commented 4 years ago

Aha, thanks, seems my understanding was incomplete! Thanks for the explanation, this clears a lot up!

GuillaumeHolley commented 4 years ago

No problem ;) I'm closing this issue then.