pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

Can we also extract the location of the unitigs? #35

Open rickbeeloo opened 3 years ago

rickbeeloo commented 3 years ago

I just took a look at the GFA output and noticed that, unlike tools such as seqwish, Bifrost does not output the paths corresponding to each node. Is it possible to obtain the locations of each of the unitigs in the original input sequences?

GuillaumeHolley commented 3 years ago

Right now, this is not available out of the box in the binary but I could implement that if there is a need for it.

rickbeeloo commented 3 years ago

I think it would be an awesome addition as this would allow us to quickly identify similar regions between genomes - like local MSAs - which for large sequence collections is infeasible without a graph-based approach. With the current implementation, we do not know the origins of the unitigs.

BlastFrost: Couldn't this be implemented on top of BlastFrost, since it does output coordinates? I just saw the same suggestion here: https://github.com/pmelsted/bifrost/issues/3.

Pyfrost: I also took a look at pyfrost; however, I'm not sure whether it would be possible using this tool. pyfrost only seems to record the positions of the individual k-mers within the unitig rather than extending this to the boundaries of the unitig, i.e. rows like: unitig, seq_id, start, end
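To make the requested `unitig, seq_id, start, end` rows concrete, here is a minimal sketch of how per-k-mer hits could be collapsed into unitig intervals. It does not use the pyfrost or Bifrost API: the input format (one `(seq_id, pos, unitig_id)` tuple per k-mer) and the function name `kmer_hits_to_intervals` are assumptions for illustration, and strand and the k-mer's offset inside the unitig are ignored.

```python
def kmer_hits_to_intervals(hits, k):
    """Merge consecutive k-mer hits on the same unitig into intervals.

    hits: iterable of (seq_id, pos, unitig_id), one tuple per k-mer, where
          pos is the 0-based start of the k-mer in the input sequence.
          (Hypothetical format, not what any tool emits.)
    Returns a list of (unitig_id, seq_id, start, end) with end exclusive.
    """
    intervals = []
    for seq_id, pos, unitig in sorted(hits):
        last = intervals[-1] if intervals else None
        # Extend the open interval only if this k-mer starts exactly one
        # base after the previous k-mer on the same unitig and sequence.
        if (last and last[0] == unitig and last[1] == seq_id
                and last[3] == pos + k - 1):
            last[3] = pos + k
        else:
            intervals.append([unitig, seq_id, pos, pos + k])
    return [tuple(row) for row in intervals]
```

A unitig that is visited twice by the same sequence yields two separate rows, which is the behaviour one would want for repeats.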

lrvdijk commented 3 years ago

That's a disadvantage of de Bruijn graphs in general: you lose the navigational data needed to reconstruct the original sequences that went in.

Our group is planning to implement links on top of Bifrost, which will help with that. Expect an early release at the end of this year, or maybe early next year.

rickbeeloo commented 3 years ago

@lrvdijk aah interesting! Aren't the origins of the unitigs encoded in the color binary though? Or is only the presence of the individual k-mers within the unitigs recorded?

ekg commented 3 years ago

> That's a disadvantage of de Bruijn graphs in general: you lose the navigational data needed to reconstruct the original sequences that went in.
>
> Our group is planning to implement links on top of Bifrost, which will help with that. Expect an early release at the end of this year, or maybe early next year.

Nice! This will be really useful. I was wondering why the method hadn't caught on.

GuillaumeHolley commented 3 years ago

Hey everyone,

I'm adding the feature to my todo list :) To answer your question @rickbeeloo, only the presence/absence of the individual k-mers is recorded in the color file.

rickbeeloo commented 3 years ago

Hi @GuillaumeHolley @lrvdijk, would it (for now) be possible to get the k-mer/unitig paths along the input genomes by querying the first part of the genome - let's say 5kb - and then traversing all edges (with the input genome color) until there are no edges left (i.e. the end of the genome)?
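The traversal idea can be sketched on a toy, uncompacted de Bruijn graph. This is not the Bifrost API: `build_kmer_colors` and `walk_colored_path` are hypothetical helpers, only the forward strand is considered, and colors are plain presence/absence sets per k-mer. The walk stops as soon as more than one successor carries the genome's color, which is what happens at repeats when only presence/absence is available.

```python
def build_kmer_colors(genomes, k):
    """Map each forward-strand k-mer to the set of genomes containing it."""
    kmer_colors = {}
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer_colors.setdefault(seq[i:i + k], set()).add(name)
    return kmer_colors


def walk_colored_path(kmer_colors, start, color, max_steps=1_000_000):
    """Follow successors carrying `color`, starting from k-mer `start`.

    Returns (path, status) where status is "dead_end" (no colored
    successor), "ambiguous" (several colored successors) or "step_limit".
    """
    path = [start]
    for _ in range(max_steps):
        suffix = path[-1][1:]
        successors = [suffix + base for base in "ACGT"
                      if color in kmer_colors.get(suffix + base, ())]
        if not successors:
            return path, "dead_end"
        if len(successors) > 1:
            return path, "ambiguous"
        path.append(successors[0])
    return path, "step_limit"
```

For `{"A": "ACGTTGCA", "B": "ACGTAGCA"}` with `k=4`, walking color `A` from `ACGT` recovers all five k-mers of genome A and ends in a dead end. For a sequence with a repeated (k-1)-mer such as `ACGACT` with `k=3`, the walk stops as ambiguous after `GAC`.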

ekg commented 3 years ago

You could also map the unitigs back to your reference graph. This should tell you their locations.

On Fri, Oct 2, 2020 at 12:23 PM rickbeeloo wrote:

> Hi @GuillaumeHolley @lrvdijk, would it (for now) be possible to get the k-mer/unitig paths along the input genomes by querying the first part of the genome - let's say 5kb - and then traversing all edges (with the input genome color) until there are no edges left (i.e. the end of the genome)?


rickbeeloo commented 3 years ago

@ekg Sorry, I'm not sure what you mean by "reference graph". We can get the unitigs from the Bifrost graph (e.g. via unitig-caller) and map them to the original input genome sequence(s), but you are talking about a graph?

ekg commented 3 years ago

Sorry, I meant reference genome. (I'm used to working with reference graphs.)

rickbeeloo commented 3 years ago

@ekg This indeed would work for large unitigs that can be unambiguously mapped - thus mostly highly conserved or accessory genes between a set of genomes. However, for genes in the middle (i.e. shared regions but variable parts) the unitigs will be short and linked via k-mers with different colors that cannot be unambiguously mapped to the reference genomes.
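As a rough illustration of the mapping approach and its ambiguity problem, the sketch below locates a unitig in the input genomes by exact string matching on both strands. It is a stand-in for a real aligner, not something Bifrost or unitig-caller provides; the helper names are hypothetical. A unitig with more than one hit cannot be placed unambiguously, and a unitig may also have no full-length hit at all.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """Start positions of every (possibly overlapping) exact match."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def locate_unitig(unitig, genomes):
    """Return (seq_id, start, end, strand) rows for exact matches.

    genomes: dict mapping seq_id to sequence. End is exclusive.
    """
    rows = []
    rc = reverse_complement(unitig)
    for seq_id, seq in genomes.items():
        for pos in find_all(seq, unitig):
            rows.append((seq_id, pos, pos + len(unitig), "+"))
        if rc != unitig:  # skip palindromes to avoid duplicate rows
            for pos in find_all(seq, rc):
                rows.append((seq_id, pos, pos + len(unitig), "-"))
    return rows
```

Short unitigs, as in the variable regions described above, tend to return several rows, and nothing in the rows themselves says which occurrence belongs to which traversal of the graph.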

bioinformagica commented 2 years ago

Hello everyone, is this feature already implemented? This would be very cool.

GuillaumeHolley commented 2 years ago

As of now, this is not implemented in Bifrost, but I have been thinking more and more about doing it soon for reference sequences. I will push this to the top of my todo list.

bioinformagica commented 2 years ago

Hi, thanks for the quick reply, I'm really glad to hear it!

For a project I'm doing, I have to do a lot of gene alignments to create MSA gene reference plots using vg construct. My idea is to skip the slow process of doing gene alignments and create GFA files directly from the unaligned gene multifasta with bifrost build. But to do that, the final graph must have path information so I can do cool things like extract node depth, calculate the distance between paths and create presence/absence tables of nodes.
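To show why path information is enough for these downstream analyses, here is a small sketch that reads GFA1 `P` lines and derives node depth (how many path steps touch each segment) and a presence/absence table. It assumes a GFA that already contains paths, which Bifrost does not write at the time of this thread; the function names are made up for illustration.

```python
from collections import Counter


def parse_paths(gfa_text):
    """Return {path_name: [segment_name, ...]} from GFA1 P lines."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "P":
            # Each step looks like "12+" or "7-"; drop the orientation.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return paths


def node_depth(paths):
    """Count how many path steps visit each segment."""
    depth = Counter()
    for steps in paths.values():
        depth.update(steps)
    return depth


def presence_absence(paths):
    """Return {segment: {path_name: bool}} for every segment on a path."""
    nodes = sorted({seg for steps in paths.values() for seg in steps})
    return {seg: {name: seg in steps for name, steps in paths.items()}
            for seg in nodes}
```

Segments that no path visits do not appear in either result; adding them would require reading the `S` lines as well.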

GuillaumeHolley commented 2 years ago

Nice, glad to hear you have a cool project in mind. I started an implementation prototype in Bifrost and one question came up. According to the GFA1 spec, the Path line contains as its 4th field an optional comma-separated list of CIGAR strings, which can just be a * (basically no CIGAR provided). Do you need the CIGAR strings? I could do it but it would make everything a lot more complicated. I am just wondering if this would be needed for the common use case and would justify the extra computation time.
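For reference, a GFA1 Path line has four tab-separated fields: the record type `P`, the path name, the comma-separated oriented segment names, and the overlaps field. The sketch below formats such a line and falls back to `*` when no CIGAR strings are given. The function name `gfa_path_line` is hypothetical and this is not how the Bifrost prototype does it.

```python
def gfa_path_line(name, steps, overlaps=None):
    """Format one GFA1 P line.

    name:     path name, e.g. the input sequence identifier
    steps:    list of (segment_name, orientation) with orientation "+" or "-"
    overlaps: optional list of CIGAR strings; None writes "*"
    """
    segments = ",".join(f"{seg}{orient}" for seg, orient in steps)
    overlap_field = "*" if overlaps is None else ",".join(overlaps)
    return "\t".join(["P", name, segments, overlap_field])


# Example: a path over three unitigs, the second one traversed in reverse.
print(gfa_path_line("genomeA", [("1", "+"), ("2", "-"), ("3", "+")]))
```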

bioinformagica commented 2 years ago

Thanks!!

> Nice, glad to hear you have a cool project in mind. I started an implementation prototype in Bifrost and one question came up. According to the GFA1 spec, the Path line contains as its 4th field an optional comma-separated list of CIGAR strings, which can just be a * (basically no CIGAR provided). Do you need the CIGAR strings? I could do it but it would make everything a lot more complicated.

No, I don't need the CIGAR strings; the * would be fine.

> I am just wondering if this would be needed for the common use case and would justify the extra computation time.

Yeah, maybe path info is not what most people want. Maybe path info could be added with an optional --add-paths arg?