jermp / fulgor

Fulgor is a fast and space-efficient colored de Bruijn graph index.
MIT License

Feature request: create a distinct color for each sequence in the input file #27

Open Malfoy opened 5 months ago

Malfoy commented 5 months ago

Hi @jermp! Could you provide an option similar to Themisto's to handle a single FASTA file containing many sequences, each of which should be treated as a different color?

-e, --sequence-colors Default if the input has just a single sequence file. Creates a distinct color 0,1,2,... for each sequence in the input.

jermp commented 5 months ago

Hi @Malfoy, yes, this could be done in principle. I first have to understand what support (if any) GGCAT offers for this, though. I do not think we are going to implement it soon since our focus is currently on a different aspect, but I am happy to collaborate on this feature if you or your students are willing to try it. Please let me know!

Best, -Giulio

jnalanko commented 3 months ago

Chiming in: In Themisto, this option uses my old construction algorithm that uses my own external memory sorting algorithm. It's very slow compared to ggcat.

While ggcat did not directly support this the last time I checked, in principle it would be possible to split the input into one file per sequence and feed those files to ggcat. But this might create millions of files, so it is not ideal.
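
For illustration, the splitting workaround described above could be sketched as follows. This is a minimal sketch, not code from Themisto, ggcat, or Fulgor: the function name `split_fasta_per_sequence` and the `seq_<i>.fa` naming scheme are hypothetical, and it assumes a plain (uncompressed) FASTA input.

```python
import os

def split_fasta_per_sequence(fasta_path, out_dir):
    """Write each FASTA record to its own file; return the paths in input order.

    Hypothetical helper: the i-th sequence goes to out_dir/seq_<i>.fa, so the
    position of a path in the returned list would play the role of its color.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None  # only one output file is open at any time
    try:
        with open(fasta_path) as fin:
            for line in fin:
                if line.startswith(">"):
                    # A header line starts a new record: close the previous file.
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"seq_{len(paths)}.fa")
                    paths.append(path)
                    out = open(path, "w")
                # Lines before the first header (if any) are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

The returned list could then be written out as a file-of-filenames for whatever builder is used. With millions of sequences this produces millions of small files in one directory, which is exactly the drawback noted above.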

jermp commented 3 months ago

Thanks Jarno for your input. Yes, splitting the file into one file per sequence would be the way to go but, as you said, it looks like overkill. I think one approach would be to keep the indexes as they are and add some metadata (compressed?) to indicate the sequence ids, and not just the color. A sort of hierarchical color scheme, where we have super-colors (the original colors, one per input file) and colors (now, the sequence ids). Does that make sense? CC: @rob-p
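
One possible reading of the super-color / color hierarchy, sketched in Python. This is only an illustration under stated assumptions, not Fulgor's design: the class name `HierarchicalColors` is hypothetical, colors are assumed to be global sequence ids assigned contiguously file by file, and the "metadata" is reduced to a prefix-sum array of sequence counts per file.

```python
from bisect import bisect_right
from itertools import accumulate

class HierarchicalColors:
    """Maps between super-colors (input files) and colors (sequence ids).

    Assumption: sequences are numbered globally in file order, so file i owns
    the contiguous id range [offsets[i], offsets[i + 1]).
    """

    def __init__(self, seqs_per_file):
        # offsets[i] = global id of the first sequence in file i;
        # offsets[-1] = total number of sequences.
        self.offsets = [0] + list(accumulate(seqs_per_file))

    def num_colors(self):
        return self.offsets[-1]

    def color(self, super_color, local_index):
        """Global sequence id of the local_index-th sequence of a file."""
        start = self.offsets[super_color]
        end = self.offsets[super_color + 1]
        if not 0 <= local_index < end - start:
            raise IndexError("sequence index out of range for this file")
        return start + local_index

    def super_color(self, color):
        """File (original color) that a global sequence id belongs to."""
        if not 0 <= color < self.offsets[-1]:
            raise IndexError("color out of range")
        # Binary search for the last offset <= color.
        return bisect_right(self.offsets, color) - 1
```

For example, `HierarchicalColors([3, 1, 2])` describes three files with 3, 1, and 2 sequences: `color(2, 1)` is sequence id 5, and `super_color(3)` is file 1. The sketch covers only the id mapping; how per-sequence occurrence lists would be stored and compressed is left open, as in the discussion.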

rob-p commented 3 months ago

Hi all! @jermp; I think that this will be an increasingly common use-case, so it's worthwhile to figure out a principled way to do it. If the additional metadata results in a final index as efficient as if we had split the original into many files, then this seems like a practical way to go.

In fact, we have a use-case right now where I think fulgor (meta-colored dBG) would be perfect, but all of our input is middle-length sequences in a single file. We want to build an index on one file with ~1M sequences, and another with ~26M sequences, and there's no way we want to split those into 1 file / sequence just to build the index (happy to explain this specific use-case in more detail over e-mail / chat if you'd like).

jermp commented 3 months ago

> If the additional metadata results in a final index as efficient as if we had split the original into many files, then this seems like a practical way to go.

The idea would be to have something even more efficient. Splitting everything would result in many small lists, which would be hard to compress unless they have some special properties (which they might have, since they come from the same file and are hence "correlated" somehow). Always happy to chat, of course!