jermp / fulgor

Fulgor is a fast and space-efficient colored de Bruijn graph index.
MIT License
43 stars 8 forks source link

How does fulgor handle multi-mappers? #15

Closed jeremymsimon closed 11 months ago

jeremymsimon commented 1 year ago

Hi- I was curious if you could help me understand what happens in the case of multi-mappers, ie a given sequence that appears multiple times in the same reference. In your example with 4,546 Salmonella genomes, my guess is that there are repetitive sequences within a given genome - is the pseudo-alignment only capable of saying whether or not a given sequence occurred (n >= 1 times) or is it possible to know the exact number of times there was a match within a genome, or even more specifically, where the match(es) occurred?

rob-p commented 1 year ago

Hi @jeremymsimon,

Fulgor (the current version at least) is a purely color-based index. Hence, there is no positional information stored about where a sequence's constituent k-mers map. Therefore, the interpretation of a mapping from fulgor is exactly as you suggest — it signifies that the sequence likely occurred (n >= 1 times) in the reported reference. It would potentially be possible to think about indexing abundance information (i.e. number of occurrences of a unitig in each reference) or approximate position information, but the current index is not focused on that use case. Is there an application where e.g. the number of occurrences would be useful to you? Or the approximate positions? If you want the full positions, then I think you'd have to fall back to a full positional index, which is quite powerful, but generally much larger.

Best, Rob

jeremymsimon commented 1 year ago

Hey @rob-p! For the particular application I'm playing around with, knowing if a given sequence matches at all is certainly useful, but it would be great to know how many times it matched within a certain number of mismatches. I was able to install and run fulgor easily and was quickly able to do some testing though, so it seems great in that regard!

I played around with pufferfish also but given my application, I couldn't get an index generated efficiently without downsampling the references first. I have for now settled on an approach to pre-select genomic intervals of interest, then creating an index either per individual, or one mega-index with each individual coded as colors in the ccdBG (ie columns in the fulgor output)

Also, this may be easiest to describe via a call to step through some data, but both fulgor and pufferfish seem to occasionally miss known occurrences of sequences in a somewhat kmer-length-dependent fashion. Briefly, most known matches are found easily, but some may drop out at k=X and reappear at k=X+2 as though the seed got split and thus not found. This could be an artifact of my short query sequences and thus short k-mers in the index (k=11 or 13), so if this is an expected phenomenon given that, it would be good to know if there are any possible workarounds without introducing too many spurious matches.

If fulgor were able to account for multiple matches and we were able to work around the above issue, I think it may be perfect for my specific use here. I'm working on a minimally reproducible example dataset as well and can share that when it's ready

Thanks!

rob-p commented 1 year ago

Hi @jeremymsimon,

Very interesting! We'd certainly be interested in investigating any minimal reproducible examples you find. There is a filtering that happens with highly-repetitive seeds in pufferfish, but there shouldn't be in fulgor (depending on the type of mapping options you pass). That is, if you use "strict intersection" or "threshold union" methods, they are the naive implementations and don't otherwise have any filtering or thresholding of repetitive seeds.

If you're interested in a more memory efficient index that supports positional information, you may be interested in looking at piscem — which is very much like pufferfish swapping out the previous representation of the k-mer to unitig mapping with sshash. However, by virtue of having to store all of the positional information, it will still likely be substantially larger than fulgor.

Best, Rob

jermp commented 1 year ago

Hi @jeremymsimon, thank you for your interest in Fulgor! It looks like you would perhaps need Pufferfish2 if you're interested in positional information (exact kmer location) rather than just its color, although I would first runs Fulgor to see its performance. As @rob-p mentioned, we're more then happy to assist you with a smaller dataset at hand! Please, let us know.

Cheers, -Giulio

jeremymsimon commented 1 year ago

Thanks @jermp and @rob-p - I'll hopefully have a small scale dataset to illustrate this later this week!

jermp commented 1 year ago

Hi @jeremymsimon, any news regarding the dataset?

jeremymsimon commented 1 year ago

Yes, so sorry for the delay. I've attached here a directory containing a set of query (queries.fasta) and target sequences (*.minExampleSubset.fasta), as well as the code I've used in testing for each of pufferfish, fulgor, vargas, and blat (*.sh). I've been doing a bit of a bake-off comparing and contrasting the different methods to find which may be the most appropriate choice here given our question but also the limitations of having such short queries (and thus, very short k-mers)

The blat output contains what I'd consider the "expected" case, and I'm comparing the results of the other methods to that and noting how and where they differ.

Generally:

My hope is that this setup is mostly self explanatory once you dive into the directory and scripts, but absolutely feel free to ask here or by email if you'd like me to clarify anything or provide more context.

Thanks in advance!

MinimalReprex_081423.tgz

jermp commented 11 months ago

Hi @jeremymsimon and sorry for the late reply but it's a very busy period here (as usual!). Your setup looks very interesting and, as we already said, we're both (I and @rob-p) happy to assist you in your research if you use Fulgor.

I do not know exactly what you're searching for but I can try to answer some questions:

fulgor may also be heuristic in that it can't guarantee to find a match (?).

Fulgor might have false positives but never false negatives. That is, a sequence might be falsely reported as a match because its kmers are found in the index whereas it does not really occur in any of the indexed genomes. This is a limitation of all kmer-based methods (not only of Fulgor). This is why a longer kmer length is probably useful for. You mentioned you use k=11 or 13, but we would use k=28 or 31 (current default value).

But Fulgor will never have false negatives, i.e., given a true positive query, all its kmers will be found in the index, hence the query will always be reported as present in the index.

when the sequence gets bisected/split across two k-mers, thus you sometimes get a match at certain k but miss it at other k.

Mmh I'm not sure I understand what you mean by "the sequence gets split across two k-mers". Can you please clarify or provide a toy example?

As Rob mentioned, pufferfish can be used with some filtering option, but we don't use any filtering in Fulgor, hence any indexed kmer must always be found at query time.

Hope this clarifies a bit! Please let me know.

Best, -Giulio

rob-p commented 11 months ago

Hi @jeremymsimon,

Just some notes about what you're likely seeing in pufferfish; it implements several heuristics to avoid having to repeatedly handle highly-repetitive matches during chaining. This can lead do dropping some hits if they occur in super-repetitive regions. At some point, we exposed variables to tweak these from the command line, but they don't have highly informative names (as they were more meant for developer tweaking than end-user use).

Regardless, if it turns out you really do need positional information for your hits, then one place to look might be piscem-cpp and the piscem wrapper. This is our next-generation positional index, and it builds upon @jermp's excellent sshash as the k-mer to unitig mapping, and so it's considerably more memory frugal than pufferfish. We are actively developing and improving it, and so are happy to consider feature requests that might help you.

If you don't need positional information, as Giulio suggests, there is no filtering currently done in fulgor. However, you won't get positional information (it doesn't compute this), and it also doesn't keep track of how many times in each reference a given k-mer occurs. Specifically, fulgor considers the "color" of a unitig as the distinct set of references in which the unitig occurs, regardless of the number of occurrences. This is what it considers when performing the mapping.

Finally, I'll add one other point to the excellent description Giulio gives above. When you are looking up the mapping information for reads that consist of more than one k-mer, then the result you get can depend heavily on which specific mapping algorithm you use. The strict pseudoalignment algorithm requires that, to report a reference r_i as being a mapping for the read, then for every k-mer k_j that occurs in the index at all, k_j must be contained within r_i. That is, if you have a read where 100 k-mers match the index, and r_i contains 99 of them, but the last k-mer only maps to, say r_{i+1}, then your read will not be reported as a mapping to r_i (and may not be reported as mapping at all unless the other 99 k-mers also match to r_{i+1}. The non-strict intersection lets you control the threshold of k-mers that match in the index that must be present in a reference to report that reference as a mapping for the read.

--Rob

jeremymsimon commented 11 months ago

Thanks @rob-p and @jermp ! This is helpful. So my search has relatively short queries, on the order of 20-25bp as opposed to typical applications (e.g. RNA-seq reads) that are longer. Do you still recommend a k-mer length of 28 or 31? It isn't explicitly mentioned this way, but it seemed there was a pattern that the recommended k-mer length was roughly half the size of the query sequences (ie 28-31bp works for 50-75bp reads) but maybe that was me over-interpreting. And can you similarly provide guidelines for the minimizer parameter as I don't see this recommendation anywhere?

In the files I shared, there were example queries that behaved or didn't behave as we expected, and some variability from algorithm to algorithm, but before saying more it may be due to a poor choice in k-mer/minimizer lengths so it would be helpful to clarify that first.

If a known match is found with 100% confidence, then we likely can get away without knowing the locations, and in fact this may be preferable for our applications as it will obscure any potentially identifiable information in the results.

We do however need to know if a given match occurred more than once, and ideally how many times within a certain distance/number of mismatches. It sounds like fulgor can't do that based on your description, unfortunately, but my guess is a lot of the above (including best practices for k-mer length) will apply to pufferfish and piscem as well

Thanks!

jermp commented 11 months ago

Hi @jeremymsimon,

So my search has relatively short queries, on the order of 20-25bp as opposed to typical applications (e.g. RNA-seq reads) that are longer.

Well, I'd say 20-25 bp is even shorter than a single kmer....so kmer-based indexes using k < 20 might yield poor results in terms of accuracy. For the minimizer length: actually there is no good rule. It does really depend on the data you are indexing. Usually I use m=17 for less fragmented dBGs and m=19 or 20 for highly fragmented dBGs but that is for k=31 :) I never tried k=20, but I guess something like m=13 could work well.

We do however need to know if a given match occurred more than once, and ideally how many times within a certain distance/number of mismatches. It sounds like fulgor can't do that based on your description, unfortunately, but my guess is a lot of the above (including best practices for k-mer length) will apply to pufferfish and piscem as well.

Correct, Fulgor does not encode any multiplicity, whereas pufferfish2 or piscem have everything :)

jeremymsimon commented 11 months ago

Thanks @jermp - so is the general rule of thumb for the k-mer size then to be as long as your shortest possible query length? ie if I have a range of query sizes from 20-25bp, you'd recommend k=20 but if it was a range of 25-30 you'd recommend k=25?

jermp commented 11 months ago

You're very welcome @jeremymsimon!

so is the general rule of thumb for the k-mer size then to be as long as your shortest possible query length?

I do not know if it is a rule of thumb but I'd say "no" because, in general, one builds an index without assuming specific query workloads nor distributions.

if I have a range of query sizes from 20-25bp, you'd recommend k=20 but if it was a range of 25-30 you'd recommend k=25?

For these values of k yes, it is a reasonable choice. But I'm surprised you have so short queries! In general, if your shortest query length is, say, 200, I'd not recommend k=200 but k=31 or k=63.

jeremymsimon commented 11 months ago

Thanks @jermp and apologies for the delay, I was out of town. I'm building this all for a very specific gRNA sequence screen, so the queries are short and fixed at that length, rather than a typical usage for sequencing reads or similar.

I will repeat my testing with k=20, and try your other suggestions above (including piscem and others). I'll close this for now and reopen if I have any further questions!

Thanks for all of your help!

jermp commented 11 months ago

No problem at all @jeremymsimon. Keep us posted, we're both very curious about this experimentation. Best, -Giulio