[Q] Query sequences that are kmers of length 31

Louis-MG commented 1 year ago

Hello, I am trying to query kmers in 150 genomes, plus 4 files corresponding to kmers that are unique to 4 of these genomes. My queries already have a length of 31. I thought that I would not need to change the -e parameter, since the query sequences would be decomposed in kmers of length 31, wich results in the query themselves. I thought the outcomes would be:

the kmer from the query (aka the query itself) is absent: <0.8 of the kmers from the query are present: query is absent: 0
the kmer from the query (aka the query itself) is present: >0.8 of the kmers from the query are present, query is present: 1

But then the output.tsv contains only 1's, which is impossible, at least because a kmer unique to a genome cannot be found in other genomes. I checked if the unique kmers were actually unique, it seems that they are. I ran again my script and changed the -e 0.8 to -e 1. I now have a majority of 0's and kmers that are present in a kmers_unique_to_genome_X are also present only in the genome_X, which makes sense,

I feel like I misunderstood the option -e. I thought it means that if 80% of the kmers from the query are found, then it is present. If there is only 1 kmer in the query, does that parameter allow for an approximation of the kmer sequence ? E.g., it would allow for 6 mismtaches ? (0.2*31=~6)

GuillaumeHolley commented 1 year ago

Hi @Louis-MG,

Sorry about getting back to you about this issue only now, somehow I have missed the notification about this.

Your interpretation of the -e parameter is correct, you did not misinterpret it: using the default -e 0.8 means that if 80% of the kmers from the query are found then the query is present.

I suspect what is happening here is that Bifrost always round down the number of k-mers to find per query so finding 80% of a single k-mer would be rounded down to finding at least 0 k-mer... So when the query/k-mer is not found, it hits the minimum threshold for being found which obviously does not make any sense. I'll fix this in the next release of Bifrost coming soon. In the meantime, using -e 1.0 for queries that are k bases long will yield the expected result.

I'll keep the issue open and notify you when the new release is up.

Guillaume

GuillaumeHolley commented 1 year ago

Alright so the upcoming Bifrost version will be rounding to the nearest integer for the number of k-mer to find in a query. Which means that -e 0.8 on a single k-mer query should reports the query as present only if the k-mer is found (it is "rounded up" to one k-mer to find). However, using -e 0.3 would still round down which makes sense for multi-kmer queries but not for single-kmer queries (it does not makes sense to report a query as being found if its only k-mer is not found...). So there will be an additional rule which is that a query can only be reported as present if at least one of its kmer is found.

Louis-MG commented 1 year ago

Thank you for the answer !

Outside the scope of this issue, what other feature do you bring with the next release ?

GuillaumeHolley commented 1 year ago

Next release will mostly bring memory and speed improvements with up to 30% memory usage decrease on "larger" input (e.g. HG002 30x Illumina FASTQ but also HG38 FASTA). This will come at the expense of using a little bit of tmp disk and using a little bit more memory om small input dataset (the ones usually taking <1GB of RAM). That aside, the search subcommand will offer the option to report either the number of k-mers found per query or the ratio of found k-mer per query (rather than reporting a query as "present" or "absent" based on the usage of -e). The library libdivide will be replaced with library fastmod so compilation warnings that occur when compiling the current version will be gone too. The rest is a lot of code refactoring.

You can try it now if you would like, it is on the bifrost-robin branch. I'll merge it with main very soon. Note that this new version will break index file compatibility with previous versions (so this new version won't be able to read an index file (.bfi) generated prior to this new version).

Louis-MG commented 1 year ago

Thanks, closing the issue now !

pmelsted / bifrost

[Q] Query sequences that are kmers of length 31 #67