pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

Missing k-mers? #64

Closed rwittler closed 1 year ago

rwittler commented 1 year ago

When building a graph with option -s, i.e., considering only k-mers that appear at least twice, the final number of k-mers differs from what I get with other tools. For instance, on the dataset SRR7167957 from the SRA, I get:

Bifrost: 6745326084 ("CompactedDBG::filter(): Found 6745326084 unique k-mers")
KMC3: 6752490849
SANS: 6752490849

I used the current version (c156741) with the following command:

Bifrost build -v -s input.txt -o reads.bifrost -t 16 --colors

GuillaumeHolley commented 1 year ago

Hi Roland,

Long time no see :)

The log message of CompactedDBG::filter() is only an approximation of the number of unique k-mers in the input dataset; it is not the final (exact) number of k-mers in the graph. I agree the message should clearly state that it is an estimate, and I'll change it in the next release.

Could you confirm the final number of k-mers in the graph with the following command (k must be set to your k-mer length, here 31): awk -v k=31 '{if ($1=="S"){SUM+=length($3)-k+1}} END {print SUM}' graph.gfa
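For readers who prefer something more explicit than the awk one-liner, here is a minimal Python sketch of the same count. It assumes a GFA file whose segment lines have the form `S <id> <sequence> ...` separated by tabs, and that every k-mer of the compacted graph occurs in exactly one segment. The function name `count_kmers_in_gfa` and the toy input are illustrative only and are not part of Bifrost.

```python
def count_kmers_in_gfa(lines, k):
    """Sum len(sequence) - k + 1 over all GFA segment (S) lines."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Segment lines look like: S <id> <sequence> [optional tags...]
        if len(fields) >= 3 and fields[0] == "S":
            # A sequence of length n contains n - k + 1 k-mers (none if n < k).
            total += max(0, len(fields[2]) - k + 1)
    return total


# Toy input (hypothetical, not real Bifrost output).
example = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGTACGTAC\n",    # 10 bases -> 10 - 4 + 1 = 7 4-mers
    "S\t2\tTTTTT\n",         # 5 bases  -> 5 - 4 + 1 = 2 4-mers
    "L\t1\t+\t2\t+\t3M\n",   # link lines are ignored
]
print(count_kmers_in_gfa(example, k=4))  # prints 9
```

On a real graph, the same function can be fed an open file handle, e.g. `count_kmers_in_gfa(open("graph.gfa"), k=31)`.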

Guillaume

GuillaumeHolley commented 1 year ago

Hi @rwittler,

I'll close this for now but don't hesitate to reopen if need be.

Guillaume

rwittler commented 1 year ago

Hi Guillaume,

Thank you very much for your quick answer, and sorry for the late response. My family's X-mas break was extended by SARS-CoV.

I tried the awk command that you suggested. It returned 417501438.

I also used the bifrost option to produce a fasta file and counted the 31-mers in there. Both KMC3 and SANS return the same value as above.

But this is quite different from the number of k-mers (appearing at least twice) reported by both KMC3 and SANS on the original input: 6752490849

The values differ by a factor of 16.

Any idea what else I could try?

Thanks, Roland

GuillaumeHolley commented 1 year ago

Hi Roland,

No worries :) I'll download the SRR7167957 and let you know what I find.

Guillaume

GuillaumeHolley commented 1 year ago

Hi Roland,

I think you might have misinterpreted the KMC3 output: the correct number of k-mers occurring twice or more in the dataset is indeed 417501438, as reported by both Bifrost and KMC3.

First of all, from the SRA website, I could see that the dataset is 8.4 Gbp across 1.8M reads. A read of length L contains L - 30 31-mers, so there are 8.4 Gbp - 1.8M reads * 30 = 8.346 billion 31-mers in the dataset. This means that in the absolute best case, there can only be 4.173 billion distinct k-mers occurring twice or more in the dataset, which is much less than 6,752,490,849.
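The bound above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the rounded figures quoted from the SRA page (8.4 Gbp, 1.8M reads) and assumes every read is at least k bases long.

```python
total_bases = 8_400_000_000   # 8.4 Gbp, rounded figure from the SRA page
num_reads = 1_800_000         # 1.8M reads, rounded
k = 31

# Each read of length L contributes L - (k - 1) k-mers.
total_kmers = total_bases - num_reads * (k - 1)

# A k-mer occurring at least twice uses up at least two of those occurrences,
# so at most half of them can be distinct k-mers with count >= 2.
max_repeated = total_kmers // 2

reported = 6_752_490_849      # value initially read as "k-mers occurring twice or more"

print(total_kmers)              # 8346000000
print(max_repeated)             # 4173000000
print(reported > max_repeated)  # True: the reported value exceeds the bound
```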

I ran Bifrost and from the first lines in the log output:

CompactedDBG::build(): Estimated number of k-mers occurring at least once: 6747567378
CompactedDBG::build(): Estimated number of k-mers occurring twice or more: 407928353

From those lines, I could see that 6,752,490,849 was most likely the number of distinct 31-mers in the input (those occurring at least once) rather than the number of k-mers occurring at least twice. I confirm that I also get 417,501,438 31-mers in my output graph.

Then, I ran KMC3: kmc -v -k31 -ci2 SRR7167957.fastq kmc_SRR7167957 and in the log output was:

...
No. of unique k-mers: 6752490849
No. of unique counted k-mers: 417501438
...

So all is good, KMC3 and Bifrost find the same number of 31-mers :)
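To make the difference between the two KMC3 figures concrete, here is a small Python sketch on toy reads. It counts all distinct k-mers and, separately, those that reach a minimum count of 2 (the role played by -ci2 above). It assumes canonical k-mers, i.e. a k-mer and its reverse complement are counted together; the function names and the toy reads are hypothetical and do not reproduce how KMC3 or Bifrost work internally.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_summary(reads, k, min_count=2):
    """Return (number of distinct k-mers, number of distinct k-mers with count >= min_count)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    distinct = len(counts)
    kept = sum(1 for c in counts.values() if c >= min_count)
    return distinct, kept


reads = ["ACGTT", "ACGTA", "GGGAC"]   # toy data
distinct, kept = kmer_summary(reads, k=3)
print(distinct, kept)  # prints: 6 1
```

Here six distinct 3-mers are seen, but only one of them (ACG, together with its reverse complement CGT) occurs at least twice. The first number corresponds to "No. of unique k-mers" and the second to "No. of unique counted k-mers" in the log above.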

Guillaume

rwittler commented 1 year ago

Hi Guillaume,

I am so sorry for bothering you with this. I don't know exactly what happened, but SANS also reports 407928353 k-mers. So everything is consistent now, and probably has been all along. No problem at all.

Thank you for your patience.

Best, Roland
