kmerfinder update and optimization

SchwarzMarek commented 2 months ago

Description of feature

I very much appreciate the functionality implemented with kmerfinder (that is automatic search for close genome and running Quast with it). However I'm running into several issues with current implementation in bacass

1) Kmerdb must be provided as tar.gz -> this leads to excessive storage usage and need to unpack the archive on each run of the pipeline (without -resume). I suggest to allow to pass the db directory in unpacked form. 2) The only kmerdb, which I've found to work is exactly the one stated in the bacass documentation, ((dated 2019/01/08) https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz) however, according to zenodo, this is malformed and updated version of the db is deposited at zenodo, which however, appears not to work with the pipeline. More over, this database is quite old; newer versions of kmerfinder dbs are deposited at ftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/, latest there appears to be from 10/2021 (also oldish). Even more recent is accessible https://cge.food.dtu.dk/services/KmerFinder/ from 2022 (haven't tested yet, 63GB download). 3) The need to provide --ncbi_assembly_metadata (which are updated by ncbi) leads to inconsistencies between the metadata and kmerfinder db, when assembly is made obsolete (check the venn diagram from the database and current ncbi refseq assembly metedata). I can see, that it would be problematic to have 100% 1:1 correspondence, as the updates to NCBI are frequent, but now, the pipeline fails when the best-match-assembly is not present in the metadata (I've encountered this with my data and that's why I've started digging around). Beside updating the database I have few ideas on how to obtain the assembly without need to refer to the metadata file: a) in the zenodo db, in the bacteria.name there is complete assembly id (col 3) which can be used to construct the download link directly. (This will fix some cases, as suppressed records are still available albeit not present in the metadata table). b) in newer kmerfinder dbs there is bacteria.tax which contain assembly id (also can be extracted from bacteria.name col 3), which can be used in

    i) use `NCBI datasets` (online API docs https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api/#get-/genome/accession/-accessions-/download) 
    ii) nf-core module https://nf-co.re/modules/ncbigenomedownload/ (no experience here)
    iii)  rsync/wget/aspera download of the GCF/xxx/yyy/zzz directory

venn

I'm also wondering if similar functionality could be implemented with kraken2 (and its database), so one could have one (possibly larger) database and use it for contamination screen and most similar genome identification...

I do not have experience in writing nextflow pipelines, but I'm willing to write some python scripts e.g. for interacting with NCBI api.

Daniel-VM commented 2 months ago

Hi @SchwarzMarek,

Kmerdb must be provided as tar.gz -> this leads to excessive storage usage and need to unpack the archive on each run of the pipeline (without -resume). I suggest to allow to pass the db directory in unpacked form.

The pipeline should work with the unpacked form as well. It only performs the untar operation when the kmerfinder database is provided intar.gzformat; otherwise, it uses the unpacked database directory directly. Thanks for pointing this out. I just realized that this is not clearly described in the documentation.

2. The only kmerdb, which I've found to work is exactly the one stated in the bacass documentation, ((dated 2019/01/08) https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz) however, according to zenodo, this is malformed and updated version of the db is deposited at zenodo, which however, appears not to work with the pipeline. More over, this database is quite old; newer versions of kmerfinder dbs are deposited at ftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/, latest there appears to be from 10/2021 (also oldish). Even more recent is accessible https://cge.food.dtu.dk/services/KmerFinder/ from 2022 (haven't tested yet, 63GB download).

Yes, there seems to be an issue with the Zenodo URL that worked with nf-core/bacass version v2.3.1. Have you tried using -r dev? It should be fixed there (see related PRs in issue #155 ).

3. The need to provide --ncbi_assembly_metadata (which are updated by ncbi) leads to inconsistencies between the metadata and kmerfinder db, when assembly is made obsolete (check the venn diagram from the database and current ncbi refseq assembly metedata). I can see, that it would be problematic to have 100% 1:1 correspondence, as the updates to NCBI are frequent, but now, the pipeline fails when the best-match-assembly is not present in the metadata (I've encountered this with my data and that's why I've started digging around). Beside updating the database I have few ideas on how to obtain the assembly without need to refer to the metadata file: a) in the zenodo db, in the bacteria.name there is complete assembly id (col 3) which can be used to construct the download link directly. (This will fix some cases, as suppressed records are still available albeit not present in the metadata table). b) in newer kmerfinder dbs there is bacteria.tax which contain assembly id (also can be extracted from bacteria.name col 3), which can be used in

Super interesting! Let me look into it next week (I'm currently trying to release version 2.4.0, as the dev branch has many new implementations not present in 2.3.1).

I'm also wondering if similar functionality could be implemented with kraken2 (and its database), so one could have one (possibly larger) database and use it for contamination screen and most similar genome identification...

I do not have experience in writing nextflow pipelines, but I'm willing to write some python scripts e.g. for interacting with NCBI api.

Sure, I’m open to discussing this! I think it would definitely enhance the pipeline as well. We should open a specific issue for this and work on it there. What do you think?

SchwarzMarek commented 1 month ago

Hi, sorry for the delayed response. I'm testing the dev version now.

ad 1) Thanks for clarification; when I've tried passing a directory It didn't work (I've used relative path); it works when I pass absolute path. ad 2) The kmerdb from 2022 works (tested on bacteria portion of the database)

On the previous issue with canu (#164) I can confirm, that it now works in dev.

Best Marek

SchwarzMarek commented 1 month ago

Hi, I have another note relating to kmerfinder (and it's entirely possible, that it functions as intended, however, I find it confusing).

When I'm processing multiple genomes at once, kmerfinder finds most similar genome: full results in Kmerfinder/[sample]/[sample]_results.txt and summary in Kmerfinder/kmerfinder_summary.csv. These genomes are then used for analysis with QUAST/runs_per_reference.

The issue is that I see 4 different genomes in kmerfinder summary but only 3 are downloaded and used with QUAST. Moreover, for the sample whose reference genome does not match between kmerfinder and QUAST I cannot found the reference genome in the full kmerfinder results (which I could understand as selecting most common references as intersects between the results).

I would expect to see references for QUAST to be selected based on best matches from kmerfinder.

The zip file contains kmerfinder full result for sample 82 (which is the one with unexpected quast reference); and quast log to show that sample 82 aligned with unexpected reference.

Kmerfinder.zip

Best Marek

anke-king commented 1 month ago

How do I pass the database if I have installed it via bash INSTALL.sh $KmerFinder_DB viral?

when I pass the path to my directory I installed it in I get an error. Any help is appreciated!

SchwarzMarek commented 1 month ago

How do I pass the database if I have installed it via bash INSTALL.sh $KmerFinder_DB viral?

when I pass the path to my directory I installed it in I get an error. Any help is appreciated!

Hi, not the author nor the maintainer of this, but it seems to me, that kmerfinder db for bacteria is hardcoded, see https://github.com/nf-core/bacass/blob/c81202b7702c67d0e829085685083847b9e59435/modules/local/kmerfinder.nf#L27

Maybe, since this is bacteria assembly pipeline, it was not expected to work with viral databases.

Best MS

Daniel-VM commented 1 month ago

Hello @SchwarzMarek,

When I'm processing multiple genomes at once, kmerfinder finds most similar genome: full results in Kmerfinder/[sample]/[sample]_results.txt and summary in Kmerfinder/kmerfinder_summary.csv. These genomes are then used for analysis with QUAST/runs_per_reference. [...]

The behavior you are experience might be related to how reference genomes are selected during the porcess. Summary of the methodology:

1. Kmerfinder process

KMERFINDER() identifies the closest reference genome (The rop-ranked genome) for each sample.
After processing all samples with the KMERFINDER() module, there is a step where samples and their corresponding reference genome IDs (GCF IDs) are grouped together based on the species name of the reference genome:

https://github.com/nf-core/bacass/blob/3f6a42da75e413a42de9a34fdf841110359a606d/subworkflows/local/kmerfinder_subworkflow.nf#L50-L62

2. FIND_DOWNLOAD_REFERENCE process.

Once samples are grouped by species, the FIND_DOWNLOAD_REFERENCE() module selects the most frequent reference genome (GCF) within each species group. This "winner" reference is chosen based on how often it appears in the group. The most frequent one is downloaded and will be assigned as reference GCF for each group of samples.
The reference genome (GCF file) with the most occurrences is then used as the reference genome in the QUAST_PER_REF()analysis.

You may see four reference genomes listed in the KmerFinder summary, but only three reference genomes used by QUAST. This happens because the FIND_DOWNLOAD_REFERENCE()function consolidates multiple samples from the same species and assigns a single reference genome to that group (the one with the most occurrences). As a result, fewer references may be used in the QUAST analysis than the total number of species or genomes listed in the KmerFinder output.

Best

SchwarzMarek commented 1 month ago

Hi @Daniel-VM , thanks for the clarification; this is the case for my data (semantic species match).

I've opened new issue for the reference genome download as suggested previously #172 and I'm closing this. Thanks for the answers :). Best MS

nf-core / bacass

kmerfinder update and optimization #170

Description of feature