Closed SchwarzMarek closed 1 month ago
Hi @SchwarzMarek,
- Kmerdb must be provided as
tar.gz
-> this leads to excessive storage usage and need to unpack the archive on each run of the pipeline (without-resume
). I suggest to allow to pass the db directory in unpacked form.
The pipeline should work with the unpacked form as well. It only performs the untar operation when the kmerfinder database is provided intar.gz
format; otherwise, it uses the unpacked database directory directly. Thanks for pointing this out. I just realized that this is not clearly described in the documentation.
2. The only kmerdb, which I've found to work is exactly the one stated in the
bacass
documentation, ((dated 2019/01/08) https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz
) however, according to zenodo, this is malformed and updated version of the db is deposited at zenodo, which however, appears not to work with the pipeline. More over, this database is quite old; newer versions ofkmerfinder
dbs are deposited atftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/
, latest there appears to be from 10/2021 (also oldish). Even more recent is accessiblehttps://cge.food.dtu.dk/services/KmerFinder/
from 2022 (haven't tested yet, 63GB download).
Yes, there seems to be an issue with the Zenodo URL that worked with nf-core/bacass version v2.3.1. Have you tried using -r dev
? It should be fixed there (see related PRs in issue #155 ).
3. The need to provide
--ncbi_assembly_metadata
(which are updated by ncbi) leads to inconsistencies between the metadata and kmerfinder db, when assembly is made obsolete (check the venn diagram from the database and current ncbi refseq assembly metedata). I can see, that it would be problematic to have 100% 1:1 correspondence, as the updates to NCBI are frequent, but now, the pipeline fails when the best-match-assembly is not present in the metadata (I've encountered this with my data and that's why I've started digging around). Beside updating the database I have few ideas on how to obtain the assembly without need to refer to the metadata file: a) in the zenodo db, in thebacteria.name
there is complete assembly id (col 3) which can be used to construct the download link directly. (This will fix some cases, assuppressed
records are still available albeit not present in the metadata table). b) in newer kmerfinder dbs there isbacteria.tax
which containassembly id
(also can be extracted frombacteria.name
col 3), which can be used in
Super interesting! Let me look into it next week (I'm currently trying to release version 2.4.0, as the dev
branch has many new implementations not present in 2.3.1).
I'm also wondering if similar functionality could be implemented with kraken2 (and its database), so one could have one (possibly larger) database and use it for contamination screen and most similar genome identification...
I do not have experience in writing
nextflow
pipelines, but I'm willing to write some python scripts e.g. for interacting with NCBI api.
Sure, I’m open to discussing this! I think it would definitely enhance the pipeline as well. We should open a specific issue for this and work on it there. What do you think?
Hi,
sorry for the delayed response. I'm testing the dev
version now.
ad 1) Thanks for clarification; when I've tried passing a directory It didn't work (I've used relative path); it works when I pass absolute path.
ad 2) The kmerdb
from 2022 works (tested on bacteria
portion of the database)
On the previous issue with canu (#164) I can confirm, that it now works in dev
.
Best Marek
Hi,
I have another note relating to kmerfinder
(and it's entirely possible, that it functions as intended, however, I find it confusing).
When I'm processing multiple genomes at once, kmerfinder
finds most similar genome: full results in Kmerfinder/[sample]/[sample]_results.txt
and summary in Kmerfinder/kmerfinder_summary.csv
. These genomes are then used for analysis with QUAST/runs_per_reference
.
The issue is that I see 4 different genomes in kmerfinder
summary but only 3 are downloaded and used with QUAST
. Moreover, for the sample whose reference genome does not match between kmerfinder
and QUAST
I cannot found the reference genome in the full kmerfinder
results (which I could understand as selecting most common references as intersects between the results).
I would expect to see references for QUAST
to be selected based on best
matches from kmerfinder
.
The zip file contains kmerfinder
full result for sample 82
(which is the one with unexpected quast reference); and quast log to show that sample 82 aligned with unexpected reference.
Best Marek
How do I pass the database if I have installed it via bash INSTALL.sh $KmerFinder_DB viral?
when I pass the path to my directory I installed it in I get an error. Any help is appreciated!
How do I pass the database if I have installed it via bash INSTALL.sh $KmerFinder_DB viral?
when I pass the path to my directory I installed it in I get an error. Any help is appreciated!
Hi,
not the author nor the maintainer of this, but it seems to me, that kmerfinder
db for bacteria is hardcoded, see https://github.com/nf-core/bacass/blob/c81202b7702c67d0e829085685083847b9e59435/modules/local/kmerfinder.nf#L27
Maybe, since this is bacteria assembly pipeline, it was not expected to work with viral databases.
Best MS
Hello @SchwarzMarek,
When I'm processing multiple genomes at once,
kmerfinder
finds most similar genome: full results inKmerfinder/[sample]/[sample]_results.txt
and summary inKmerfinder/kmerfinder_summary.csv
. These genomes are then used for analysis withQUAST/runs_per_reference
. [...]
The behavior you are experience might be related to how reference genomes are selected during the porcess. Summary of the methodology:
1. Kmerfinder process
KMERFINDER()
identifies the closest reference genome (The rop-ranked genome) for each sample.KMERFINDER()
module, there is a step where samples and their corresponding reference genome IDs (GCF IDs) are grouped together based on the species name of the reference genome:2. FIND_DOWNLOAD_REFERENCE process.
Once samples are grouped by species, the FIND_DOWNLOAD_REFERENCE()
module selects the most frequent reference genome (GCF) within each species group. This "winner" reference is chosen based on how often it appears in the group. The most frequent one is downloaded and will be assigned as reference GCF for each group of samples.
The reference genome (GCF file) with the most occurrences is then used as the reference genome in the QUAST_PER_REF()
analysis.
You may see four reference genomes listed in the KmerFinder summary, but only three reference genomes used by QUAST
. This happens because the FIND_DOWNLOAD_REFERENCE()
function consolidates multiple samples from the same species and assigns a single reference genome to that group (the one with the most occurrences). As a result, fewer references may be used in the QUAST analysis than the total number of species or genomes listed in the KmerFinder output.
Best
Hi @Daniel-VM , thanks for the clarification; this is the case for my data (semantic species match).
I've opened new issue for the reference genome download as suggested previously #172 and I'm closing this. Thanks for the answers :). Best MS
Description of feature
I very much appreciate the functionality implemented with kmerfinder (that is automatic search for close genome and running Quast with it). However I'm running into several issues with current implementation in
bacass
1) Kmerdb must be provided as
tar.gz
-> this leads to excessive storage usage and need to unpack the archive on each run of the pipeline (without-resume
). I suggest to allow to pass the db directory in unpacked form. 2) The only kmerdb, which I've found to work is exactly the one stated in thebacass
documentation, ((dated 2019/01/08) https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz
) however, according to zenodo, this is malformed and updated version of the db is deposited at zenodo, which however, appears not to work with the pipeline. More over, this database is quite old; newer versions ofkmerfinder
dbs are deposited atftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/
, latest there appears to be from 10/2021 (also oldish). Even more recent is accessiblehttps://cge.food.dtu.dk/services/KmerFinder/
from 2022 (haven't tested yet, 63GB download). 3) The need to provide--ncbi_assembly_metadata
(which are updated by ncbi) leads to inconsistencies between the metadata and kmerfinder db, when assembly is made obsolete (check the venn diagram from the database and current ncbi refseq assembly metedata). I can see, that it would be problematic to have 100% 1:1 correspondence, as the updates to NCBI are frequent, but now, the pipeline fails when the best-match-assembly is not present in the metadata (I've encountered this with my data and that's why I've started digging around). Beside updating the database I have few ideas on how to obtain the assembly without need to refer to the metadata file: a) in the zenodo db, in thebacteria.name
there is complete assembly id (col 3) which can be used to construct the download link directly. (This will fix some cases, assuppressed
records are still available albeit not present in the metadata table). b) in newer kmerfinder dbs there isbacteria.tax
which containassembly id
(also can be extracted frombacteria.name
col 3), which can be used inI'm also wondering if similar functionality could be implemented with kraken2 (and its database), so one could have one (possibly larger) database and use it for contamination screen and most similar genome identification...
I do not have experience in writing
nextflow
pipelines, but I'm willing to write some python scripts e.g. for interacting with NCBI api.