Closed muffato closed 1 month ago
`nf-core lint` overall result: Passed :white_check_mark:
Posted for pipeline commit 8c70c77
+| ✅ 134 tests passed |+
#| ❔ 24 tests were ignored |#
`black` is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks. To fix this CI test, please run:

- `black`: `pip install black`
- `black .`

Once you push these changes the test should pass, and you can hide this comment :+1:
We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!
Thanks again for your contribution!
I've added some code to achieve the goal of `taxonomiser_v2.py`, which is: find a taxon_id that is recognised by the NT database and is the closest to the species of interest.
It's implemented very differently from the script. I leverage the `taxonomy4blast.sqlite3` database that is shipped with NT and essentially lists the taxon_ids it knows about. If the species' taxon_id is not recognised, then it tries the parent, and so on up the lineage.
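
The fallback described above can be sketched roughly as follows. This is not the code from the branch: the `parents` mapping and the `known_taxids` set are hypothetical stand-ins for whatever is read from the taxonomy and from `taxonomy4blast.sqlite3`, and the function name is made up for illustration.

```python
from typing import Dict, Optional, Set


def closest_known_taxid(
    taxid: int,
    parents: Dict[int, int],
    known_taxids: Set[int],
) -> Optional[int]:
    """Return taxid itself, or its nearest ancestor, that the database knows.

    `parents` maps each taxon_id to its parent (hypothetical input), and
    `known_taxids` is the set of taxon_ids the BLAST database lists
    (hypothetical input).
    """
    seen = set()
    current: Optional[int] = taxid
    while current is not None and current not in seen:
        if current in known_taxids:
            return current
        seen.add(current)
        # A root that is its own parent would loop forever; `seen` stops that.
        current = parents.get(current)
    return None
```

With a toy lineage such as `{30: 20, 20: 10, 10: 1, 1: 1}` and a database that only knows `{10, 1}`, a query for `30` falls back to `10`; if nothing in the lineage is known, the function returns `None`.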
As far as I understand the requirements, this is the last bit that was missing to complete support for draft assemblies. I'll mark this pull-request as ready.
@eeaunin, I've rebased this branch. It now includes the fixes I've made for blast.
I had a closer look at how `-negative_taxids` has been implemented in the Snakemake pipeline and it appears quite confusing. The BlobToolKit paper (https://academic.oup.com/g3journal/article/10/4/1361/6026202) says:
> An optional filter excludes a configurable list of NCBI taxIDs (default: excludes query genus).
So the exclusion of taxids is supposed to be optional and configurable by the user.
BlobToolKit pipeline v1 has the `mask_ids` setting for excluding taxids:
https://github.com/blobtoolkit/pipeline/blob/master/v1/example.yaml
However, I couldn't find a setting for the same thing in the Snakemake pipeline v2 code. Maybe the authors just forgot to include it?
In my runs with the Snakemake pipeline, negative taxids were not used, but there are suppressed error messages relating to that buried in the run logs. In a run with a Plasmodium yoelii yoelii assembly there is this error in the logs (`/lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20230215_pyoelii_asg_cobiont_check_run/btk_busco/blastn/logs/pyoelii/run_blastn.log`):
BLAST Database error: Taxonomy ID(s) not found.Taxonomy ID(s) not found. This could be because the ID(s) provided are not at or below the species level. Please use get_species_taxids.sh to get taxids for nodes higher than species (see https://www.ncbi.nlm.nih.gov/books/NBK546209/).
Restarting blastn without taxid filter
So it ran into the error but then just quietly continued running. It is unclear to me what caused this error, as the taxid used there (352914) is at strain level.
In another run, it skipped the taxid filter because of a different error (`/lustre/scratch123/tol/teams/grit/contamination_screen/icMagCera1/20240712_icMagCera1.20240711.hap1.fa_asg_cobiont_check_run/btk_busco/blastn/logs/icMagCera1.20240711.hap1.fa/run_blastn.log`):
BLAST Database error: Taxonomy filtering is not supported in v4 BLAST dbs
Restarting blastn without taxid filter
So the filtering doesn't work if the supplied database is V4 instead of V5 but this also doesn't crash the Snakemake pipeline and just produces an error message in the logs.
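
Because the filter is dropped silently, one way to notice these cases is to scan the `run_blastn.log` files for the two messages quoted above. This is a minimal sketch that assumes only the log wording shown in this comment. The function name is hypothetical and not part of either pipeline.

```python
from typing import List

# Marker strings copied from the log excerpts quoted in this comment.
ERROR_MARKER = "BLAST Database error:"
RESTART_MARKER = "Restarting blastn without taxid filter"


def taxid_filter_failures(log_text: str) -> List[str]:
    """Return the BLAST database errors that were followed by an unfiltered restart."""
    failures = []
    last_error = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(ERROR_MARKER):
            # Remember the error until we see whether blastn was restarted.
            last_error = line
        elif line == RESTART_MARKER and last_error is not None:
            failures.append(last_error)
            last_error = None
    return failures
```

A non-empty result for a given log means that run completed without the taxid filter, even though the pipeline itself did not fail.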
I guess it would be okay if the `sanger-tol/blobtoolkit` pipeline used `-negative_taxids` in all runs with draft assemblies as long as this doesn't produce frequent crashes. But I think it would be better if the use of `-negative_taxids` was optional for draft assemblies.
@eeaunin, I've added a `--skip_taxon_filtering` flag for you. It removes the taxon filtering from all Blast searches.
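
To illustrate the effect of such a flag, here is a minimal sketch of how a blastn command line might be assembled with and without the filter. It is not the pipeline's code: `build_blastn_args` and its parameters are hypothetical, and only `-query`, `-db` and `-negative_taxids` are actual blastn options.

```python
from typing import List, Sequence


def build_blastn_args(
    query: str,
    db: str,
    negative_taxids: Sequence[int] = (),
    skip_taxon_filtering: bool = False,
) -> List[str]:
    """Assemble a blastn argument list, optionally excluding some taxids."""
    args = ["blastn", "-query", query, "-db", db]
    # Only add the filter when it is both requested and non-empty.
    if negative_taxids and not skip_taxon_filtering:
        # blastn expects multiple taxids as a comma-delimited list.
        args += ["-negative_taxids", ",".join(str(t) for t in negative_taxids)]
    return args
```

Calling it with `skip_taxon_filtering=True` returns the same command minus the `-negative_taxids` pair, so the search never depends on the database supporting taxonomy filtering.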
I've rebased the branch onto the latest stable release 0.5.1
That's good then! I think it's fine to merge the `draft_assemblies` branch to `dev` now.
On this branch, there is no input YAML file. The only mandatory parameters are:

- `--taxon`
- `--fasta`
- `--input`, to list the read files

`--accession` is optional and is used to pull assembly information from ENA into the blobDir's meta.json.

I haven't restructured the pipeline much. All the blobtools commands at the end still require a YAML file. My solution is to add a script at the beginning of the pipeline that generates the minimal YAML file required (as per https://github.com/sanger-tol/blobtoolkit/issues/77#issuecomment-1936286274). This still lets the input-check sub-workflow collect some parameters cleanly and keeps the busco sub-workflow focused on running busco + blastp.
Busco lineages are inferred from the taxonomy directly here. Like in the genome-note pipeline, I've moved away from using GoaT as GoaT is just a proxy to the NCBI taxonomy. This way, I can keep control of both the version of Busco and the list of lineages in the same place. I've also introduced the `--busco_lineages` parameter to allow precisely selecting the lineages that are used, rather than the taxonomy-based defaults.

Still a draft for now as I want to review `/nfs/team135/yy5/btk_config/taxonomiser_v2.py` and maybe incorporate some elements of it.

PR checklist
- `nf-core lint`
- `nextflow run . -profile test,docker --outdir <OUTDIR>`
- `nextflow run . -profile debug,test,docker --outdir <OUTDIR>`
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).