Hi @abdo3a .
I haven't seen that error before. Can you still try the pipeline with the code from the fixes_for_prod branch, just in case it's a side effect of something else? I've fixed quite a few bugs on that branch and was going to make the v0.3 release out of it.
If you're still seeing the issue, would you be able to check the input file (in this example /home/sharafa/Ahmep_genome/work/5f/0f95c92e19f521dc31245971f4fd9a/Ahemp.tsv)? In particular, does it have any data at all? It feels like an earlier step produced an empty file, and we may need to trace those steps back.
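A quick way to answer the "does it have any data at all" question is to count the lines that carry content. The sketch below is only an illustration: the helper name count_data_lines is made up, and treating lines that start with "#" as comments is an assumption about the file, not something the pipeline documents.

```python
from pathlib import Path


def count_data_lines(path):
    """Count lines that are neither blank nor comments.

    Assumption: lines starting with '#' are header/comment lines.
    """
    count = 0
    with Path(path).open() as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

Calling count_data_lines on the Ahemp.tsv path from the work directory and getting 0 would support the idea that an earlier step wrote an empty file.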
Thanks @muffato for the quick reply.
I tried the code from the fixes_for_prod branch as you suggested, but I got a new error from the BLAST_BLASTN process.
-[sanger-tol/blobtoolkit] Pipeline completed with errors-
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:RUN_BLASTN:BLAST_BLASTN (Ahemp)'
Caused by:
Process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:RUN_BLASTN:BLAST_BLASTN (Ahemp)` terminated with an error exit status (1)
Command executed:
if [ "false" == "true" ]; then
gzip -c -d Ahemp.chunks.fasta > Ahemp.chunks.fasta
fi
DB=`find -L ./ -name "*.nin" | sed 's/\.nin$//'`
blastn \
-num_threads 6 \
-db $DB \
-query Ahemp.chunks.fasta \
\
-outfmt '6 qseqid staxids bitscore std' -max_target_seqs 10 -max_hsps 1 -evalue 1.0e-10 -lcase_masking -dust '20 64 1' \
-out Ahemp.txt
cat <<-END_VERSIONS > versions.yml
"SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:RUN_BLASTN:BLAST_BLASTN":
blast: $(blastn -version 2>&1 | sed 's/^.*blastn: //; s/ .*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
Unable to find image 'quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0' locally
2.14.1--pl5321h6f7f691_0: Pulling from biocontainers/blast
642efca944a0: Already exists
bd9ddc54bea9: Already exists
bfa1a70cade6: Pulling fs layer
bfa1a70cade6: Verifying Checksum
bfa1a70cade6: Download complete
bfa1a70cade6: Pull complete
Digest: sha256:0fa116b90c6411d5b09cdda5ca81a857167d218c49915104e7e1588b16baedf7
Status: Downloaded newer image for quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0
USAGE
blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
[-negative_taxidlist filename] [-entrez_query entrez_query]
[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
[-subject subject_input_file] [-subject_loc range] [-query input_file]
[-out output_file] [-evalue evalue] [-word_size int_value]
[-gapopen open_penalty] [-gapextend extend_penalty]
[-perc_identity float_value] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value] [-penalty penalty]
[-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]
[-template_type type] [-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-subject_besthit]
[-window_size int_value] [-off_diagonal_range int_value]
[-use_index boolean] [-index_name string] [-lcase_masking]
[-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format]
[-show_gis] [-num_descriptions int_value] [-num_alignments int_value]
[-line_length line_length] [-html] [-sorthits sort_hits]
[-sorthsps sort_hsps] [-max_target_seqs num_sequences]
[-num_threads int_value] [-mt_mode int_value] [-remote] [-version]
DESCRIPTION
Nucleotide-Nucleotide BLAST 2.14.1+
Use '-help' to print detailed descriptions of command line arguments
========================================================================
Error: Too many positional arguments (1), the offending value: Ahemp.chunks.fasta
Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: Ahemp.chunks.fasta
Work dir:
/home/sharafa/Ahmep_genome/work/ad/bab0769e69a33818fbc5793659fa56
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
Hi again, I managed to solve the BLAST_BLASTN process issue by using the blastn module from the original branch, but I'm still struggling with BLOBTOOLKIT_WINDOWSTATS. I think it's related to the YAML file: I created my own from the released-genome example, since the link for the draft-genome example is not working.
Hi @abdo3a. What database are you using for blastn? If it's the complete NT database, then you must use the version of the BLAST_BLASTN module from the fixes_for_prod branch. This is because very large databases have many .nin files, which confuse the module.
Secondly, the Nextflow pipeline does not take the yaml file as input to configure its steps. It's only used to populate some fields like the taxonomy etc for the final blobdir. All the configuration is done via Nextflow parameters.
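On the .nin point: the command shown in the error log builds the database name by finding every *.nin file and stripping the extension. The sketch below mimics that lookup in Python so you can see how many prefixes a directory yields. The function names are hypothetical and this is not the module's code. If the lookup returned nothing, `-db $DB -query Ahemp.chunks.fasta` would collapse to `-db -query Ahemp.chunks.fasta`, which might be one way to end up with the query file reported as a positional argument, but that reading is a guess and is not confirmed in this thread.

```python
import os


def blast_db_prefixes(directory):
    """List candidate BLAST database prefixes under a directory.

    Mirrors the find/sed step in the log: every file ending in .nin is
    reported without its extension. Symlinks are followed, like find -L.
    """
    prefixes = []
    for root, _dirs, files in os.walk(directory, followlinks=True):
        for name in files:
            if name.endswith(".nin"):
                prefixes.append(os.path.join(root, name[: -len(".nin")]))
    return sorted(prefixes)


def describe(prefixes):
    """Classify the lookup result: 'none', 'single' or 'multiple'."""
    if not prefixes:
        return "none"
    if len(prefixes) == 1:
        return "single"
    return "multiple"
```

Only the "single" case gives the shell command one unambiguous value for -db; "none" and "multiple" both deserve a closer look at the database directory.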
Hi @muffato,
Yes, I'm using the full NT database, but when I use the version of the BLAST_BLASTN module from the fixes_for_prod branch, it produces this error:
Error: Too many positional arguments (1), the offending value: Ahemp.chunks.fasta
Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: Ahemp.chunks.fasta
Regarding the YAML metadata file, I noticed the comment about the Nextflow pipeline ignoring it, but when I tried running without the YAML file it produced the following error:
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:BLOBTOOLKIT_CONFIG (Ahemp)'
Caused by:
Process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:BLOBTOOLKIT_CONFIG (Ahemp)` terminated with an error exit status (1)
Command executed:
btk pipeline \
generate-config \
Ahemp \
\
--reads Ahemp
cat <<-END_VERSIONS > versions.yml
"SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:BLOBTOOLKIT_CONFIG":
blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
2024-01-31 11:03:34.557 [INFO] Fetching assembly metadata
Traceback (most recent call last):
File "/opt/conda/envs/btk_env/bin/btk", line 8, in <module>
sys.exit(cli())
File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/btk/btk.py", line 80, in cli
sys.exit(subcommand())
File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/btk/lib/pipeline.py", line 11, in main
cli("btk pipeline")
File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/blobtoolkit_pipeline.py", line 52, in cli
sys.exit(subcommand(rename))
File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/lib/generate_config.py", line 736, in main
meta = parse_assembly_meta(accession)
File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/lib/generate_config.py", line 399, in parse_assembly_meta
root = ET.fromstring(xml)
File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/defusedxml/common.py", line 126, in fromstring
parser.feed(text)
File "/opt/conda/envs/btk_env/lib/python3.9/xml/etree/ElementTree.py", line 1717, in feed
self.parser.Parse(data, False)
TypeError: a bytes-like object is required, not 'NoneType'
Work dir:
/home/sharafa/Ahmep_genome/work/c9/da89ad80fa3fcc0d1d699592ddba96
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
1) Can you print the file .command.out that is in the work directory of the BLAST_BLASTN job? There will be a line that starts with Using; what follows should be the name of the directory in which you have the NT database, followed by /nt.
2) TypeError: a bytes-like object is required, not 'NoneType' makes sense in that context, since you have a pre-curation assembly. What it is really trying to say is that Ahemp is not a valid accession number. I'll raise an error upstream so that the tool prints a clearer error message.
3) Can you paste your sample-sheet ? I wonder if something's not being parsed correctly.
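To illustrate point 2, here is a minimal sketch of what "raising an error upstream" could look like: check for a missing metadata document before handing it to the XML parser. The function name parse_assembly_xml and the message wording are hypothetical, and the sketch uses the standard-library parser rather than the defusedxml wrapper seen in the traceback; it is not the actual fix.

```python
import xml.etree.ElementTree as ET


def parse_assembly_xml(accession, xml):
    """Parse assembly metadata, failing early when none was fetched.

    Hypothetical helper: 'xml' is whatever the metadata lookup returned,
    which is None when the accession is unknown.
    """
    if xml is None:
        raise ValueError(
            f"No assembly metadata found for '{accession}'. "
            "Is it a valid accession number?"
        )
    return ET.fromstring(xml)
```

With a guard like this, a draft assembly named Ahemp would produce a message about the accession instead of a TypeError from deep inside the parser.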
Ahemp as TAG.
sample,datatype,datafile
Ahemp,ont,/home/sharafa/Ahmep_genome/ont.cram
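For anyone wanting to sanity-check a sample sheet like this before launching the pipeline, a small sketch follows. It only checks that the three columns seen above (sample, datatype, datafile) are present and non-empty. The function name is made up, and the pipeline's own validation may well be stricter (for example about which datatype values are accepted), which this sketch does not attempt to reproduce.

```python
import csv
import io

# Column names taken from the sample sheet pasted in this thread.
REQUIRED_COLUMNS = ["sample", "datatype", "datafile"]


def check_samplesheet(text):
    """Return a list of problems found in CSV sample-sheet text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    for row in reader:
        for column in REQUIRED_COLUMNS:
            # Short rows yield None for the absent fields.
            if not (row[column] or "").strip():
                problems.append(f"line {reader.line_num}: empty '{column}'")
    return problems
```

An empty list means the basic shape is fine; anything else points at the line to fix.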
Hi @abdo3a . Sorry for the silence !
I've given the pipeline a go with a config file and managed to get it to work on a non-accessioned genome.
First, here is the minimal Yaml file I had to provide:
assembly:
  level: bar
settings:
  foo: 0
similarity:
  diamond_blastx:
    foo: 0
taxon:
  class: class_name
  family: family_name
  genus: genus_name
  kingdom: kingdom_name
  name: species_name
  order: order_name
  phylum: phylum_name
  superkingdom: superkingdom_name
  taxid: 0
All those keys have to be present, but the values are ignored and don't matter. Everything else you would find in a typical BlobToolKit yaml file is superfluous.
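Since every key has to be present, it can help to compare your own file against the minimal example before starting a run. The sketch below works on already-parsed dictionaries (loading the YAML, for instance with PyYAML's yaml.safe_load, is left out). TEMPLATE is transcribed from the example above, and its nesting (level under assembly, foo under settings, and so on) is my reading of that example, so treat it as an assumption. The function names are hypothetical.

```python
# Minimal config transcribed from the example; nesting is assumed.
TEMPLATE = {
    "assembly": {"level": "bar"},
    "settings": {"foo": 0},
    "similarity": {"diamond_blastx": {"foo": 0}},
    "taxon": {
        "class": "class_name",
        "family": "family_name",
        "genus": "genus_name",
        "kingdom": "kingdom_name",
        "name": "species_name",
        "order": "order_name",
        "phylum": "phylum_name",
        "superkingdom": "superkingdom_name",
        "taxid": 0,
    },
}


def key_paths(mapping, prefix=()):
    """Yield the path to every leaf key of a nested dict."""
    for key, value in mapping.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from key_paths(value, path)
        else:
            yield path


def missing_keys(config, template=TEMPLATE):
    """List dotted key paths present in template but absent from config."""
    missing = []
    for path in key_paths(template):
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

Values are never compared, in line with the remark that they are ignored; only the presence of each key is checked.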
Then, to run the pipeline, your command-line from https://github.com/sanger-tol/blobtoolkit/issues/91#issue-2107682062 should work, i.e.:
- the --yaml parameter
- --accession. That's used to name various files.
- --taxon. That's used to select the relevant Busco lineages.
- --input. The one from https://github.com/sanger-tol/blobtoolkit/issues/91#issuecomment-1919090818 seems fine.
For completeness, here is the command I've used for my tests:
nextflow run ~/workspace/tol-it/nextflow/sanger-tol/blobtoolkit -profile test,singularity --yaml $PWD/test.yml --accession draft
I mentioned the fixes_for_prod branch in an earlier comment. This branch has now been merged into the dev branch, and we'll make the 0.3.0 release out of it very soon.
Hello @muffato, just reporting that the pipeline works now after following your suggestion. Thanks, I will close this.
Super, thank you for confirming, @abdo3a .
The next version will simplify usage on draft assemblies
Hi there, I'm trying to run the Nextflow pipeline on my draft genome using the command:
nextflow run sanger-tol/blobtoolkit -r 0.2.0 -resume -profile docker --input Ahemp.csv --fasta Ahemp_final.fasta --yaml Ahemp.yaml --accession Ahemp --taxon "Acropora hemprichii" --taxdump /databases/taxdump --blastp /databases/uniprot/reference_proteomes.dmnd --blastn /databases/20230316-ncbi/nt/nt --blastx /databases/uniprot/reference_proteomes.dmnd
But it always fails with the following error. Any hints?