error with FASTAWINDOWS process

abdo3a commented 5 months ago

Hi there, I'm trying to run the Nextflow pipeline on my draft genome using this command:

nextflow run sanger-tol/blobtoolkit -r 0.2.0 -resume -profile docker --input Ahemp.csv --fasta Ahemp_final.fasta --yaml Ahemp.yaml --accession Ahemp --taxon "Acropora hemprichii" --taxdump /databases/taxdump --blastp /databases/uniprot/reference_proteomes.dmnd --blastn /databases/20230316-ncbi/nt/nt --blastx /databases/uniprot/reference_proteomes.dmnd

But it always fails with the following error. Any hints?

-[sanger-tol/blobtoolkit] Pipeline completed with errors-
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)'

Caused by:
  Missing output file(s) `*_window_stats*.tsv` expected by process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)`

Command executed:

  btk pipeline window-stats \
          --in Ahemp.tsv \
          --window 0.1 --window 0.01 --window 1 --window 100000 --window 1000000 \
          --out Ahemp_window_stats.tsv

  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS":
      blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /home/sharafa/Ahmep_genome/work/5f/0f95c92e19f521dc31245971f4fd9a

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

muffato commented 5 months ago

Hi @abdo3a. I haven't seen that error before. Could you still try the pipeline with the code from the fixes_for_prod branch, just in case this is a side effect of something else? I've fixed quite a few bugs on that branch and plan to make the v0.3 release from it.

If you're still seeing the issue, would you be able to check the input file, in this example /home/sharafa/Ahmep_genome/work/5f/0f95c92e19f521dc31245971f4fd9a/Ahemp.tsv? In particular, does it have any data at all? It feels like an earlier step produced an empty file, and we may need to trace those steps back.
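
One quick way to answer the "any data at all" question is to count the data rows in the file. This is a minimal sketch, not part of the pipeline: the helper name `tsv_data_rows` is made up, and it assumes the first line of the TSV is a header, which is not confirmed in this thread.

```shell
# Hypothetical helper (not part of the pipeline): count data rows in a TSV.
# Assumption: the first line is a header; blank lines are ignored.
tsv_data_rows() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example with a throwaway file that contains only a header line.
tmp_tsv=$(mktemp)
printf 'col_a\tcol_b\n' > "$tmp_tsv"
tsv_data_rows "$tmp_tsv"    # prints 0
rm -f "$tmp_tsv"
```

Pointing the helper at the Ahemp.tsv in the work directory and getting 0 would support the idea that an earlier step produced an empty file.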

abdo3a commented 5 months ago

Thanks @muffato for the quick reply. I tried the code from the fixes_for_prod branch as you suggested, but I got a new error from the BLAST_BLASTN process.

-[sanger-tol/blobtoolkit] Pipeline completed with errors-
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:RUN_BLASTN:BLAST_BLASTN (Ahemp)'

Caused by:
  Process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:RUN_BLASTN:BLAST_BLASTN (Ahemp)` terminated with an error exit status (1)

Command executed:

  if [ "false" == "true" ]; then
      gzip -c -d Ahemp.chunks.fasta > Ahemp.chunks.fasta
  fi

  DB=`find -L ./ -name "*.nin" | sed 's/\.nin$//'`
  blastn \
      -num_threads 6 \
      -db $DB \
      -query Ahemp.chunks.fasta \
       \
      -outfmt '6 qseqid staxids bitscore std' -max_target_seqs 10 -max_hsps 1 -evalue 1.0e-10 -lcase_masking -dust '20 64 1' \
      -out Ahemp.txt

  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:RUN_BLASTN:BLAST_BLASTN":
      blast: $(blastn -version 2>&1 | sed 's/^.*blastn: //; s/ .*$//')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Unable to find image 'quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0' locally
  2.14.1--pl5321h6f7f691_0: Pulling from biocontainers/blast
  642efca944a0: Already exists
  bd9ddc54bea9: Already exists
  bfa1a70cade6: Pulling fs layer
  bfa1a70cade6: Verifying Checksum
  bfa1a70cade6: Download complete
  bfa1a70cade6: Pull complete
  Digest: sha256:0fa116b90c6411d5b09cdda5ca81a857167d218c49915104e7e1588b16baedf7
  Status: Downloaded newer image for quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0
  USAGE
    blastn [-h] [-help] [-import_search_strategy filename]
      [-export_search_strategy filename] [-task task_name] [-db database_name]
      [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
      [-negative_gilist filename] [-negative_seqidlist filename]
      [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
      [-negative_taxidlist filename] [-entrez_query entrez_query]
      [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
      [-subject subject_input_file] [-subject_loc range] [-query input_file]
      [-out output_file] [-evalue evalue] [-word_size int_value]
      [-gapopen open_penalty] [-gapextend extend_penalty]
      [-perc_identity float_value] [-qcov_hsp_perc float_value]
      [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
      [-xdrop_gap_final float_value] [-searchsp int_value] [-penalty penalty]
      [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]
      [-template_type type] [-template_length int_value] [-dust DUST_options]
      [-filtering_db filtering_database]
      [-window_masker_taxid window_masker_taxid]
      [-window_masker_db window_masker_db] [-soft_masking soft_masking]
      [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
      [-best_hit_score_edge float_value] [-subject_besthit]
      [-window_size int_value] [-off_diagonal_range int_value]
      [-use_index boolean] [-index_name string] [-lcase_masking]
      [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format]
      [-show_gis] [-num_descriptions int_value] [-num_alignments int_value]
      [-line_length line_length] [-html] [-sorthits sort_hits]
      [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
      [-num_threads int_value] [-mt_mode int_value] [-remote] [-version]

  DESCRIPTION
     Nucleotide-Nucleotide BLAST 2.14.1+

  Use '-help' to print detailed descriptions of command line arguments
  ========================================================================

  Error: Too many positional arguments (1), the offending value: Ahemp.chunks.fasta
  Error:  (CArgException::eSynopsis) Too many positional arguments (1), the offending value: Ahemp.chunks.fasta

Work dir:
  /home/sharafa/Ahmep_genome/work/ad/bab0769e69a33818fbc5793659fa56

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

abdo3a commented 5 months ago

Hi again, I managed to solve the BLAST_BLASTN issue by using the blastn module from the original branch, but I'm still struggling with BLOBTOOLKIT_WINDOWSTATS. I think it's related to the YAML file: I created my own from the released-genome example, since the link for the draft-genome example is not working.

muffato commented 5 months ago

Hi @abdo3a. Which database are you using for blastn? If it is the complete NT database, then you must use the version of the BLAST_BLASTN module from the fixes_for_prod branch. This is because very large databases are split across many .nin files, and that confuses the module.

Secondly, the Nextflow pipeline does not take the YAML file as input to configure its steps. It is only used to populate some fields, such as the taxonomy, in the final blobdir. All the configuration is done via Nextflow parameters.
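
On the first point, the command in the log above passes `$DB` unquoted, so what blastn receives depends on how many `.nin` files `find` matches. The sketch below only illustrates that shell mechanism. The helper `count_args` and the names `./nt.00` and `./nt.01` are made up, and which case applied in this run is not established in the thread.

```shell
# Hypothetical helper: report how many arguments a command line expands to.
count_args() { echo "$#"; }

# No index file matched: DB is empty, the unquoted $DB disappears, and
# "-query" lands where the database name should be.
DB=""
count_args -db $DB -query Ahemp.chunks.fasta    # prints 3

# Several index files matched: DB holds several words and each one
# becomes its own argument.
DB="./nt.00 ./nt.01"
count_args -db $DB -query Ahemp.chunks.fasta    # prints 5
```

Either way the argument list no longer has the shape `-db <one name> -query <file>`. The first case would be consistent with an error that names the query file as an unexpected positional argument, but that reading is a guess.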

abdo3a commented 5 months ago

Hi @muffato, yes, I'm using the full NT database, but when I use the version of the BLAST_BLASTN module from the fixes_for_prod branch, it produces this error:

Error: Too many positional arguments (1), the offending value: Ahemp.chunks.fasta
Error:  (CArgException::eSynopsis) Too many positional arguments (1), the offending value: Ahemp.chunks.fasta

Regarding the YAML metadata file: I noticed the comment about Nextflow ignoring the YAML file, but when I tried running the pipeline without it, I got the following error:

ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:BLOBTOOLKIT_CONFIG (Ahemp)'

Caused by:
  Process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:BLOBTOOLKIT_CONFIG (Ahemp)` terminated with an error exit status (1)

Command executed:

  btk pipeline \
      generate-config \
      Ahemp \
       \
      --reads Ahemp

  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:INPUT_CHECK:BLOBTOOLKIT_CONFIG":
      blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  2024-01-31 11:03:34.557 [INFO] Fetching assembly metadata
  Traceback (most recent call last):
    File "/opt/conda/envs/btk_env/bin/btk", line 8, in <module>
      sys.exit(cli())
    File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/btk/btk.py", line 80, in cli
      sys.exit(subcommand())
    File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/btk/lib/pipeline.py", line 11, in main
      cli("btk pipeline")
    File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/blobtoolkit_pipeline.py", line 52, in cli
      sys.exit(subcommand(rename))
    File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/lib/generate_config.py", line 736, in main
      meta = parse_assembly_meta(accession)
    File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/lib/generate_config.py", line 399, in parse_assembly_meta
      root = ET.fromstring(xml)
    File "/opt/conda/envs/btk_env/lib/python3.9/site-packages/defusedxml/common.py", line 126, in fromstring
      parser.feed(text)
    File "/opt/conda/envs/btk_env/lib/python3.9/xml/etree/ElementTree.py", line 1717, in feed
      self.parser.Parse(data, False)
  TypeError: a bytes-like object is required, not 'NoneType'

Work dir:
  /home/sharafa/Ahmep_genome/work/c9/da89ad80fa3fcc0d1d699592ddba96

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

muffato commented 5 months ago

1) Can you print the file .command.out that is in the work directory of the BLAST_BLASTN job? It should contain a line that starts with "Using", followed by the name of the directory in which you have the NT database and then /nt.

2) TypeError: a bytes-like object is required, not 'NoneType' makes sense in that context, since you have a pre-curation assembly. What it is really telling you is that Ahemp is not a valid accession number. I'll raise an error upstream so that the tool prints a clearer message.

3) Can you paste your sample sheet? I wonder if something is not being parsed correctly.

abdo3a commented 5 months ago

I can reproduce this error again. Maybe I used the wrong branch.
So what should I use for a draft genome? I used Ahemp as the tag.

Here is my sample-sheet:

sample,datatype,datafile
Ahemp,ont,/home/sharafa/Ahmep_genome/ont.cram
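
A rough shape check can rule out the simplest parsing problems before digging further. This is a sketch only: the helper name `check_samplesheet` is made up, it takes the header shown above as the expected one, and it does not validate the datatype values or whether the data files exist.

```shell
# Hypothetical helper: check that a sample sheet has the header shown in
# this thread and exactly three comma-separated fields on every line.
check_samplesheet() {
    awk -F',' '
        NR == 1 && $0 != "sample,datatype,datafile" {
            print "unexpected header: " $0; bad = 1
        }
        NF != 3 {
            print "line " NR ": expected 3 fields, found " NF; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage: check_samplesheet Ahemp.csv && echo "sample sheet shape looks fine"
```

Because the header comparison is exact, stray characters such as Windows line endings would also be reported as an unexpected header.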

muffato commented 4 months ago

Hi @abdo3a . Sorry for the silence !

I've given the pipeline a go with a config file and managed to get it to work on a non-accessioned genome.

First, here is the minimal YAML file I had to provide:

assembly:
  level: bar
settings:
  foo: 0
similarity:
  diamond_blastx:
    foo: 0
taxon:
  class: class_name
  family: family_name
  genus: genus_name
  kingdom: kingdom_name
  name: species_name
  order: order_name
  phylum: phylum_name
  superkingdom: superkingdom_name
  taxid: 0

All those keys have to be present, but the values are ignored and don't matter. Everything else you would find in a typical BlobToolKit yaml file is superfluous.
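
Since missing keys are the failure mode here, a quick pre-flight check may help. The sketch below uses a made-up helper, `check_yaml_sections`, and only looks for the four top-level sections from the file above. It does not inspect the nested keys and does not parse YAML properly.

```shell
# Hypothetical helper: confirm the four top-level sections from the
# minimal file above are present. Nested keys are NOT checked.
check_yaml_sections() {
    missing=0
    for section in assembly settings similarity taxon; do
        if ! grep -q "^${section}:" "$1"; then
            echo "missing top-level section: ${section}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_yaml_sections test.yml && echo "all sections present"
```

A plain `grep` keeps the check dependency-free, at the cost of trusting that each section name starts at the beginning of a line.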

Then, to run the pipeline, your command-line from https://github.com/sanger-tol/blobtoolkit/issues/91#issue-2107682062 should work, i.e.:

- Provide this YAML file to the --yaml parameter.
- Give the name of your assembly as --accession. That's used to name various files.
- Give the name of your species as --taxon. That's used to select the relevant BUSCO lineages.
- Give your samplesheet as --input. The one from https://github.com/sanger-tol/blobtoolkit/issues/91#issuecomment-1919090818 seems fine.

For completeness, here is the command I've used for my tests:

nextflow run ~/workspace/tol-it/nextflow/sanger-tol/blobtoolkit -profile test,singularity --yaml $PWD/test.yml --accession draft

I mentioned the branch fixes_for_prod in an earlier comment. This branch has now been merged into the dev branch, and we'll make the 0.3.0 release out of it very soon.

abdo3a commented 4 months ago

Hello @muffato, just reporting that the pipeline works now after following your suggestion. Thanks! I will close this issue.

muffato commented 4 months ago

Super, thank you for confirming, @abdo3a .

The next version will simplify usage on draft assemblies.
