jhuapl-bio / taxtriage

TaxTriage is a Nextflow workflow designed to agnostically identify and classify microbial organisms within short- or long-read metagenomic NGS data. This flexible tool was developed with various use-cases of mNGS in mind.
MIT License

top_hits.nf seems to have wrong number of arguments #28

Closed hkunerth closed 7 months ago

hkunerth commented 11 months ago

Description of the bug

It seems like some recent changes to the top hits report generation may have introduced some sort of mis-specified array:

Here's my error:

ERROR nextflow.extension.OperatorImpl - @unknown groovy.lang.MissingMethodException: No signature of method: Script_861716d0$_runScript_closure1$_closure2$_closure22.call() is applicable for argument types: (ArrayList) values: [[[id:230029461_WB, single_end:false, platform:ILLUMINA, fastq_1:/home/mdh/shared/taxtriage/231005_test/samples/230029461.Illumina.kraken.dehosted.1.fastq.gz, ...], ...]]

The log file truncates it but grabbing it from the slurm output:

ERROR ~ Invalid method invocation call with arguments: [[id:230029461_WB, single_end:false, platform:ILLUMINA, fastq_1:/home/mdh/shared/taxtriage/231005_test/samples/230029461.Illumina.kraken.dehosted.1.fastq.gz, fastq_2:/home/mdh/shared/taxtriage/231005_test/samples/230029461.Illumina.kraken.dehosted.2.fastq.gz, trim:false, directory:false, sequencing_summary:null], /panfs/jay/groups/32/mdh/shared/taxtriage/231005_test/work/66/89eb4288f373a6ad78f6d9d8079efd/230029461_WB.top_report.tsv, [/panfs/jay/groups/32/mdh/shared/taxtriage/231005_test/work/a0/587a499c14235a2d369c0e418fca23/230029461_WB.classified_1.fastq.gz, /panfs/jay/groups/32/mdh/shared/taxtriage/231005_test/work/a0/587a499c14235a2d369c0e418fca23/230029461_WB.classified_2.fastq.gz], /panfs/jay/groups/32/mdh/shared/taxtriage/231005_test/work/4e/923481f65a775d1ffa8f53afbea9bb/230029461_WB.output.references.fasta] (java.util.ArrayList) on _closure22 type

I haven't had a chance to do much digging, but the addition of the $2 variable in the top_hits.nf module might be breaking things?

Command used and terminal output

nextflow run /home/mdh/shared/software_modules/taxtriage/1.2.0/main.nf -c /home/mdh/shared/software_modules/taxtriage/1.2.0/mdh.config --input samples//Samplesheet.csv --db /home/mdh/shared/software_modules/kraken/kraken2_databases/k2_standard_230605/ --outdir tt_out --email henry.kunerth@state.mn.us --tmpdir /tmp --remove_taxids '"9606"' --max_memory 248GB --max_cpus 16 --skip_assembly FALSE --top_hits_count 50 --demux -profile singularity -with-report tt_out/tt_test_231005_report.html -with-dag ./work/tt_test_231005_taxtriage.html -resume

Relevant files

Here's the command.sh from the work directory where this breaks:

#!/bin/bash -euo pipefail

echo 230029460_WB "-----------------META variable------------------"
get_top_hits.py \
    -i "230029460_WB.filtered.report" \
    -o 230029460_WB.top_report.tsv \
    -t 50

awk -F '\t' -v id=230029460_WB \
    'BEGIN{OFS="\t"} { if (NR==1){ print "SampleTaxid", $2, $1, $4, $6} else { $5 = id""$5; print $5, $2, $1, $4, $6 }}' 230029460_WB.top_report.tsv > 230029460_WB.krakenreport_mqc.tsv

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXTRIAGE:TAXTRIAGE:TOP_HITS":
    python: $(python --version | sed 's/Python //g')
END_VERSIONS
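To see what the awk step in command.sh does with `$2`, the same program can be run on a small stand-in file. This is only a sketch: the wrapper name `make_mqc_table`, the sample id `SAMPLE_`, and the six-column input are made up for illustration, and the real top_report.tsv columns may differ. Inside the single-quoted awk program, `$2` is simply awk's second field.

```shell
# Same awk program as in command.sh, wrapped so it can be tried on any file.
# $1 = sample id, $2 = tab-separated input file (hypothetical wrapper name)
make_mqc_table() {
    awk -F '\t' -v id="$1" \
        'BEGIN{OFS="\t"} { if (NR==1){ print "SampleTaxid", $2, $1, $4, $6} else { $5 = id""$5; print $5, $2, $1, $4, $6 }}' "$2"
}

# Hypothetical six-column input; the real top_report.tsv columns may differ.
demo=$(mktemp)
printf 'c1\tc2\tc3\tc4\tc5\tc6\n' > "$demo"
printf 'a\tb\tc\td\te\tf\n' >> "$demo"

# Header row becomes: SampleTaxid, c2, c1, c4, c6
# Data row becomes:   SAMPLE_e, b, a, d, f  (id is prefixed to column 5)
make_mqc_table SAMPLE_ "$demo"
rm -f "$demo"
```

If this step runs cleanly on the real top_report.tsv in the work directory, the awk line itself is unlikely to be the source of the closure error.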

nextflow.log

System information

Nextflow version: 24.04.2
Hardware: HPC, Desktop, Cloud
Executor: slurm
Container engine: Singularity
OS: CentOS Linux
Version of nf-core/taxtriage: 1.2.0

Merritt-Brian commented 11 months ago

@hkunerth can you check the contents of 230029460_WB.filtered.report to ensure that it isn't empty? Also, check that the reference FASTA files during DOWNLOAD_ASSEMBLY are being made.
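A quick way to do the first check from inside the failing work directory is the shell's `-s` test, which is true only for a file that exists and has a size greater than zero. The helper name `report_status` is just for illustration.

```shell
# Hypothetical helper: report whether a file exists and is non-empty.
report_status() {
    if [ -s "$1" ]; then
        echo "$1: present and non-empty"
    else
        echo "$1: missing or empty"
    fi
}

# Run from the work directory of the failing TOP_HITS task.
report_status 230029460_WB.filtered.report
```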

I don't believe it's an issue with Top Hits but the way in which the fastq files are being registered in the overall pipeline.

hkunerth commented 11 months ago

The 230029460_WB.filtered.report exists and looks normal. DOWNLOAD_ASSEMBLY successfully downloaded for the other sample, 230029461, but never initiated for 230029460.

I can send you the nextflow report if it would be helpful at all. Happy to keep digging to try to solve this.

Merritt-Brian commented 10 months ago

Sure, can you pass the report privately if possible? I'm curious whether there is some underlying issue with how the system might be parsing some of the taxa downstream.

hkunerth commented 9 months ago

Hi @Merritt-Brian, I've come back to this and I think the issue is simpler than what I'd initially thought. I believe it's centered on the Samplesheet format. I had been using an older format, with the following header row:

sample,single_end,from,platform,barcode,fastq_1,fastq_2,sequencing_summary,trim

I tested some Illumina samples using the example format:

sample,platform,fastq_1,fastq_2,sequencing_summary,trim

and no longer ran into the issue. I think this is due to the older fields being put into the META variable during the samplesheet_check process and passed along to later processes which have issues with too many or too few arguments.
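One way to rule the header in or out before a run is to compare the first line of the samplesheet with the example header. This is a sketch only: `check_header` is a made-up helper, and the strict string comparison is an assumption (the pipeline itself may accept other column sets).

```shell
# Header taken from the example format quoted in this thread.
expected='sample,platform,fastq_1,fastq_2,sequencing_summary,trim'

# Hypothetical helper: succeed only if the sheet's first line matches exactly.
# tr strips a carriage return in case the CSV was saved with Windows line endings.
check_header() {
    header=$(head -n 1 "$1" | tr -d '\r')
    if [ "$header" = "$expected" ]; then
        echo "header matches the example format"
    else
        echo "header differs: $header"
        return 1
    fi
}

# Demo with the older header quoted above, written to a temporary file.
sheet=$(mktemp)
echo 'sample,single_end,from,platform,barcode,fastq_1,fastq_2,sequencing_summary,trim' > "$sheet"
check_header "$sheet" || echo "update the samplesheet before running"
rm -f "$sheet"
```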

That said, I still hit a snag when running ONT data using the updated samplesheet format. I kept the format the same, except for changing platform to OXFORD and not having a fastq_2. This generated the old error:

Nov-29 09:38:20.942 [Actor Thread 9] DEBUG nextflow.Session - Session aborted -- Cause: No signature of method: Script_b7c7d140$_runScript_closure1$_closure3$_closure24.call() is applicable for argument types: (ArrayList) values: [[[id:Specimen-5-NP-RNA, single_end:true, platform:OXFORD, fastq_1:/home/mdh/shared/taxtriage/231129_test/samples/Specimen-5-NP-RNA.dehosted.fastq.gz, ...], ...]]
Possible solutions: any(), any(), any(groovy.lang.Closure), each(groovy.lang.Closure), tap(groovy.lang.Closure), any(groovy.lang.Closure)
Nov-29 09:38:20.970 [Actor Thread 9] DEBUG nextflow.Session - The following nodes are still active:

It looks to me like there should be more arguments in that array. The successful Illumina run populates it with

[id:230028305_CSF, single_end:false, platform:ILLUMINA, fastq_1:/home/mdh/shared/taxtriage/231129_test/samples_illu/230028305.Illumina.kraken.dehosted.1.fastq.gz, fastq_2:/home/mdh/shared/taxtriage/231129_test/samples_illu/230028305.Illumina.kraken.dehosted.2.fastq.gz, trim:true, directory:false, sequencing_summary:null]

but for some reason it breaks for ONT after fastq_1.

My sample sheet just leaves the fastq_2 field blank, as it is in the ONT example here: https://github.com/jhuapl-bio/taxtriage/blob/main/examples/Samplesheet.csv but I'm wondering if this might be causing issues with the META field.

Thoughts?

hkunerth commented 9 months ago

Never mind, I ran some newly generated Illumina data and ran into the same issue. It looks to be the same as issue https://github.com/jhuapl-bio/taxtriage/issues/45, raised by @erinyoung.

Disregard my above message, but any help with this would be much appreciated. Thanks!

Merritt-Brian commented 9 months ago

@hkunerth can you provide your .nextflow.log file as well as the execution report (html) here, for the issue described in #45?

hkunerth commented 9 months ago

Here's a log from a failed run with this issue. .nextflow.log

I'm wondering if it is possible that the database that this run is using might be the cause of it. I changed a number of things in more recent runs but one was making sure it is pointed at the standard database (k2_standard_20230605) and I haven't run into this issue since then.

Thanks for the help.

Merritt-Brian commented 9 months ago

Ah, found the syntax error in your command:

--top_per_taxa = '"10239:20:S' '2:20:S"'

should be:

--top_per_taxa "10239:20:S 2:20:S"

i.e. no single quotes and no equals sign. You can also see that on startup the value for top_per_taxa is reported as "=" rather than "10239:20:S 2:20:S".
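The difference is easier to see by asking the shell how it splits each form into arguments before anything reaches Nextflow. The `show_args` helper below is hypothetical and only prints what it receives. In the broken form, `=` arrives as its own argument right after the option name, which fits the "=" value shown on startup.

```shell
# Hypothetical helper: print the argument count, then each argument in brackets.
show_args() {
    printf '%d args:' "$#"
    printf ' [%s]' "$@"
    printf '\n'
}

# Broken form: prints
# 4 args: [--top_per_taxa] [=] ["10239:20:S] [2:20:S"]
show_args --top_per_taxa = '"10239:20:S' '2:20:S"'

# Corrected form: prints
# 2 args: [--top_per_taxa] [10239:20:S 2:20:S]
show_args --top_per_taxa "10239:20:S 2:20:S"
```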