nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
187 stars 117 forks source link

When using `--vsearch_cluster`, if you have many thousands of clusters, `AMPLISEQ:FILTER_CLUSTERS` will fail with an `Argument list too long` error. #696

Closed agrier-wcm closed 7 months ago

agrier-wcm commented 9 months ago

Description of the bug

When using --vsearch_cluster, if you have many thousands of clusters, AMPLISEQ:FILTER_CLUSTERS will fail with anArgument list too long error.

The reason is line 27 in ampliseq/modules/local/filter_clusters.nf:

filt_clusters.py -t ${asv} -p ${prefix} -c ${clusters}

We're passing the list of names of individual cluster files as one long, space delimited string to the -c argument. When there are many thousands (in my case, ~6,500) of cluster file names, this breaks the script because the argument string is just too long.

My nextflow and bash scripting-foo is a bit rusty, but I did come up with a simple fix, which is to pipe in the cluster list.

Change line 27 in ampliseq/modules/local/filter_clusters.nf to: echo ${clusters} | filt_clusters.py -t ${asv} -p ${prefix} -c -

Then change line 33 in ampliseq/bin/filt_clusters.py from type=str, to:

type=argparse.FileType('r'),
default=sys.stdin,

This will read the cluster list from the pipe. Then, in that same file, set the count, prefix, and cluster_fastas variables directly:

count = parser.parse_args().count
prefix = parser.parse_args().prefix
cluster_fastas = parser.parse_args().cluster_fastas.read().rstrip().split(" ")

Use these variables throughout the script as need (lines 45, 50, 80, 110, & 111; 45 is already correct, but 44 should be changed to include .read().rstrip() as above; also deleted line 38).

There may be a more elegant solution and setting the count and prefix variables directly may be a totally unnecessary change.

Command used and terminal output

Command:

nextflow run ampliseq -profile singularity --input merged_SampleSheet.tsv --FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer CCGYCAATTYMTTTRAGTTT --metadata merged_Metadata.tsv --outdir ./clustered_picrust_merged_results --email me@email.edu --dada_ref_taxonomy silva --ignore_empty_input_files --ignore_failed_trimming --ignore_failed_filtering --picrust --vsearch_cluster --min_frequency 2 --retain_untrimmed --trunclenf 240 --trunclenr 160 --metadata_category_barplot "Whatever" --tax_agglom_max 7 --max_memory 32.GB

Output:

nf-core/ampliseq execution completed unsuccessfully!

The exit status of the task that caused the workflow execution to fail was: 1.

The full error message was:

Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:FILTER_CLUSTERS (ASV_post_clustering)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:FILTER_CLUSTERS (ASV_post_clustering)` terminated with an error exit status (1)

Command executed:

  filt_clusters.py -t ASV_table.tsv -p 'ASV_post_clustering' -c 'ASV_post_clustering.clusters.fasta0.gz ASV_post_clustering.clusters.fasta1.gz ASV_post_clustering.clusters.fasta10.gz ASV_post_clustering.clusters.fasta100.gz <THIS GOES ON FOR THOUSANDS OF FILES>'

cat <<-END_VERSIONS > versions.yml
  "NFCORE_AMPLISEQ:AMPLISEQ:FILTER_CLUSTERS":
      python: $(python --version 2>&1 | sed 's/Python //g')
      pandas: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  .command.sh: line 3: /mydir/ampliseq/bin/filt_clusters.py: Argument list too long

Work dir:
  /mydir/work/73/d3179e4acdd2e4c5b4b0d7b16daba1

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Relevant files

No response

System information

nextflow version 23.04.2.5870 ampliseq version 2.8.0 singularity profile

d4straub commented 9 months ago

Thanks for the report. Would you be able to open a PR (based of dev branch)?

agrier-wcm commented 9 months ago

I would be happy to. I'm not really sure what an appropriate test would be? It only fails at large scale. How do we do a github actions test for something like that in a reasonable way?

d4straub commented 9 months ago

I think there isnt really an appropriate github actions test for that. Some things are not reasonably tested on small datasets unfortunately. It would be sufficient for me when nothing is broken in the current tests and you tested it with your dataset. I might also test it with a random dataset (I usually dont use clustering...). It will come up in the future again in case its not solved sufficiently.

eperezv commented 9 months ago

Same error here

agrier-wcm commented 9 months ago

filt_clusters.py.zip filter_clusters.nf.zip

Attached are the two files with the necessary corrections: ampliseq/modules/local/filter_clusters.nf and ampliseq/bin/filt_clusters.py (unzip them of course)

I know sharing files like this is not best practice. I will do a PR for this in the next few days unless someone beats me to it.

d4straub commented 8 months ago

Hi there, are you still planning to do a PR? If not, maybe someone else can tackle the problem in the next few days?

d4straub commented 8 months ago

I have opened a PR as linked above. Simply added your files for now.

d4straub commented 7 months ago

Ok thats in the dev branch and will be in the next release. Thanks!