Closed agrier-wcm closed 7 months ago
Thanks for the report. Would you be able to open a PR (based off the dev branch)?
I would be happy to. I'm not really sure what an appropriate test would be? It only fails at large scale. How do we do a github actions test for something like that in a reasonable way?
I think there isn't really an appropriate GitHub Actions test for that. Some things cannot reasonably be tested on small datasets, unfortunately. It would be sufficient for me if nothing is broken in the current tests and you have tested it with your dataset. I might also test it with a random dataset (I usually don't use clustering...). It will come up again in the future in case it's not solved sufficiently.
Same error here
filt_clusters.py.zip filter_clusters.nf.zip
Attached are the two files with the necessary corrections: ampliseq/modules/local/filter_clusters.nf and ampliseq/bin/filt_clusters.py (unzip them of course)
I know sharing files like this is not best practice. I will do a PR for this in the next few days unless someone beats me to it.
Hi there, are you still planning to do a PR? If not, maybe someone else can tackle the problem in the next few days?
I have opened a PR as linked above. Simply added your files for now.
OK, that's in the dev branch and will be in the next release. Thanks!
Description of the bug
When using --vsearch_cluster, if you have many thousands of clusters, AMPLISEQ:FILTER_CLUSTERS will fail with an Argument list too long error.
The reason is line 27 in ampliseq/modules/local/filter_clusters.nf:
filt_clusters.py -t ${asv} -p ${prefix} -c ${clusters}
We're passing the list of names of individual cluster files as one long, space-delimited string to the -c argument. When there are many thousands (in my case, ~6,500) of cluster file names, this breaks the script because the argument string is just too long.
My nextflow and bash scripting-fu is a bit rusty, but I did come up with a simple fix, which is to pipe in the cluster list.
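To get a feel for the size of the problem, here is a rough sketch that estimates how long the joined argument becomes. The file-name pattern and the 128 KiB single-argument cap (Linux MAX_ARG_STRLEN with 4 KiB pages) are assumptions for illustration, not values taken from the pipeline.

```python
def joined_length(names):
    """Bytes needed to pass all names as one space-delimited argument."""
    return len(" ".join(names).encode())


# Hypothetical file names, roughly the shape per-cluster FASTA files might have.
names = [f"{i:032x}_cluster.fasta.gz" for i in range(6500)]
size = joined_length(names)

# Assumed cap on a single argument: 32 pages of 4 KiB each on typical Linux.
single_arg_cap = 32 * 4096

print(f"joined argument: {size} bytes, cap: {single_arg_cap} bytes")
print("too long" if size > single_arg_cap else "fits")
```

With names of this length, a few thousand files are already enough to exceed the assumed cap, which matches the scale at which the error showed up here.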
Change line 27 in ampliseq/modules/local/filter_clusters.nf to:
echo ${clusters} | filt_clusters.py -t ${asv} -p ${prefix} -c -
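On the Python side, one way the -c - argument could be consumed is sketched below. It assumes argparse.FileType is used for the -c option, which is a guess at the intended change rather than the actual patch. read_cluster_list is a hypothetical helper and the file names are made up.

```python
import argparse
import io
import sys


def read_cluster_list(argv):
    """Parse -c and return the whitespace-separated cluster file names."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-c",
        "--cluster-fastas",
        dest="cluster_fastas",
        type=argparse.FileType("r"),  # "-" makes argparse hand back sys.stdin
        required=True,
    )
    args = parser.parse_args(argv)
    # The whole list arrives as one line on the pipe: drop the trailing
    # newline, then split on whitespace.
    return args.cluster_fastas.read().rstrip().split()


# Simulate `echo a b | script.py -c -` without a real pipe (assumed input).
sys.stdin = io.StringIO("a_cluster.fasta.gz b_cluster.fasta.gz\n")
names = read_cluster_list(["-c", "-"])
print(names)
```

Because echo is a shell builtin in bash, the long list never has to be handed to a new process as an argument, which is presumably why piping avoids the limit.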
Then change line 33 in ampliseq/bin/filt_clusters.py from
type=str,
to:

This will read the cluster list from the pipe. Then, in that same file, set the count, prefix, and cluster_fastas variables directly:

Use these variables throughout the script as needed (lines 45, 50, 80, 110, & 111; 45 is already correct, but 44 should be changed to include .read().rstrip() as above; also deleted line 38).
There may be a more elegant solution, and setting the count and prefix variables directly may be a totally unnecessary change.
Command used and terminal output
Relevant files
No response
System information
nextflow version 23.04.2.5870
ampliseq version 2.8.0
singularity profile