Closed davised closed 1 year ago
Thanks for the report! That issue came up before in https://github.com/nf-core/ampliseq/issues/423 and it was supposed to be solved by https://github.com/nf-core/ampliseq/pull/428 in release 2.3.2. What pipeline version do you use?
I have a few other comments regarding metadata. Should I make new issues for each one? I have a bugfix for one of them, regarding the metadata column names.
If those issues are widely unrelated, make different issues. but one issue with several points should also work, as you wish.
I'm on 2.4.1.
This also became an issue in one of the qiime2 steps as the metadata did not match the sample names.
I resolved the issue by modifying my metadata file to have '.' instead of '-' and overwrote the errors. Whoops.
I joined slack so we can chat there as well.
Here is my config.
$ cat TS027-ampliseq/pipeline_info/software_versions.yml
BARRNAP:
barrnap: '0.9'
CUSTOM_DUMPSOFTWAREVERSIONS:
python: 3.10.6
yaml: '6.0'
DADA2_DENOISING:
R: 4.1.1
dada2: 1.22.0
DADA2_FILTNTRIM:
R: 4.1.1
dada2: 1.22.0
DADA2_QUALITY1:
R: 4.1.1
ShortRead: 1.52.0
dada2: 1.22.0
DADA2_TAXONOMY:
R: 4.1.1
dada2: 1.22.0
FASTQC:
fastqc: 0.11.9
FILTER_LEN_ASV:
Biostrings: 2.58.0
R: 4.0.3
FILTER_STATS:
pandas: 1.1.5
python: 3.9.1
QIIME2_INSEQ:
qiime2: 2022.11.1
RENAME_RAW_DATA_FILES:
sed: '4.7'
Workflow:
Nextflow: 22.10.6
nf-core/ampliseq: 2.4.1
$ cat TS027-ampliseq/pipeline_info/pipeline_report.txt
----------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~\
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/ampliseq v2.4.1
----------------------------------------------------
Run Name: evil_ptolemy
## nf-core/ampliseq execution completed successfully! ##
The workflow was completed at 2023-02-08T00:33:55.417792590-08:00 (duration: 17m 2s)
The command used to launch the workflow was as follows:
nextflow run /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow -params-file nf-params.json -profile singularity -work-dir /data/davised/ampliseq -resume
Pipeline Configuration:
-----------------------
- runName: evil_ptolemy
- containerEngine: singularity
- launchDir: /nfs3/Sharpton_Lab/prod/projects/davised/scalebrain
- workDir: /data/davised/ampliseq
- projectDir: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow
- userName: davised
- profile: singularity
- configFiles: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow/nextflow.config, /nfs3/Sharpton_Lab/prod/projects/davised/scalebrain/nextflow.config
- input: samplesheet.tsv
- FW_primer: GTGYCAGCMGCCGCGGTAA
- RV_primer: GGACTACNVGGGTWTCTAAT
- metadata: scalebrain_metadata.filt.tsv
- outdir: TS027-ampliseq
- ignore_empty_input_files: true
- retain_untrimmed: true
- trunclenf: 240
- trunclenr: 200
- sample_inference: pooled
- qiime_ref_taxonomy: silva=138
- filter_ssu: bac,arc
- min_len_asv: 245
- max_len_asv: 260
- min_frequency: 10
- min_samples: 3
- metadata_category: BaP.uM,Microbiome
- metadata_category_barplot: BaP.uM,Microbiome
- qiime_adonis_formula: BaP.uM*Microbiome
- skip_cutadapt: true
- email: davised@oregonstate.edu
- custom_config_base: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow/../configs/
- Date Started: 2023-02-08T00:16:53.191159451-08:00
- Date Completed: 2023-02-08T00:33:55.417792590-08:00
- Pipeline script file path: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow/main.nf
- Pipeline script hash ID: be36b18b01608464e16a85b343c2dc73
- Nextflow Version: 22.10.6
- Nextflow Build: 5844
- Nextflow Compile Timestamp: 23-01-2023 23:25 UTC
--
nf-core/ampliseq
https://github.com/nf-core/ampliseq
Ok, I did some digging.
Looks like it happens again in this module workflow/modules/local/filter_ssu.nf
$ rg stats.ssu.tsv
workflow/docs/output.md
151: - `stats.ssu.tsv`: Tracking read numbers through filtering, for each sample.
workflow/modules/local/filter_ssu.nf
16: path( "stats.ssu.tsv" ) , emit: stats
73: write.table(stats, file = "stats.ssu.tsv", row.names=FALSE, sep="\t")
Here is my output of the combined samples that have the error.
$ grep combined barrnap/stats.ssu.tsv
35702 35698 "P02A12.combined"
32359 32098 "P02H12.combined"
21319 21278 "P03A12.combined"
2530 2488 "P03H12.combined"
22729 22612 "P04A12.combined"
Instead of working around sample names that have hyphens, I suggest checking the sample names for hyphens and other non-allowed 'name' characters in R.
https://search.r-project.org/R/refmans/base/html/make.names.html
letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number.
Instead of working around sample names that have hyphens, I suggest checking the sample names for hyphens and other non-allowed 'name' characters in R.
Sounds like a great idea, if you have the time to add such a check that would be great!
I'll see what I can do!
I run my own analyses here but also provide bioinformatics support to users, so I get a double benefit from any improvements I can help with.
I'll follow up on slack.
Hi, I'm back running this pipeline again.
I tried to preempt this issue by converting hyphens to periods, but then I ran into another issue.
Running this again, looks like dots/periods are filtered out by the parse_input.nf
subworkflow. I'm commenting out https://github.com/nf-core/ampliseq/blob/78b7514ceeba80efb66b0e973e5321878cb9b0ba/subworkflows/local/parse_input.nf#L133
I don't know if that is necessary for downstream stuff, maybe a regex?, but I guess I'll find out shortly. I could easily change this step to filter out hyphens, which would resolve the original issue.
qiime2 devs recommend only alphanumeric, dot, and hyphen. Since R doesn't play well with hyphen, that leaves alphanumeric with dot as delimiter if we are restricting to both R and qiime2 requirements. https://docs.qiime2.org/2023.2/tutorials/metadata/#recommendations-for-identifiers
Ah ok so the dots aren't good for the first part of the pipeline.
$ head -n 11 TS031-ampliseq/overall_summary.tsv
sample DADA2_input filtered denoisedF denoisedR merged nonchim ssufilter_input ssufilter_output lenfilter_input lenfilter_output input_tax_filter filtered_tax_filter lost retained_percent lost_percent
EMC 13 8 6 5 5 5 NA NA NA NA NA NA NA NA NA
EMC 86 27 24 24 23 23 NA NA NA NA NA NA NA NA NA
EMC 42 12 12 10 9 9 NA NA NA NA NA NA NA NA NA
EMC 31947 30450 30099 30184 27850 24471 NA NA NA NA NA NA NA NA NA
EMC 33980 31590 31178 31353 29291 26335 NA NA NA NA NA NA NA NA NA
EMC.kit.blank.1 NA NA NA NA NA NA 5 5 5 5 5 1 4 20 80
EMC.kit.blank.2 NA NA NA NA NA NA 23 23 23 23 23 9 14 39.1304347826087 60.8695652173913
EMC.PCR.blank NA NA NA NA NA NA 9 9 9 9 9 9 0 100 0
EMC.test1.blank NA NA NA NA NA NA 24471 24471 24471 24471 24471 24320 151 99.3829430754771 0.617056924522913
EMC.test2.blank NA NA NA NA NA NA 26335 26335 26335 26335 26335 26165 170 99.3544712359977 0.645528764002279
So I suppose someone needs to decide if any delimiters should be allowed.
Den ons 17 maj 2023 kl 22:03 skrev Ed Davis @.***>:
Ah ok so the dots aren't good for the first part of the pipeline.
$ head -n 11 TS031-ampliseq/overall_summary.tsv sample DADA2_input filtered denoisedF denoisedR merged nonchim ssufilter_input ssufilter_output lenfilter_input lenfilter_output input_tax_filter filtered_tax_filter lost retained_percent lost_percent EMC 13 8 6 5 5 5 NA NA NA NA NA NA NA NA NA EMC 86 27 24 24 23 23 NA NA NA NA NA NA NA NA NA EMC 42 12 12 10 9 9 NA NA NA NA NA NA NA NA NA EMC 31947 30450 30099 30184 27850 24471 NA NA NA NA NA NA NA NA NA EMC 33980 31590 31178 31353 29291 26335 NA NA NA NA NA NA NA NA NA EMC.kit.blank.1 NA NA NA NA NA NA 5 5 5 5 5 1 4 20 80 EMC.kit.blank.2 NA NA NA NA NA NA 23 23 23 23 23 9 14 39.1304347826087 60.8695652173913 EMC.PCR.blank NA NA NA NA NA NA 9 9 9 9 9 9 0 100 0 EMC.test1.blank NA NA NA NA NA NA 24471 24471 24471 24471 24471 24320 151 99.3829430754771 0.617056924522913 EMC.test2.blank NA NA NA NA NA NA 26335 26335 26335 26335 26335 26165 170 99.3544712359977 0.645528764002279
So I suppose someone needs to decide if any delimiters should be allowed.
Try underscores.
/D
— Reply to this email directly, view it on GitHub https://github.com/nf-core/ampliseq/issues/536#issuecomment-1551975824, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALTHQDC7XWU3BEUEU6XZU3XGUVKNANCNFSM6AAAAAAUU6ORSE . You are receiving this because you are subscribed to this thread.Message ID: @.***>
I hope, underscores are fine.
underscores are probably fine, yes, I was not using them because of the qiime recommendations. apparently deblur doesn't tolerate underscores at all, which is probably why they advise against them.
This here would be a way to deal with it: https://github.com/nf-core/ampliseq/pull/616 Simpler might be a regex as above added for sampleID.
"Unique sample ID must be provided: Must start with a letter, and can only contain letters, numbers or underscores; Regex: '^[a-zA-Z][a-zA-Z0-9_]+$'"
That should do. Please let me know if not.
Description of the bug
After the nonchim output, any samples with hyphen '-' in name get converted to '.', I assume by a
make.names()
call from R.Could be due to a row name or col name conversion as well.
I have a few other comments regarding metadata. Should I make new issues for each one? I have a bugfix for one of them, regarding the metadata column names.
Command used and terminal output
No response
Relevant files
No response
System information
No response