nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
187 stars 117 forks source link

Samples with '-' in name get converted to '.' after dada2 steps #536

Closed davised closed 1 year ago

davised commented 1 year ago

Description of the bug

After the nonchim output, any samples with hyphen '-' in name get converted to '.', I assume by a make.names() call from R.

Could be due to a row name or col name conversion as well.

$ grep combined overall_summary.tsv
P02A12-combined 38628   36238   36223   36198   36094   35702   NA  NA  NA  NA  NA  NA  NA NA   NA
P02A12.combined NA  NA  NA  NA  NA  NA  35702   35698   35698   35690   35690   35689   1   99.9971980947044    0.00280190529559832
P02H12-combined 36309   34070   34025   34040   32373   32359   NA  NA  NA  NA  NA  NA  NA NA   NA
P02H12.combined NA  NA  NA  NA  NA  NA  32359   32098   32098   32098   32098   31189   9097.1680478534488  2.83195214655119
P03A12-combined 23538   21565   21525   21555   21321   21319   NA  NA  NA  NA  NA  NA  NA NA   NA
P03A12.combined NA  NA  NA  NA  NA  NA  21319   21278   21278   21200   21200   20622   5797.2735849056604  2.72641509433961
P03H12-combined 2817    2624    2624    2624    2530    2530    NA  NA  NA  NA  NA  NA  NA NA   NA
P03H12.combined NA  NA  NA  NA  NA  NA  2530    2488    2488    2487    2487    1858    6274.7084841174105  25.2915158825895
P04A12-combined 25051   23153   23108   23146   22770   22729   NA  NA  NA  NA  NA  NA  NA NA   NA
P04A12.combined NA  NA  NA  NA  NA  NA  22729   22612   22612   21612   21612   21067   5497.478252822506   2.52174717749398

I have a few other comments regarding metadata. Should I make new issues for each one? I have a bugfix for one of them, regarding the metadata column names.

Command used and terminal output

No response

Relevant files

No response

System information

No response

d4straub commented 1 year ago

Thanks for the report! That issue came up before in https://github.com/nf-core/ampliseq/issues/423 and it was supposed to be solved by https://github.com/nf-core/ampliseq/pull/428 in release 2.3.2. What pipeline version do you use?

I have a few other comments regarding metadata. Should I make new issues for each one? I have a bugfix for one of them, regarding the metadata column names.

If those issues are widely unrelated, make different issues. but one issue with several points should also work, as you wish.

davised commented 1 year ago

I'm on 2.4.1.

This also became an issue in one of the qiime2 steps as the metadata did not match the sample names.

I resolved the issue by modifying my metadata file to have '.' instead of '-' and overwrote the errors. Whoops.

I joined slack so we can chat there as well.

davised commented 1 year ago

Here is my config.

$ cat TS027-ampliseq/pipeline_info/software_versions.yml
BARRNAP:
  barrnap: '0.9'
CUSTOM_DUMPSOFTWAREVERSIONS:
  python: 3.10.6
  yaml: '6.0'
DADA2_DENOISING:
  R: 4.1.1
  dada2: 1.22.0
DADA2_FILTNTRIM:
  R: 4.1.1
  dada2: 1.22.0
DADA2_QUALITY1:
  R: 4.1.1
  ShortRead: 1.52.0
  dada2: 1.22.0
DADA2_TAXONOMY:
  R: 4.1.1
  dada2: 1.22.0
FASTQC:
  fastqc: 0.11.9
FILTER_LEN_ASV:
  Biostrings: 2.58.0
  R: 4.0.3
FILTER_STATS:
  pandas: 1.1.5
  python: 3.9.1
QIIME2_INSEQ:
  qiime2: 2022.11.1
RENAME_RAW_DATA_FILES:
  sed: '4.7'
Workflow:
  Nextflow: 22.10.6
  nf-core/ampliseq: 2.4.1
$ cat TS027-ampliseq/pipeline_info/pipeline_report.txt
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~\
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/ampliseq v2.4.1
----------------------------------------------------
Run Name: evil_ptolemy

## nf-core/ampliseq execution completed successfully! ##

The workflow was completed at 2023-02-08T00:33:55.417792590-08:00 (duration: 17m 2s)

The command used to launch the workflow was as follows:

  nextflow run /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow -params-file nf-params.json -profile singularity -work-dir /data/davised/ampliseq -resume

Pipeline Configuration:
-----------------------
 - runName: evil_ptolemy
 - containerEngine: singularity
 - launchDir: /nfs3/Sharpton_Lab/prod/projects/davised/scalebrain
 - workDir: /data/davised/ampliseq
 - projectDir: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow
 - userName: davised
 - profile: singularity
 - configFiles: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow/nextflow.config, /nfs3/Sharpton_Lab/prod/projects/davised/scalebrain/nextflow.config
 - input: samplesheet.tsv
 - FW_primer: GTGYCAGCMGCCGCGGTAA
 - RV_primer: GGACTACNVGGGTWTCTAAT
 - metadata: scalebrain_metadata.filt.tsv
 - outdir: TS027-ampliseq
 - ignore_empty_input_files: true
 - retain_untrimmed: true
 - trunclenf: 240
 - trunclenr: 200
 - sample_inference: pooled
 - qiime_ref_taxonomy: silva=138
 - filter_ssu: bac,arc
 - min_len_asv: 245
 - max_len_asv: 260
 - min_frequency: 10
 - min_samples: 3
 - metadata_category: BaP.uM,Microbiome
 - metadata_category_barplot: BaP.uM,Microbiome
 - qiime_adonis_formula: BaP.uM*Microbiome
 - skip_cutadapt: true
 - email: davised@oregonstate.edu
 - custom_config_base: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow/../configs/
 - Date Started: 2023-02-08T00:16:53.191159451-08:00
 - Date Completed: 2023-02-08T00:33:55.417792590-08:00
 - Pipeline script file path: /local/cluster/nextflow-pipelines/nf-core-ampliseq-2.4.1/workflow/main.nf
 - Pipeline script hash ID: be36b18b01608464e16a85b343c2dc73
 - Nextflow Version: 22.10.6
 - Nextflow Build: 5844
 - Nextflow Compile Timestamp: 23-01-2023 23:25 UTC

--
nf-core/ampliseq
https://github.com/nf-core/ampliseq
davised commented 1 year ago

Ok, I did some digging.

Looks like it happens again in this module workflow/modules/local/filter_ssu.nf

$ rg stats.ssu.tsv
workflow/docs/output.md
151:  - `stats.ssu.tsv`: Tracking read numbers through filtering, for each sample.

workflow/modules/local/filter_ssu.nf
16:    path( "stats.ssu.tsv" )      , emit: stats
73:    write.table(stats, file = "stats.ssu.tsv", row.names=FALSE, sep="\t")

Here is my output of the combined samples that have the error.

$ grep combined barrnap/stats.ssu.tsv
35702   35698   "P02A12.combined"
32359   32098   "P02H12.combined"
21319   21278   "P03A12.combined"
2530    2488    "P03H12.combined"
22729   22612   "P04A12.combined"

Instead of working around sample names that have hyphens, I suggest checking the sample names for hyphens and other non-allowed 'name' characters in R.

https://search.r-project.org/R/refmans/base/html/make.names.html

letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number.

d4straub commented 1 year ago

Instead of working around sample names that have hyphens, I suggest checking the sample names for hyphens and other non-allowed 'name' characters in R.

Sounds like a great idea, if you have the time to add such a check that would be great!

davised commented 1 year ago

I'll see what I can do!

I run my own analyses here but also provide bioinformatics support to users, so I get a double benefit from any improvements I can help with.

I'll follow up on slack.

davised commented 1 year ago

Hi, I'm back running this pipeline again.

I tried to preempt this issue by converting hyphens to periods, but then I ran into another issue.

Running this again, looks like dots/periods are filtered out by the parse_input.nf subworkflow. I'm commenting out https://github.com/nf-core/ampliseq/blob/78b7514ceeba80efb66b0e973e5321878cb9b0ba/subworkflows/local/parse_input.nf#L133

I don't know if that is necessary for downstream stuff, maybe a regex?, but I guess I'll find out shortly. I could easily change this step to filter out hyphens, which would resolve the original issue.

qiime2 devs recommend only alphanumeric, dot, and hyphen. Since R doesn't play well with hyphen, that leaves alphanumeric with dot as delimiter if we are restricting to both R and qiime2 requirements. https://docs.qiime2.org/2023.2/tutorials/metadata/#recommendations-for-identifiers

davised commented 1 year ago

Ah ok so the dots aren't good for the first part of the pipeline.

$ head -n 11 TS031-ampliseq/overall_summary.tsv
sample  DADA2_input filtered    denoisedF   denoisedR   merged  nonchim ssufilter_input ssufilter_output    lenfilter_input lenfilter_output    input_tax_filter    filtered_tax_filter lost    retained_percent    lost_percent
EMC 13  8   6   5   5   5   NA  NA  NA  NA  NA  NA  NA  NA  NA
EMC 86  27  24  24  23  23  NA  NA  NA  NA  NA  NA  NA  NA  NA
EMC 42  12  12  10  9   9   NA  NA  NA  NA  NA  NA  NA  NA  NA
EMC 31947   30450   30099   30184   27850   24471   NA  NA  NA  NA  NA  NA  NA  NA  NA
EMC 33980   31590   31178   31353   29291   26335   NA  NA  NA  NA  NA  NA  NA  NA  NA
EMC.kit.blank.1 NA  NA  NA  NA  NA  NA  5   5   5   5   5   1   4   20  80
EMC.kit.blank.2 NA  NA  NA  NA  NA  NA  23  23  23  23  23  9   14  39.1304347826087    60.8695652173913
EMC.PCR.blank   NA  NA  NA  NA  NA  NA  9   9   9   9   9   9   0   100 0
EMC.test1.blank NA  NA  NA  NA  NA  NA  24471   24471   24471   24471   24471   24320   151 99.3829430754771    0.617056924522913
EMC.test2.blank NA  NA  NA  NA  NA  NA  26335   26335   26335   26335   26335   26165   170 99.3544712359977    0.645528764002279

So I suppose someone needs to decide if any delimiters should be allowed.

erikrikarddaniel commented 1 year ago

Den ons 17 maj 2023 kl 22:03 skrev Ed Davis @.***>:

Ah ok so the dots aren't good for the first part of the pipeline.

$ head -n 11 TS031-ampliseq/overall_summary.tsv sample DADA2_input filtered denoisedF denoisedR merged nonchim ssufilter_input ssufilter_output lenfilter_input lenfilter_output input_tax_filter filtered_tax_filter lost retained_percent lost_percent EMC 13 8 6 5 5 5 NA NA NA NA NA NA NA NA NA EMC 86 27 24 24 23 23 NA NA NA NA NA NA NA NA NA EMC 42 12 12 10 9 9 NA NA NA NA NA NA NA NA NA EMC 31947 30450 30099 30184 27850 24471 NA NA NA NA NA NA NA NA NA EMC 33980 31590 31178 31353 29291 26335 NA NA NA NA NA NA NA NA NA EMC.kit.blank.1 NA NA NA NA NA NA 5 5 5 5 5 1 4 20 80 EMC.kit.blank.2 NA NA NA NA NA NA 23 23 23 23 23 9 14 39.1304347826087 60.8695652173913 EMC.PCR.blank NA NA NA NA NA NA 9 9 9 9 9 9 0 100 0 EMC.test1.blank NA NA NA NA NA NA 24471 24471 24471 24471 24471 24320 151 99.3829430754771 0.617056924522913 EMC.test2.blank NA NA NA NA NA NA 26335 26335 26335 26335 26335 26165 170 99.3544712359977 0.645528764002279

So I suppose someone needs to decide if any delimiters should be allowed.

Try underscores.

/D

— Reply to this email directly, view it on GitHub https://github.com/nf-core/ampliseq/issues/536#issuecomment-1551975824, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALTHQDC7XWU3BEUEU6XZU3XGUVKNANCNFSM6AAAAAAUU6ORSE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

d4straub commented 1 year ago

I hope, underscores are fine.

davised commented 1 year ago

underscores are probably fine, yes, I was not using them because of the qiime recommendations. apparently deblur doesn't tolerate underscores at all, which is probably why they advise against them.

d4straub commented 1 year ago

This here would be a way to deal with it: https://github.com/nf-core/ampliseq/pull/616 Simpler might be a regex as above added for sampleID.

d4straub commented 1 year ago

616 now restricts sampleID to

"Unique sample ID must be provided: Must start with a letter, and can only contain letters, numbers or underscores; Regex: '^[a-zA-Z][a-zA-Z0-9_]+$'"

That should do. Please let me know if not.