nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq

Cutadapt failing due to renaming of files #627

Closed susheelbhanu closed 1 year ago

susheelbhanu commented 1 year ago

Description of the bug

Hi,

I'm getting the below error on the cutadapt step:

cutadapt: error: You provided 3 input file names, but either one or two are expected. The file names were:
   - 'D17_1.fastq.gz'
   - 'D17_1_1.fastq.gz'
   - 'D17_1_2.fastq.gz'

This is what my sample input file for the particular sample looks like

D17_1   /hdd0/susbus/nf_core/data/hebe_16S/00.RawData/D17/D17_1.fastq.gz        /hdd0/susbus/nf_core/data/hebe_16S/00.RawData/D17/D17_2.fastq.gz

Incidentally, I looked in the tmp folder, and it looks like the renaming/splitting step is creating additional files that are confusing cutadapt. See below:

ls /ssd0/susbus/ampliseq/work/0b/ed04d4de7d514c9b832a6a3686b9d6
D17_1_1.fastq.gz  D17_1_2.fastq.gz  D17_1.fastq.gz  D17_2.fastq.gz  versions.yml

The relevant files are attached here: ampliseq.zip

Is there a way to get around this issue apart from renaming the original files from "_{1,2}" to "_R{1,2}"?

Thank you, Susheel

Command used and terminal output

My launch command: 

nextflow run nf-core/ampliseq -r 2.6.1 -profile singularity --input sample_hebe_edited.tsv --FW_primer GTGCCAGCMGCCGCGGTAA --RV_primer CCGTCAATTCCTTTGAGTTT --outdir "./hebe_16Sresults" --max_cpus 24 --max_memory 256.GB

Error message:
```bash
ERROR ~ Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:CUTADAPT_WORKFLOW:CUTADAPT_BASIC (D17_1)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:CUTADAPT_WORKFLOW:CUTADAPT_BASIC (D17_1)` terminated with an error exit status (2)

Command executed:

  cutadapt \
      --cores 6 \
      --minimum-length 1 -O 3 -e 0.1 -g GTGCCAGCMGCCGCGGTAA -G CCGTCAATTCCTTTGAGTTT --discard-untrimmed \
      -o D17_1.trimmed_1.trim.fastq.gz -p D17_1.trimmed_2.trim.fastq.gz \
      D17_1.fastq.gz D17_1_1.fastq.gz D17_1_2.fastq.gz \
      > D17_1.trimmed.cutadapt.log
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_AMPLISEQ:AMPLISEQ:CUTADAPT_WORKFLOW:CUTADAPT_BASIC":
      cutadapt: $(cutadapt --version)
  END_VERSIONS

Command exit status:
  2

Command output:
  (empty)

Command error:
  Run "cutadapt --help" to see command-line options.
  See https://cutadapt.readthedocs.io/ for full documentation.

  cutadapt: error: You provided 3 input file names, but either one or two are expected. The file names were:
   - 'D17_1.fastq.gz'
   - 'D17_1_1.fastq.gz'
   - 'D17_1_2.fastq.gz'
  Hint: If your path contains spaces, you need to enclose it in quotes

Work dir:
  /ssd0/susbus/ampliseq/work/05/51b2e73e304fcee8d66e87ffa0cde7

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
```

Relevant files

See .zip file attached above.

System information
```bash
Core Nextflow options
  revision       : 2.6.1
  runName        : elegant_archimedes
  containerEngine: singularity
  launchDir      : /ssd0/susbus/ampliseq
  workDir        : /ssd0/susbus/ampliseq/work
  projectDir     : /home/susbus/.nextflow/assets/nf-core/ampliseq
  userName       : susbus
  profile        : singularity
  configFiles    : /home/susbus/.nextflow/assets/nf-core/ampliseq/nextflow.config

Main arguments
  input          : sample_hebe_edited.tsv
  FW_primer      : GTGCCAGCMGCCGCGGTAA
  RV_primer      : CCGTCAATTCCTTTGAGTTT
  outdir         : ./hebe_16Sresults

Max job request options
  max_cpus       : 24
  max_memory     : 256.GB
```
d4straub commented 1 year ago

Thanks for reporting it, sorry for the trouble. I never encountered this issue. I am speculating that the input file is also picked up as an output file by https://github.com/nf-core/ampliseq/blob/3b252d263d101879c7077eae94a7a3d714b051aa/modules/local/rename_raw_data_files.nf#L14. This might be because the sampleID D17_1 is the base of your forward read name D17_1.fastq.gz. If that is true, changing your samplesheet from

D17_1   /hdd0/susbus/nf_core/data/hebe_16S/00.RawData/D17/D17_1.fastq.gz        /hdd0/susbus/nf_core/data/hebe_16S/00.RawData/D17/D17_2.fastq.gz

to

D17     /hdd0/susbus/nf_core/data/hebe_16S/00.RawData/D17/D17_1.fastq.gz        /hdd0/susbus/nf_core/data/hebe_16S/00.RawData/D17/D17_2.fastq.gz

should do the trick. Could you test that?
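For illustration, a minimal shell sketch of the suspected collision (the output glob pattern is an assumption about what the rename module matches, not its verified source):

```bash
# Reproduce the suspected collision outside Nextflow.
# Assumption: the rename step collects its output with "${sampleID}*.fastq.gz".
cd "$(mktemp -d)"
sampleID="D17_1"
touch D17_1.fastq.gz D17_2.fastq.gz                      # staged input files
touch "${sampleID}_1.fastq.gz" "${sampleID}_2.fastq.gz"  # renamed copies

ls ${sampleID}*.fastq.gz
# D17_1.fastq.gz  D17_1_1.fastq.gz  D17_1_2.fastq.gz  <- three matches, not two
```

With sampleID D17 the renamed copies coincide with the staged inputs, so the glob matches exactly two files and cutadapt receives the expected pair.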

susheelbhanu commented 1 year ago

Yeah, I think that was the issue indeed. I renamed my input files 🙈, 'cos I like to make things complicated. I suppose changing the sampleID in the samplesheet would have been the simplest option. Thanks for responding so quickly though.

I know someone already raised this, but maybe an input file validation step would help in future releases, to avoid these cases. I expect they might happen with replicate samples, i.e. those that one doesn't want to treat as a run in the input file.

Thanks again!

d4straub commented 1 year ago

Yes, thanks, there is already some input validation going on, and more is coming in the next release, but I think this problem wouldn't be identified by any existing test! Potential solutions will need further investigation. Let's keep this issue open because it's clearly a bug.
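One possible shape for such a pre-flight check, as a hedged shell sketch (the three-column TSV layout follows the sample line quoted earlier; the script and its warning text are illustrative, not the pipeline's actual validation):

```bash
# Hypothetical samplesheet check: flag sampleIDs whose renamed outputs
# ("${id}_1.fastq.gz" / "${id}_2.fastq.gz") would sit next to a staged
# input file that the same "${id}*.fastq.gz" glob also matches.
while IFS=$'\t' read -r id fw rv; do
    n=1
    for f in "$fw" "$rv"; do
        base=$(basename "$f")
        case "$base" in
            "${id}_${n}.fastq.gz") ;;  # already the renamed name: harmless
            "${id}"*.fastq.gz)
                echo "WARNING: sampleID '$id' clashes with file '$base'; pick another sampleID." ;;
        esac
        n=$((n + 1))
    done
done < sample_hebe_edited.tsv
```

Run against the original samplesheet, this would flag D17_1 against D17_1.fastq.gz while leaving the corrected D17 row silent.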

susheelbhanu commented 1 year ago

Great, thank you!

susheelbhanu commented 1 year ago

@d4straub quick question: I have reads from Novogene where the barcode and primer sequences were apparently removed.

When I run the following:

nextflow run nf-core/ampliseq -r 2.6.1 -profile singularity --input sample_hebe_edited.tsv --FW_primer GTGCCAGCMGCCGCGGTAA --RV_primer CCGTCAATTCCTTTGAGTTT --outdir "./hebe_16Sresults" --max_cpus 24 --max_memory 256.GB

I'm getting the below issue:

The following samples had too few reads (<1) after trimming with cutadapt:

Is it better to run with the --skip_cutadapt or the --retain_untrimmed flag?

Thanks!

d4straub commented 1 year ago

It might also be that you are using the wrong primer sequences.

Is it better to run with the --skip_cutadapt or the --retain_untrimmed flag?

If it's fine for you to potentially run ampliseq twice, use --skip_cutadapt first. If a large portion of reads (let's say >10 or 15%) is removed due to being flagged as chimeric, use --retain_untrimmed -resume instead of --skip_cutadapt. That should reduce chimeric reads considerably. Such a question would be better suited to the nf-core Slack channel #ampliseq, see https://nf-co.re/join
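For illustration, the two runs described above might look like this (flags taken from the commands earlier in the thread; resource options omitted):

```bash
# First pass: skip primer trimming entirely.
nextflow run nf-core/ampliseq -r 2.6.1 -profile singularity \
    --input sample_hebe_edited.tsv \
    --FW_primer GTGCCAGCMGCCGCGGTAA --RV_primer CCGTCAATTCCTTTGAGTTT \
    --outdir "./hebe_16Sresults" --skip_cutadapt

# If a large share of reads (>10-15%) is flagged as chimeric, rerun
# keeping untrimmed reads instead of skipping cutadapt altogether.
nextflow run nf-core/ampliseq -r 2.6.1 -profile singularity \
    --input sample_hebe_edited.tsv \
    --FW_primer GTGCCAGCMGCCGCGGTAA --RV_primer CCGTCAATTCCTTTGAGTTT \
    --outdir "./hebe_16Sresults" --retain_untrimmed -resume
```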

susheelbhanu commented 1 year ago

Awesome, thank you. To my knowledge, I'm using the primer sequences the company provided. And thanks for the link to the Slack channel; will use that going forward.

d4straub commented 1 year ago

I added a fix, linked above, to the dev branch; it will give a proper error message with the request to change the sampleID, and it will be in the next release. So I'll close this issue here.