Illumina single-end data

rki-mf1 / clean

A nextflow pipeline for decontamination of short reads, long reads and contigs

BSD 3-Clause "New" or "Revised" License


Illumina single-end data #4

Closed by hoelzer 4 years ago

MarieLataretu commented 4 years ago

I'd add an extra parameter --illumina-single-end (like --nano and --illumina), so that one can clean single- and paired-end reads in one clean run

hoelzer commented 4 years ago

Ah yes, that's a good solution!

MarieLataretu notifications@github.com wrote on Wed, 20 May 2020, 19:55:

I'd add an extra parameter --illumina-single-end (like --nano and --illumina), so that one can clean single- and paired-end reads in one clean run


MarieLataretu commented 4 years ago

I just scrolled by - Is the renaming of the reads applicable also for the single-end reads?

hoelzer commented 4 years ago

renaming of the reads? do you have an example?

MarieLataretu commented 4 years ago

We do this, before mapping:


```bash
  # this is working for ENA reads that have at the end of a read id '/1' or '/2'
  EXAMPLE_ID=\$(zcat ${reads[0]} | head -1)
  if [[ \$EXAMPLE_ID == */1 ]]; then
    if [[ ${reads[0]} =~ \\.gz\$ ]]; then
      zcat ${reads[0]} | sed 's/ /DECONTAMINATE/g' > ${name}.R1.id.fastq
      TOTALREADS_1=\$(zcat ${reads[0]} | echo \$((`wc -l`/4)))
    else
      sed 's/ /DECONTAMINATE/g' ${reads[0]} > ${name}.R1.id.fastq
      TOTALREADS_1=\$(cat ${reads[0]} | echo \$((`wc -l`/4)))
    fi
    if [[ ${reads[1]} =~ \\.gz\$ ]]; then
      zcat ${reads[1]} | sed 's/ /DECONTAMINATE/g' > ${name}.R2.id.fastq
      TOTALREADS_2=\$(zcat ${reads[1]} | echo \$((`wc -l`/4)))
    else
      sed 's/ /DECONTAMINATE/g' ${reads[1]} > ${name}.R2.id.fastq
      TOTALREADS_2=\$(cat ${reads[1]} | echo \$((`wc -l`/4)))
    fi
  else
[....]
```

But I just saw that we also do this for the ONT data, so I'll implement this for the Illumina single-end data as well!
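For illustration only, the per-file step from the snippet above (replace spaces, count reads, accept gzipped or plain input) could be sketched in Python for a single-end file. This is not code from the pipeline: the function name `mask_spaces` and the output path are hypothetical, and it assumes plain 4-line FASTQ records.

```python
import gzip

# Same token the sed call in the snippet above substitutes for spaces.
PLACEHOLDER = "DECONTAMINATE"


def mask_spaces(fastq_in, fastq_out):
    """Write an uncompressed copy of fastq_in with every space replaced by
    PLACEHOLDER and return the number of reads (line count // 4).

    Hypothetical helper; assumes 4-line FASTQ records.
    """
    # Pick the opener by file extension, like the `=~ \.gz$` test in the shell code.
    opener = gzip.open if str(fastq_in).endswith(".gz") else open
    n_lines = 0
    with opener(fastq_in, "rt") as src, open(fastq_out, "w") as dst:
        for line in src:
            # Like sed 's/ /DECONTAMINATE/g', this touches every line, not only headers.
            dst.write(line.replace(" ", PLACEHOLDER))
            n_lines += 1
    return n_lines // 4
```

For a single-end sample this would be called once, e.g. `total_reads = mask_spaces("sample.fastq.gz", "sample.id.fastq")`, where both file names are placeholders.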

hoelzer commented 4 years ago

Ah sorry, I got confused with the rnaseq pipeline ;)

Yeah, I introduced this renaming stuff because I experienced problems with some FASTQ headers. I think what we could also have is a more convenient renaming Python script or so that

- renames the reads
- saves the mapping between the original and new ids in a tsv
- restores ids based on the tsv

see: https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/bin/rename_fasta.py

So we could have a separate rename step for any FASTQ, then the filtering happens, and then we have a restore module...

maybe that's cleaner?

But I am also happy with any other simple solution
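A minimal sketch of the rename / save mapping / restore idea described above, assuming uncompressed 4-line FASTQ records. The function names, the `read` prefix and the two-column TSV layout (new id, original header) are hypothetical and not taken from the linked `rename_fasta.py` or from the pipeline.

```python
import csv


def rename_fastq(fastq_in, fastq_out, map_tsv, prefix="read"):
    """Replace each FASTQ header with a short id and record new/original ids in a TSV."""
    with open(fastq_in) as src, open(fastq_out, "w") as dst, \
            open(map_tsv, "w", newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t")
        for i, line in enumerate(src):
            if i % 4 == 0:  # header line of a 4-line FASTQ record
                new_id = f"{prefix}{i // 4 + 1}"
                # Store the original header without the leading '@'.
                writer.writerow([new_id, line[1:].rstrip("\n")])
                dst.write(f"@{new_id}\n")
            else:
                dst.write(line)


def restore_fastq(fastq_in, fastq_out, map_tsv):
    """Put the original headers back, using the TSV written by rename_fastq."""
    with open(map_tsv, newline="") as tsv:
        original = {new: old for new, old in csv.reader(tsv, delimiter="\t")}
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        for i, line in enumerate(src):
            if i % 4 == 0:
                dst.write(f"@{original[line[1:].strip()]}\n")
            else:
                dst.write(line)
```

Because the restore step looks ids up in the mapping rather than relying on read order, it should also work on a filtered file that contains only a subset of the renamed reads. A `+` line that repeats the read id is left untouched in this sketch.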

MarieLataretu commented 4 years ago

yeah, an extra process for renaming would definitely reduce code redundancy!

I'll go for the copy-paste solution for the moment and open a new issue