nmdp-bioinformatics / pipeline

Consensus assembly and allele interpretation pipeline.
GNU Lesser General Public License v3.0
7 stars 7 forks source link

Is the ngs-fastq-to-ssake step still appropriate? #58

Closed ghost closed 9 years ago

ghost commented 9 years ago

From what I see, it still is

$ ./SSAKE -h
Usage: ./SSAKE [v3.8.2]
-f  File containing all the [paired (-p 1)] reads (required)
      With -p 1:
    ! Target insert size must be indicated at the end of the header line (e.g. :200 for a 200bp insert)
    ! Paired reads must be separated by ":"
      >template_name:200
      ACGACACTATGCATAAGCAGACGAGCAGCGACGCAGCACG:GCGCACGACGCAGCACAGCAGCAGACGAC
...
-g  Fasta file containing unpaired sequence reads (optional)
ghost commented 9 years ago

@ckennedy-nmdp Could you confirm? I don't think using FASTQ for the -g argument works either, so the bash script should convert FASTQ to FASTA in the case there are unpaired reads.

ckennedy-nmdp commented 9 years ago

-g is only applicable with -p. For unpaired data the pipeline runs SSAKE with -f though I don't think it uses qualities in the assembly so it could convert to FASTA.

ghost commented 9 years ago

Sorry, I still don't follow.

https://github.com/nmdp-bioinformatics/pipeline/blob/master/process_fastq.bash#L139

uses only -f in both cases.

warrenlr commented 9 years ago

Just to confirm:

SSAKE doesn't accept fastq files and does not take base quality as input. It reads in fasta or, in the case of paired-end mode, a custom format akin to fasta.

A. When running assembler in paired-end mode (-p 1):

-f file.paired

NameOfPair:FRAGMENTSIZE read1:read2

-g file.fa (fasta file)

NameOfUnpairedRead readx

B. Running ssake in unpaired mode

-f (fasta file)

warrenlr commented 9 years ago

I forgot to mention that there are scripts in the SSAKE "tools" folder for quality trimming reads and formatting into the custom paired input required by the assembler in paired-end mode.

Below is example code that would trim (and convert fastq)

fastq-->fasta-->custom input

../tools/TQSfastq.py -f Ecoli_S1_L001_R1_001.fastq -t 20 -c 30 -e 33 ../tools/TQSfastq.py -f Ecoli_S1_L001_R2_001.fastq -t 20 -c 30 -e 33 cat Ecoli_S1_L001_R1_001.fastq.1_T20C30E33.trim.fa |perl -ne 'if(/^(>\@\S+)/){print "$1b\n";}else{print;}' >trimFIX1.fa cat Ecoli_S1_L001_R2_001.fastq.1_T20C30E33.trim.fa |perl -ne 'if(/^(>\@\S+)/){print "$1a\n";}else{print;}' >trimFIX2.fa echo ----------------------------------------------------------------------------------- echo done. Formatting fasta input for SSAKE... echo ----------------------------------------------------------------------------------- ../tools/makePairedOutput2UNEQUALfiles.pl trimFIX1.fa trimFIX2.fa 550 echo ----------------------------------------------------------------------------------- echo done. Initiating SSAKE assembly ETA 10-20min depending on system... echo ----------------------------------------------------------------------------------- time ../SSAKE -f paired.fa -g unpaired.fa -p 1 -m 80 -w 100

ghost commented 9 years ago

Thank you, @warrenlr.

@ckennedy-nmdp, this line

https://github.com/nmdp-bioinformatics/pipeline/blob/master/process_fastq.bash#L136

is passing though unpaired FASTQ formatted sequences into SSAKE using the -f option, which will not work. I'll create a new issue.