tseemann / shovill

⚡♠️ Assemble bacterial isolate genomes from Illumina paired-end reads
GNU General Public License v3.0

Documentation requested for all output files #138

Open lskatz opened 4 years ago

lskatz commented 4 years ago

Not all output files are documented when I use --keepfiles. Could they be documented, even if not in the main documentation? The additional files below are the ones I had after a skesa run; I have appended them to your table.

| Filename | Description |
| --- | --- |
| contigs.fa | The final assembly you should use |
| shovill.log | Full log file for bug reporting |
| shovill.corrections | List of post-assembly corrections |
| contigs.gfa | Assembly graph (spades) |
| contigs.fastg | Assembly graph (megahit) |
| contigs.LastGraph | Assembly graph (velvet) |
| skesa.fasta | Raw assembly (skesa) |
| spades.fasta | Raw assembled contigs (spades) |
| megahit.fasta | Raw assembly (megahit) |
| velvet.fasta | Raw assembly (velvet) |
| flash.extendedFrags.fastq.gz | |
| flash.hist | |
| flash.histogram | |
| flash.notCombined_1.fastq.gz | |
| flash.notCombined_2.fastq.gz | |
| R1.cor.fq.gz | Corrected R1 reads from Lighter |
| R1.fq.gz | Original R1 |
| R1.sub.fq.gz | Subsampled R1 using --depth |
| R2.cor.fq.gz | Corrected R2 reads from Lighter |
| R2.fq.gz | Original R2 |
| R2.sub.fq.gz | Subsampled R2 using --depth |
| shovill.bam | Alignment of corrected reads from Trimmomatic against the vanilla assembly (skesa, spades, or megahit). Reads have been sorted with Samtools and filtered with samclip (i.e., soft-clipped alignments have been removed). |
| shovill.bam.bai | Index file for shovill.bam |
| skesa.fasta.amb | |
| skesa.fasta.ann | |
| skesa.fasta.bwt | |
| skesa.fasta.fai | |
| skesa.fasta.pac | |
| skesa.fasta.sa | |
| skesa.fasta.uncorrected | |

lskatz commented 4 years ago

And my question leading into this was whether the bam file is all reads mapped against the assembly or only those trimmed/cleaned/flash'd.

tseemann commented 4 years ago

@lskatz can you look at the perl code and tell me? :)

The original reads (or the Trimmomatic-trimmed originals if --trim is set) are used for alignment/correction.

I do NOT use the FLASH ones because one of the potential errors I want to correct is mis-FLASH-ed reads.
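The rule described above can be sketched as a tiny selection function. This is only an illustration of the logic, not shovill's actual Perl code: the function `reads_for_polishing`, the `ReadPair` tuple, and the idea of passing the FLASH output as an argument are all hypothetical.

```python
from typing import NamedTuple, Optional


class ReadPair(NamedTuple):
    """A pair of FASTQ paths (hypothetical helper type)."""
    r1: str
    r2: str


def reads_for_polishing(original: ReadPair,
                        trimmed: Optional[ReadPair] = None,
                        flash_merged: Optional[str] = None) -> ReadPair:
    """Pick the reads to align back to the raw assembly.

    Trimmed reads are used when they exist (i.e. --trim was given);
    otherwise the originals are used. The FLASH-merged file is accepted
    only to make explicit that it is deliberately ignored: a wrongly
    merged read is one of the errors the correction step should catch.
    """
    del flash_merged  # intentionally unused
    return trimmed if trimmed is not None else original
```

Keeping the merged reads out of this step means the evidence used for correction is independent of any merging mistakes.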

lskatz commented 4 years ago

I took some language from

| Filename | Description |
| --- | --- |
| contigs.fa | The final assembly you should use |
| shovill.log | Full log file for bug reporting |
| shovill.corrections | List of post-assembly corrections |
| contigs.gfa | Assembly graph (spades) |
| contigs.fastg | Assembly graph (megahit) |
| contigs.LastGraph | Assembly graph (velvet) |
| skesa.fasta | Raw assembly (skesa) |
| spades.fasta | Raw assembled contigs (spades) |
| megahit.fasta | Raw assembly (megahit) |
| velvet.fasta | Raw assembly (velvet) |
| flash.extendedFrags.fastq.gz | Single-end reads from FLASH, i.e., merged reads |
| flash.hist | Numeric histogram of merged read lengths |
| flash.histogram | Visual histogram of merged read lengths |
| flash.notCombined_1.fastq.gz | R1 reads not combined by FLASH |
| flash.notCombined_2.fastq.gz | R2 reads not combined by FLASH |
| R1.cor.fq.gz | Corrected R1 reads from Lighter |
| R1.fq.gz | Original R1 |
| R1.sub.fq.gz | Subsampled R1 using --depth |
| R2.cor.fq.gz | Corrected R2 reads from Lighter |
| R2.fq.gz | Original R2 |
| R2.sub.fq.gz | Subsampled R2 using --depth |
| shovill.bam | Alignment of the original reads (or the Trimmomatic-trimmed reads if --trim was used) against the raw assembly (skesa, spades, or megahit); FLASH-merged reads are not used. Alignments have been sorted with Samtools and filtered with samclip (i.e., soft-clipped alignments have been removed). |
| shovill.bam.bai | Index file for shovill.bam |
| skesa.fasta.amb | (bwa index) text file recording occurrences of N (or other non-ACGT characters) in the reference FASTA |
| skesa.fasta.ann | (bwa index) text file recording the reference sequence names, lengths, etc. |
| skesa.fasta.bwt | (bwa index) binary, the Burrows-Wheeler transformed sequence |
| skesa.fasta.fai | Samtools index file |
| skesa.fasta.pac | (bwa index) binary, packed sequence (four bases encoded per byte) |
| skesa.fasta.sa | (bwa index) binary, suffix array index |
| skesa.fasta.uncorrected | Assembly before pilon correction |
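
Until this makes it into the documentation, the table can be used as a lookup for annotating a --keepfiles directory. Below is a minimal sketch, not part of shovill: the names `DESCRIPTIONS` and `describe` are hypothetical, the descriptions are condensed from the table, and the `*.fasta.*` patterns assume the index and `.uncorrected` files are named the same way for the other assemblers as they were in my skesa run.

```python
from fnmatch import fnmatchcase
from typing import Optional

# (pattern, description) pairs condensed from the table in this thread.
# Order matters: more specific patterns come before more general ones.
# ASSUMPTION: "*.fasta.<ext>" generalises the skesa.fasta.* names seen here.
DESCRIPTIONS = [
    ("contigs.fa", "The final assembly you should use"),
    ("contigs.gfa", "Assembly graph (spades)"),
    ("contigs.fastg", "Assembly graph (megahit)"),
    ("contigs.LastGraph", "Assembly graph (velvet)"),
    ("shovill.log", "Full log file for bug reporting"),
    ("shovill.corrections", "List of post-assembly corrections"),
    ("shovill.bam", "Read alignment against the raw assembly"),
    ("shovill.bam.bai", "Index file for shovill.bam"),
    ("flash.*", "FLASH output (merged reads, uncombined reads, histograms)"),
    ("R[12].cor.fq.gz", "Reads corrected by Lighter"),
    ("R[12].sub.fq.gz", "Reads subsampled using --depth"),
    ("R[12].fq.gz", "Original reads"),
    ("*.fasta.fai", "Samtools index file"),
    ("*.fasta.uncorrected", "Assembly before pilon correction"),
    ("*.fasta.amb", "bwa index file"),
    ("*.fasta.ann", "bwa index file"),
    ("*.fasta.bwt", "bwa index file"),
    ("*.fasta.pac", "bwa index file"),
    ("*.fasta.sa", "bwa index file"),
    ("*.fasta", "Raw assembly"),
]


def describe(filename: str) -> Optional[str]:
    """Return the description of the first matching pattern, or None."""
    for pattern, text in DESCRIPTIONS:
        if fnmatchcase(filename, pattern):
            return text
    return None


if __name__ == "__main__":
    for name in ["contigs.fa", "skesa.fasta.bwt", "R1.cor.fq.gz", "mystery.txt"]:
        print(name, "->", describe(name) or "undocumented")
```

Running `describe` over `os.listdir()` of an output directory would flag any file that still lacks documentation.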