gt1 / biobambam2

Tools for early stage alignment file processing

incomplete pairs #49

Closed nesilin closed 7 years ago

nesilin commented 7 years ago

Hi!

I have a very naive question. When using bamtofastq, the BAM file is split into read group 1 and read group 2 of paired reads, and then there are also the incomplete reads coming from group 1 and group 2.

What exactly does "incomplete" refer to in the documentation (https://github.com/gt1/biobambam2/blob/master/src/programs/bamtofastq.1)? Can you please tell me whether incomplete means unpaired (= unmatched) reads? Does this have anything to do with unmapped reads?

- outputperreadgroupsuffixO=<_o1.fq>: output file name suffix for first mates of incomplete pairs if outputperreadgroup=1. Default is _o1.fq if gz=0 and _o1.fq.gz for gz=1.
- outputperreadgroupsuffixO2=<_o2.fq>: output file name suffix for second mates of incomplete pairs if outputperreadgroup=1. Default is _o2.fq if gz=0 and _o2.fq.gz for gz=1.
- outputperreadgroupsuffixS=<_s.fq>: output file name suffix for single end reads if outputperreadgroup=1. Default is _s.fq if gz=0 and _s.fq.gz for gz=1.

Besides, what is the difference between single end reads and unmatched (orphan) reads when defining the output files of bamtofastq? Is the *_s.fastq.gz file the sum of *_o1.fastq.gz and *_o2.fastq.gz?

- S=: output file for single end reads if collation is active
- O=: output file for unmatched (orphan) first mates if collation is active
- O2=: output file for unmatched (orphan) second mates if collation is active

Thanks!

keiranmraine commented 7 years ago

Single end reads are reads where the flag indicating multiple segments is not set:

1 0x1 template having multiple segments in sequencing

See the SAM specification: section '1.4 The alignment section: mandatory fields' -> '2. FLAG'

Orphans are reads with the above flag set where only one mate of the pair was found in the input file.
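
The two definitions above can be sketched in a few lines of Python. This is only an illustration of the flag logic, not biobambam2's actual implementation: the function name `classify_reads`, the category keys, and the toy records are made up for the example, and the use of the 0x40/0x80 bits to tell first from second mates is an assumption taken from the SAM FLAG table rather than from this thread.

```python
from collections import defaultdict

# SAM FLAG bits (SAM specification, section 1.4, field 2 "FLAG")
FLAG_PAIRED = 0x1   # template having multiple segments in sequencing
FLAG_READ1 = 0x40   # first segment in the template
FLAG_READ2 = 0x80   # last segment in the template


def classify_reads(records):
    """Sort (read_name, flag) tuples into illustrative output categories.

    Hypothetical helper: the category names are not bamtofastq options.
    """
    out = {"pair_1": [], "pair_2": [], "orphan_1": [], "orphan_2": [], "single": []}
    mates_seen = defaultdict(set)  # read name -> set of mate numbers found

    for name, flag in records:
        if not flag & FLAG_PAIRED:
            # 0x1 not set: a single end read
            out["single"].append(name)
        elif flag & FLAG_READ1:
            mates_seen[name].add(1)
        elif flag & FLAG_READ2:
            mates_seen[name].add(2)
        # Paired reads with neither 0x40 nor 0x80 are ignored in this sketch.

    for name, seen in mates_seen.items():
        if seen == {1, 2}:
            # Both mates present: a complete pair
            out["pair_1"].append(name)
            out["pair_2"].append(name)
        elif seen == {1}:
            # 0x1 set, but only the first mate is in the input: orphan
            out["orphan_1"].append(name)
        else:
            # 0x1 set, but only the second mate is in the input: orphan
            out["orphan_2"].append(name)
    return out


# Toy input: one complete pair, two orphans, one single end read
records = [
    ("pairA", FLAG_PAIRED | FLAG_READ1),
    ("pairA", FLAG_PAIRED | FLAG_READ2),
    ("lonely1", FLAG_PAIRED | FLAG_READ1),
    ("lonely2", FLAG_PAIRED | FLAG_READ2),
    ("single", 0x0),
]
result = classify_reads(records)
```

Under these definitions the single end category is separate from the two orphan categories rather than their sum, and the unmapped bit (0x4) plays no part in the classification.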

nesilin commented 7 years ago

Thanks!