Sequence diffs

humburg commented 7 years ago

A number of tweaks to how information about consensus sequences is encoded in FASTQ output files:

- Coordinates of sequence differences are now 1-based.
- Sequence diffs are reported on the third line of the record (the '+' separator line) to avoid excessively long read names.
- Add a name to the beginning of the read name field. When using the CLI this is derived from the name of the input file, using the part of the file name that precedes the first '.'.
- New format for the read name field: <name>:<uid>:<uid quality>:<cluster size>:<short>:<long>:<different> where
  - name is the sample name derived from the input file
  - uid is the UID sequence
  - uid quality is the string of (ASCII-encoded) Phred scores for the UID sequence
  - cluster size is the total number of reads assigned to this cluster
  - short is the number of sequences that didn't contribute to the consensus because they were too short
  - long is the number of sequences that didn't contribute to the consensus because they were too long
  - different is the number of sequences that didn't contribute to the consensus because they were considered to be too different
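
For anyone consuming these files downstream, here is a rough sketch of how the new read name field could be split back into its parts. This is not code from the PR: `ReadName`, `sample_name` and `parse_read_name` are hypothetical names, and it assumes that neither the sample name nor the UID contains a ':'. The quality string may contain one (':' is a valid Phred+33 character), so the four counts are peeled off from the right rather than splitting on every colon.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ReadName:
    """Hypothetical container for the fields of the read name."""
    name: str
    uid: str
    uid_quality: str
    cluster_size: int
    short: int
    long: int
    different: int


def sample_name(path: str) -> str:
    """Return the part of the file name that precedes the first '.'."""
    return Path(path).name.split(".", 1)[0]


def parse_read_name(field: str) -> ReadName:
    """Split <name>:<uid>:<uid quality>:<cluster size>:<short>:<long>:<different>."""
    # Drop the FASTQ header marker if the caller passed the raw header line.
    if field.startswith("@"):
        field = field[1:]
    # The last four fields are plain integers, so take them from the right.
    head, size, short, long_, different = field.rsplit(":", 4)
    # Assumption: name and uid contain no ':'; the remainder is the quality
    # string, which is allowed to contain ':'.
    name, uid, uid_quality = head.split(":", 2)
    return ReadName(name, uid, uid_quality,
                    int(size), int(short), int(long_), int(different))


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(sample_name("/data/sampleA.R1.fastq.gz"))
    print(parse_read_name("@sampleA:ACGT:I:5I:12:1:0:2"))
```

Splitting from the right keeps the parse unambiguous even when the UID quality string happens to contain a colon, which a plain `split(":")` would not handle.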

I'm not sure that storing the UID in the name field is really ideal, but it does have the advantage that the UID will propagate to the BAM file without interfering with read mapping. It also ensures that read names are unique.

coveralls commented 7 years ago

Coverage decreased (-0.03%) to 83.266% when pulling 86f4ec3698505e9b9faafc1f90019cfdf5a1f0a4 on sequence-diffs into 17f1b023202bd644b51579bd6f470668d799dbaf on master.

coveralls commented 7 years ago

Coverage decreased (-0.03%) to 83.266% when pulling 47613a8efadd4e7ede749b3fde0f3da74421628d on sequence-diffs into 17f1b023202bd644b51579bd6f470668d799dbaf on master.

humburg / pirates

Sequence diffs #27