A number of tweaks to how information about consensus sequences is encoded in fastq output files.
Coordinates of sequence differences are now 1-based.
Sequence diffs are reported on the third line of the record to avoid excessively long read names.
A name is now added to the beginning of the read name field. When using the CLI, this is derived from the name of
the input file, using the part of the file name that precedes the first '.'.
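As a rough illustration of the naming rule above, the sample name could be derived along these lines. This is a minimal sketch, not the code in this PR: the function name `sample_name_from_path` is hypothetical, and it assumes the directory part of the path is dropped before looking for the first '.'.

```python
import os


def sample_name_from_path(path):
    """Return the part of the file name that precedes the first '.'.

    Hypothetical helper for illustration only. Assumption: the directory
    part of the path is stripped before the file name is split.
    """
    file_name = os.path.basename(path)
    # split at the first '.' only; a name without any '.' is returned whole
    return file_name.split(".", 1)[0]
```

With this sketch, an input file called `sampleA.R1.fastq.gz` would give the name `sampleA`.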
New format for the read name field:
<name>:<uid>:<uid quality>:<cluster size>:<short>:<long>:<different>
where
name is the sample name derived from the input file
uid is the UID sequence
uid quality is the string of (ASCII encoded) phred scores for the UID sequence
cluster size is the total number of reads assigned to this cluster
short is the number of sequences that didn't contribute to the consensus because they were too short
long is the number of sequences that didn't contribute to the consensus because they were too long
different is the number of sequences that didn't contribute to the consensus because they were considered to be too different.
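For downstream scripts, the read name field described above could be unpacked roughly as follows. This is a sketch under stated assumptions, not part of this PR: `parse_read_name`, `ConsensusName` and `phred_scores` are hypothetical names, the sample name is assumed to contain no ':', the input is the read name without the leading '@' of the fastq header line, and the Phred offset of 33 is an assumption. Because the UID quality string is ASCII encoded it may itself contain ':', so the four integer fields are split off from the right first.

```python
from collections import namedtuple

# Hypothetical container for the seven fields of the read name
ConsensusName = namedtuple(
    "ConsensusName",
    "name uid uid_quality cluster_size short long different",
)


def parse_read_name(read_name):
    """Split <name>:<uid>:<uid quality>:<cluster size>:<short>:<long>:<different>.

    Assumptions: no leading '@', and the sample name contains no ':'.
    """
    # The last four fields are integers, so they can be taken from the right
    # safely even if the quality string contains ':' characters.
    head, size, short, long_, different = read_name.rsplit(":", 4)
    # The UID is a nucleotide sequence, so the first two ':' are separators;
    # anything after them is the quality string.
    name, uid, uid_quality = head.split(":", 2)
    return ConsensusName(
        name, uid, uid_quality, int(size), int(short), int(long_), int(different)
    )


def phred_scores(quality, offset=33):
    """Decode an ASCII quality string; the offset of 33 is an assumption."""
    return [ord(char) - offset for char in quality]
```

For example, `parse_read_name("sampleA:ACGTAC:II:I#I:12:1:0:2")` would yield the UID `ACGTAC` with quality string `II:I#I`, a cluster size of 12, and 1, 0 and 2 sequences rejected as too short, too long and too different respectively.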
I'm not sure that storing the UID in the name field is really ideal but it does have the advantage that the UID will propagate to the BAM file without interfering with read mapping. It also ensures that read names are unique.
Coverage decreased (-0.03%) to 83.266% when pulling 86f4ec3698505e9b9faafc1f90019cfdf5a1f0a4 on sequence-diffs into 17f1b023202bd644b51579bd6f470668d799dbaf on master.
Coverage decreased (-0.03%) to 83.266% when pulling 47613a8efadd4e7ede749b3fde0f3da74421628d on sequence-diffs into 17f1b023202bd644b51579bd6f470668d799dbaf on master.