samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

Check for invalid BC tags in fastq output. #1518

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

I don't know how best to deal with such situations, so rather than a hard error we treat the barcode component the same as if no BC tag was stored - with "0". It also whinges, and you'll probably get this message many times, but take that as "encouragement" to fix the input data!

Also fixed the barcode to be uppercase, incase of e.g. "ac+gt". I'm not sure if it's required, but we may as well.

Fixes samtools/samtools#1728

Note it's still clear as mud as to what this field really should contain. The Illumina documentation is pretty atrocious here.

The old (deprecated) bcl2fastq had the most information: https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2-v2-20-software-guide-15051736-03.pdf but it says "index sequence or sample number", so a numeric is permitted here but not in our source BC tag (as per SAMtags spec). It doesn't say anything about the mark up of sequence, or even mention the ability of +, - or other separators in multi-barcodes.

The replacement tool bcl_convert is even more opaque:https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl_convert/bcl-convert-v3-7-5-software-guide-1000000163594-00.pdf. That doesn't even have a single example of what fastq looks like or acknowledge the existance of the data after the identifier.

Finally the latest is the DRAGEN bcl conversion, but that too is opaque: https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/ToolsiBCL_fDG.htm. I'm guessing it's because DRAGEN doesn't generate fastq as it goes direct from BCL to alignments.

None of these mention "casava", which was the original software and how our options and documentation is named, so a huge shrug on trying to figure out what we're meant to do.

daviesrob commented 1 year ago

The check for valid sequence bases is a bit generous, but I guess we can tighten it up later if we need to.

jkbonfield commented 1 year ago

I was struggling to work out how far to go, because I don't know what is permitted in BC tags. The spec claims "sequence", but does that include ambiguity codes? In theory it could, if we have degenerate tags and use multiple barcodes mixed together. So then it comes down to whether to check for the non-IUPAC codes, such as "J" and "Z". In the end I figured it's likely just not worth it as the problem in reality is caused by numerics and/or punctuation.