peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Examining the Undetermined reads (failed barcodes) #57

Closed peterjc closed 4 years ago

peterjc commented 5 years ago

Depending on where they were sequenced, some of our plates have Undetermined_*.fastq files where barcode demultiplexing failed. These can include recognisable ITS1 sequences, with observed abundances of up to 20k.

e.g. Running this command in the thapbi_pict prepare-reads output root folder reports the most abundant ITS1 sequence in each file, shown as the MD5 and abundance used in the read naming:

$ head -n 1 20*/Undetermined_*.prepared.fasta

These abundances seem large, but so far individual biological samples on the same plate have the same ITS1 sequence at over 3x higher levels. The sequence also appears in multiple biological samples, so nothing suggests that a single problematic barcode is the cause.
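To compare plates side by side rather than eyeballing the `head` output, the same check could be sketched in Python. This is only a rough sketch: it assumes each record is named `>MD5_abundance` and that the first record in a prepared FASTA file is the most abundant one, which is what the command above relies on. The function names `parse_top_record` and `scan` are made up for illustration and are not part of the tool.

```python
import glob


def parse_top_record(lines):
    """Return (md5, abundance) from the first FASTA title line, or None.

    Assumes the naming convention ">MD5_abundance"; any lines before the
    first ">" line (e.g. comments) are skipped.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if not fields:
                continue  # bare ">" with no identifier
            # Split on the last underscore: left is the MD5, right the count.
            md5, _, abundance = fields[0].rpartition("_")
            return md5, int(abundance)
    return None


def scan(pattern="20*/Undetermined_*.prepared.fasta"):
    """Return (abundance, md5, filename) tuples, most abundant first."""
    results = []
    for filename in glob.glob(pattern):
        with open(filename) as handle:
            top = parse_top_record(handle)
        if top is not None:
            md5, abundance = top
            results.append((abundance, md5, filename))
    return sorted(results, reverse=True)
```

Calling `scan()` from the prepare-reads output root folder would list the top sequence per Undetermined file, sorted by abundance. If the read naming differs from the assumed pattern, `int()` will raise a `ValueError` rather than silently misreport.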

These 96-well plates also have varying numbers of synthetic controls (e.g. 1 well, 9 wells, 18 wells) versus real samples, which should be positive for ITS1 (i.e. 96 minus the synthetic controls). We ought to be able to count the ITS1 and synthetic control sequences found in the undetermined category, and see whether the ratio of synthetic control wells to real ITS1 sample wells is linked to this.
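As a starting point for that comparison, here is a small sketch of the arithmetic using the well counts mentioned above (1, 9 and 18 controls on a 96-well plate). The function names are hypothetical, and the read counts for the undetermined category would have to come from the actual data; none are given here.

```python
from fractions import Fraction

PLATE_WELLS = 96


def control_to_real_ratio(control_wells, plate_wells=PLATE_WELLS):
    """Ratio of synthetic control wells to real sample wells on a plate."""
    real_wells = plate_wells - control_wells
    if control_wells < 0 or real_wells <= 0:
        raise ValueError("need 0 <= control_wells < plate_wells")
    return Fraction(control_wells, real_wells)


def read_ratio(synthetic_reads, its1_reads):
    """Ratio of synthetic control reads to real ITS1 reads, or None if no ITS1."""
    if its1_reads == 0:
        return None
    return Fraction(synthetic_reads, its1_reads)


# Well-level ratios for the plate layouts mentioned in this issue.
for wells in (1, 9, 18):
    ratio = control_to_real_ratio(wells)
    print(f"{wells:2d} control wells: {ratio} = {float(ratio):.3f}")
```

If the undetermined reads were just a random sample of the plate, `read_ratio` for the Undetermined file might be expected to track `control_to_real_ratio` for that plate; a large mismatch would hint at something barcode-specific.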

It might also be worth trying to determine if any particular barcodes are fragile in our setup (e.g. one could imagine unwanted secondary structure forming in certain cases).

Something to think about for the paper...

peterjc commented 4 years ago

The tool now ignores the Undetermined*.fastq files by default (they tend to be large and thus slow to process, and the same filename is repeated between plates, which complicates the reports).

I think on balance any problems with the Illumina barcoding are out of scope; there wasn't anything striking enough to demand a detailed investigation.