This PR fixes a serious bug in the tetranucleotide frequency (TNF) module.
The old code updated each contig's 4-mer count dictionary by accessing it through the contigs dictionary:

    for contig in contigs.values():
        start, stop, step = 0, 4, 1
        while stop <= len(contig.seq):
            kmer_fwd = contig.seq[start:stop]
            kmer_rev = str(Bio.Seq.Seq(kmer_fwd).reverse_complement())
            if kmer_fwd in kmer_counts:
                contigs[rec.id].kmers[kmer_fwd] += 1  # <- Here
            elif kmer_rev in kmer_counts:
                contigs[rec.id].kmers[kmer_rev] += 1  # <- And here
            start += step
            stop += step

Because the counts dictionary was reached through the contigs dictionary, the counts were lost at the beginning of each iteration. Consequently, only the last contig had correct 4-mer counts, and every other contig had a count of 0 for every 4-mer.
In practice, for most MAGs the PC1 mean was close to 0 and the last contig was flagged as an outlier. For MAGs with few contigs, the PC1 mean was midway between 0 and 1 and all contigs were flagged as contaminants. Indeed, in my samples every MAG had at least one contig flagged by the TNF module.
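
To make the arithmetic behind that observation concrete, here is a toy model rather than the module's actual PCA. It assumes (my assumption, not something confirmed in the code) that PC1 scores end up on a 0 to 1 scale, with every all-zero contig scoring 0 and the single correctly counted contig scoring 1. The function name `pc1_deviations` is made up for illustration.

```python
def pc1_deviations(n_contigs):
    """Toy model of the buggy state (assumed 0-1 PC1 scale, not the real PCA).

    The n-1 contigs with all-zero profiles are given a score of 0 and the
    last contig, the only one with real counts, is given a score of 1.
    Returns the mean score and each contig's absolute deviation from it.
    """
    scores = [0.0] * (n_contigs - 1) + [1.0]
    mean = sum(scores) / n_contigs
    deviations = [abs(score - mean) for score in scores]
    return mean, deviations


for n in (2, 100):
    mean, deviations = pc1_deviations(n)
    # First value: deviation of a zero-count contig; last: the final contig.
    print(n, mean, deviations[0], deviations[-1])
```

Under these assumptions the mean is 1/n: with 100 contigs it sits next to the zero-count contigs and only the last contig is far from it, while with 2 contigs the mean is 0.5 and both contigs are equally far from it.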
In the new code, the counts are reached directly via the contig object and are not lost after each iteration.

    for contig in contigs.values():
        for i in range(len(contig.seq) - 3):
            kmer_fwd = contig.seq[i : i + 4]
            if kmer_fwd in contig.kmers:
                contig.kmers[kmer_fwd] += 1  # <- Here
            else:
                kmer_rev = utility.reverse_complement(kmer_fwd)
                contig.kmers[kmer_rev] += 1  # <- And here

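
For anyone who wants to check the per-contig counting in isolation, here is a minimal, self-contained sketch of the same idea. The `Contig` dataclass, `canonical_kmers` and the local `reverse_complement` are stand-ins I wrote for illustration, not the project's real classes or its `utility.reverse_complement`. I also assumed each contig's `kmers` dictionary is pre-filled with the canonical 4-mers (the lexicographically smaller of a 4-mer and its reverse complement), and I added a guard that skips 4-mers containing non-ACGT characters, which the code in this PR does not have.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict

# Translation table for complementing A/C/G/T (illustrative stand-in).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmers(k: int = 4) -> Dict[str, int]:
    """Zero-initialised counts, one key per k-mer / reverse-complement pair."""
    counts = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        if kmer <= reverse_complement(kmer):
            counts[kmer] = 0
    return counts


@dataclass
class Contig:
    """Hypothetical contig record: a sequence plus its own count dictionary."""
    seq: str
    kmers: Dict[str, int] = field(default_factory=canonical_kmers)


def count_kmers(contigs: Dict[str, Contig], k: int = 4) -> None:
    """Update every contig's own dictionary in place."""
    for contig in contigs.values():
        for i in range(len(contig.seq) - k + 1):
            kmer_fwd = contig.seq[i : i + k]
            if kmer_fwd in contig.kmers:
                contig.kmers[kmer_fwd] += 1
            else:
                kmer_rev = reverse_complement(kmer_fwd)
                # Assumed guard: ignore k-mers with ambiguous bases such as N.
                if kmer_rev in contig.kmers:
                    contig.kmers[kmer_rev] += 1


contigs = {"c1": Contig("ACGTACGT"), "c2": Contig("TTTTT")}
count_kmers(contigs)
print(contigs["c1"].kmers["ACGT"], contigs["c2"].kmers["AAAA"])
```

Since every `Contig` owns its dictionary and the loop only touches `contig.kmers`, the counts of one contig cannot end up in, or be reset by, another.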
With the new code, no contig is flagged as an outlier by the tetra-freq module in the test FASTA (previously, the last contig was flagged as an outlier):

    $ ./run_qc.py tetra-freq example/test.fna example/test_output
    ## Counting tetranucleotides
    ## Normalizing counts
    ## Performing PCA
    ## Computing per-contig deviation from the mean along the first principal component
    ## Identifying outlier contigs
    0 flagged contigs: example/test_output/tetra-freq/flagged_contigs

I don't know whether the buggy function was the one used to tune the cutoff value. If it was, it may be worth re-evaluating the threshold.