snayfach / MAGpurify

Improvement of metagenome-assembled genomes
GNU General Public License v3.0
47 stars 12 forks source link

Fix TNF counting bug #9

Closed apcamargo closed 4 years ago

apcamargo commented 4 years ago

Hi Stephen!

This PR addresses a serious bug that affected the TNF frequency module.

The old code updated the 4-mer counting dictionary of each conting by accessing it though the contigs dictionary:

for contig in contigs.values():
    start, stop, step = 0, 4, 1
    while stop <= len(contig.seq):
        kmer_fwd = contig.seq[start:stop]
        kmer_rev = str(Bio.Seq.Seq(kmer_fwd).reverse_complement())
        if kmer_fwd in kmer_counts:
            contigs[rec.id].kmers[kmer_fwd] += 1 # <- Here
        elif kmer_rev in kmer_counts:
            contigs[rec.id].kmers[kmer_rev] += 1 # <- And here
        start += step
        stop += step

By reaching the counts dictionary though the contigs dictionary, the counts were lost at the beginning of each iteration. Consequently, only the last contig would have correct 4-mer counts and every other contigs would have 0 counts for every 4-mer.

      BM_RX_k147_8239850  BM_RX_k147_6674881  BM_RX_k147_4702068  BM_RX_k147_5939754  BM_RX_k147_10223520  ...  BM_RX_k147_12391359  BM_RX_k147_10772737  BM_RX_k147_1789218  BM_RX_k147_8881000  BM_RX_k147_3369996
AAAA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.548125
AAAC                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.618637
AAAG                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.618531
AAAT                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.474334
AACA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.340432
...                  ...                 ...                 ...                 ...                  ...  ...                  ...                  ...                 ...                 ...                 ...
TCCA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.570689
TCGA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            1.005713
TGAA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.555670
TGCA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.350127
TTAA                 0.0                 0.0                 0.0                 0.0                  0.0  ...                  0.0                  0.0                 0.0                 0.0            0.070759

The practical consequence of it was that for the majority of MAGs, the PC1 mean would be close to 0 and the last contig would be flagged as an outlier. For MAGs with few contigs, the PC1 mean would be midway between 0 and 1 and all the contigs would be flagged as contaminants. Indeed, I've noticed that in my samples no MAG had no contigs flagged by the TNF module.

In the new code, the counts are reached directly via the contig object and are not lost after each iteration.

for contig in contigs.values():
    for i in range(len(contig.seq) - 3):
        kmer_fwd = contig.seq[i : i + 4]
        if kmer_fwd in contig.kmers:
            contig.kmers[kmer_fwd] += 1 # <- Here
        else:
            kmer_rev = utility.reverse_complement(kmer_fwd)
            contig.kmers[kmer_rev] += 1 # <- And here
      BM_RX_k147_8239850  BM_RX_k147_6674881  BM_RX_k147_4702068  BM_RX_k147_5939754  BM_RX_k147_10223520  ...  BM_RX_k147_12391359  BM_RX_k147_10772737  BM_RX_k147_1789218  BM_RX_k147_8881000  BM_RX_k147_3369996
AAAA            0.483686            0.537012            0.583820            0.994783             0.521565  ...             0.574912             0.533808            0.497720            0.604319            0.476009
AAAC            0.555569            0.619985            0.641373            0.485260             0.623403  ...             0.613366             0.510083            0.584132            0.680888            0.588347
AAAG            0.566534            0.629669            0.620383            0.788548             0.591995  ...             0.641376             0.525900            0.564696            0.717938            0.577875
AAAT            0.452009            0.475608            0.479379            0.740022             0.465411  ...             0.486610             0.411230            0.451494            0.524457            0.404608
AACA            0.269256            0.321327            0.376969            0.376077             0.315984  ...             0.343238             0.233294            0.310714            0.439655            0.282749
...                  ...                 ...                 ...                 ...                  ...  ...                  ...                  ...                 ...                 ...                 ...
TCCA            0.554351            0.566063            0.577388            0.570181             0.575815  ...             0.563043             0.601028            0.565484            0.582089            0.522658
TCGA            1.090426            1.031987            0.972639            0.521655             1.050267  ...             1.020219             1.095295            1.026170            0.897422            1.013899
TGAA            0.527547            0.567164            0.557922            0.618707             0.542979  ...             0.551175             0.462633            0.514792            0.638899            0.523610
TGCA            0.347231            0.345536            0.363089            0.279025             0.362145  ...             0.341814             0.336101            0.341444            0.355676            0.331302
TTAA            0.053608            0.067567            0.077865            0.169841             0.056154  ...             0.072635             0.047450            0.061197            0.102092            0.060929

With the new code, no outlier contig is flagged by the tetra-freq module in the test FASTA (previously, the last contig was flagged as a outlier):

$ ./run_qc.py tetra-freq example/test.fna example/test_output              

## Counting tetranucleotides

## Normalizing counts

## Performing PCA

## Computing per-contig deviation from the mean along the first principal component

## Identifying outlier contigs
   0 flagged contigs: example/test_output/tetra-freq/flagged_contigs

I don't know if the buggy function was the one used to tune the cutoff value. If so, maybe it is best to re-evaluate the threshold (?).