snayfach / MAGpurify

Improvement of metagenome-assembled genomes
GNU General Public License v3.0

tetra-freq unable to handle "N" nucleotide #16

Open mmp3 opened 4 years ago

mmp3 commented 4 years ago

The tetra-freq module crashes if there is an N nucleotide in the nucleotide sequence. An example error message is:

    File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
        contig.kmers[kmer_rev] += 1
    KeyError: 'NTTC'

N nucleotides are very common in MAGs and draft genome assemblies, so this error occurs frequently, for example when working with the UHGG (Unified Human Gastrointestinal Genome) collection.

Deleting N nucleotides would create artificial adjacencies that bias the tetranucleotide frequency profile, and random imputation would introduce a similar bias. Ideally, any 4-mer containing an N would simply be ignored when constructing tetranucleotide frequency profiles.
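To illustrate the "ignore any 4-mer with an N" idea, here is a minimal standalone sketch. It is not MAGpurify's actual code: the function name `count_tetramers`, the use of a `Counter`, and the choice to collapse each k-mer with its reverse complement are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical helper, not part of MAGpurify.
VALID_BASES = frozenset("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_tetramers(seq, k=4):
    """Count canonical k-mers, skipping every window that contains
    a base other than A, C, G or T (for example N)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip the window instead of deleting or imputing the ambiguous
        # base, so no artificial adjacency is created.
        if not VALID_BASES.issuperset(kmer):
            continue
        rev = kmer.translate(COMPLEMENT)[::-1]
        # Assumption: a k-mer and its reverse complement are counted
        # under one key (the lexicographically smaller of the two).
        counts[min(kmer, rev)] += 1
    return counts
```

For the sequence `ACGTNTTCA`, only the windows `ACGT` and `TTCA` are counted; the four windows overlapping the `N` (including `NTTC`) are dropped rather than raising a `KeyError`.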

jvollme commented 2 years ago

I noticed the same. Since scaffolding of metagenome contigs based on paired-end linkage is pretty standard, I would say this is a relatively important bug.