The tetra-freq module crashes if the input nucleotide sequence contains an N.
An example error message is:
File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
contig.kmers[kmer_rev] += 1
KeyError: 'NTTC'
N nucleotides are very common in MAGs and draft genome assemblies, so this causes errors frequently, such as when working with the UHGG.
Deletion of N nucleotides will cause artificial adjacencies that will bias the tetra-nucleotide frequency profile. Random imputation would have similar bias. Ideally, any 4-mer with an N would just be ignored when constructing tetra-nucleotide frequency profiles.
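A minimal sketch of the "just ignore it" behaviour, in case it helps. This is not the actual code in tetra.py: the function name `count_tetramers` and the choice to collapse each 4-mer with its reverse complement are my assumptions for illustration. Any window containing a character other than A, C, G or T is skipped, so no artificial adjacencies are created and no KeyError can occur.

```python
from collections import Counter

# Translation table for complementing unambiguous bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_tetramers(seq, k=4):
    """Count canonical k-mers, skipping any window with a non-ACGT base.

    Hypothetical helper for illustration; not part of MAGpurify.
    """
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing N or any other ambiguous character.
        if set(kmer) - set("ACGT"):
            continue
        # Collapse each k-mer with its reverse complement (assumed convention).
        kmer_rev = kmer.translate(COMPLEMENT)[::-1]
        counts[min(kmer, kmer_rev)] += 1
    return counts
```

With this approach, a sequence such as `GAANTTC` contributes no 4-mers at all, because every window overlaps the N, rather than raising an error.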
I noticed the same. Since scaffolding of metagenome contigs based on paired-end linkage is pretty standard, I would say this is a relatively important bug.