marbl / harvest

parsnp calculates genome size wrongly #43

Open Ekie22 opened 5 years ago

Ekie22 commented 5 years ago

Hi, I used parsnp on closed genomes with a known number of plasmids and noticed that the genome size reported in the log file differs from the actual size of the input. Unfortunately this affects the calculated cluster coverage percentage. It seems that the genome size is extended by 310 bp for each plasmid (or contig), i.e. with 4 plasmids the genome size was overestimated by 1240 bp. Is there a reason for this behaviour? And is there a way to switch it off? Thanks
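
To make the arithmetic concrete, here is a minimal sketch of how a fixed per-plasmid extension would shift the reported genome size and the coverage percentage. The 310 bp value is only the observation above and has not been checked against the parsnp source; the genome size and aligned length are made-up numbers, and the function names are hypothetical.

```python
# Assumption: each plasmid adds a fixed 310 bp to the reported genome size,
# as observed above. This is not confirmed against the parsnp source code.
PAD_PER_PLASMID = 310


def reported_genome_size(true_size, n_plasmids, pad=PAD_PER_PLASMID):
    """Genome size as it would appear in the log if each plasmid adds `pad` bp."""
    return true_size + pad * n_plasmids


def coverage_pct(aligned_bp, genome_size):
    """Cluster coverage as a percentage of the given genome size."""
    return 100.0 * aligned_bp / genome_size


true_size = 5_000_000   # hypothetical closed genome, chromosome plus plasmids
aligned_bp = 4_000_000  # hypothetical total length covered by clusters
reported = reported_genome_size(true_size, n_plasmids=4)

print(reported - true_size)  # 1240
print(coverage_pct(aligned_bp, true_size), coverage_pct(aligned_bp, reported))
```

Under these assumptions the coverage computed from the reported size comes out slightly lower than the one computed from the true size.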

karoraw1 commented 5 years ago

I noticed the same thing. I was trying to identify which contigs in my fasta files were aligning within each cluster by reading through the parsnp.xmfa file, when I noticed that many of the coordinates provided were running past the entire length of my assemblies.

I kind of figured out that there is probably some sort of padding being placed between contigs during alignment. There appears to be an offset of about 1 kbp per contig, but I can tell it's inexact.

However, I only see an offset this large with fragmented assemblies. The offsets are shorter for complete reference genomes, which is consistent with the value @Ekie22 gave.

Any info on how to convert the inflated alignment coordinates within each cluster in the xmfa files into exact loci in each fasta file would be great!
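
For what it's worth, if the padding turned out to be a fixed, known length inserted between consecutive contigs, the conversion could look roughly like the sketch below. Everything about the layout here is an assumption, not confirmed parsnp behaviour: that contigs are concatenated in FASTA order, that a constant `pad` sits between them, and that coordinates are 1-based. The function names are hypothetical, and since the offsets observed above are inexact, a real solution would need the actual padding rule.

```python
from bisect import bisect_right


def build_offsets(contig_lengths, pad):
    """0-based start of each contig in the assumed padded concatenation."""
    starts = [0]
    for length in contig_lengths[:-1]:
        starts.append(starts[-1] + length + pad)
    return starts


def to_contig_coord(pos, contig_lengths, pad):
    """Map a 1-based concatenated position to (contig_index, 1-based local position).

    Returns None if the position falls inside padding or past the last contig.
    """
    if pos < 1:
        return None
    starts = build_offsets(contig_lengths, pad)
    zero_based = pos - 1
    # Last contig whose start is <= the position.
    i = bisect_right(starts, zero_based) - 1
    local = zero_based - starts[i]
    if local >= contig_lengths[i]:
        return None  # in the padding after contig i, or beyond the end
    return i, local + 1


# Hypothetical assembly with three contigs and an assumed 310 bp pad.
lengths = [1000, 500, 200]
print(to_contig_coord(1311, lengths, pad=310))  # (1, 1)
print(to_contig_coord(1001, lengths, pad=310))  # None
```

The contig index refers to the order of records in the FASTA file, so the header for each hit can be looked up from that.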