philipwfowler / snpit

Whole-genome SNP based identification of members of the Mycobacterium tuberculosis complex
MIT License

Discrepancies between experimental validation and snpit prediction #12

Open codemeleon opened 4 years ago

codemeleon commented 4 years ago

Hi,

We have constructed genomes for our 18 Mtb samples by mapping PacBio sequencing data to the Mtb reference H37Rv using SMRT Link v 2.3.0.140936. The lineages of these samples have been experimentally validated: three samples belong to lineage-1, three to lineage-2, and the remaining twelve to lineage-4.

We also have some additional samples whose lineages have not been experimentally validated. Before reporting lineages for these new samples, we decided to run snpit on our experimentally validated samples. snpit assigned fifteen samples to lineage-4 and one sample to lineage-3, and it could not report a lineage for the remaining two. A maximum-likelihood phylogeny based on the whole-genome alignment shows the samples clustering according to their experimentally validated lineages.

I do not know the cause of the discrepancies between the experimentally validated lineages and the snpit predictions. Please advise on what I might be doing wrong.

Thank you.

mbhall88 commented 4 years ago

Hi @codemeleon,

What did you use as input to snpit?

codemeleon commented 4 years ago

Genomes constructed by mapping PacBio sequencing data to the Mtb reference H37Rv using SMRT Link v 2.3.0.140936

mbhall88 commented 4 years ago

So your input was a fasta file?

@philipwfowler how robust is the fasta methodology? I have only really tested the VCF input method.

codemeleon commented 4 years ago

Yes

samlipworth commented 4 years ago

Hi @codemeleon, sorry for the slow response. There could be a few reasons for this. If you can share one of the fastas that caused a problem, I'd be happy to take a look.