MNP representation in vcf output

migbro commented 5 years ago

I am working on creating consensus calls in which we use lancet as one of our callers. While working on the issue of MNPs, I noticed that lancet includes the leading reference base on all MNP calls, meaning a phased call of CA -> GC would be output as TCA -> TGC. This causes some issues in consensus calling, since Mutect2 and VarDict do not include that leading reference base. It also threw off our benchmarking numbers before this: the truth set likewise omits the leading base, so without an adjustment MNPs were counted as misses when they were in fact hits. Lastly, it also causes a problem with annotation. For instance, an unadjusted call like

11	124440144	GGG	GAA	MODERATE	OR8A1

Would have no SIFT prediction, while after adjustment:

11	124440145	GG	AA	MODERATE	OR8A1

Would have a SIFT annotation of deleterious(0)

I can easily make this adjustment myself, but I am curious as to why lancet represents MNPs this way. I am a fan of the caller by the way, and hope this question helps make it better!
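
For anyone reading along, the adjustment described above can be sketched in a few lines of Python. This is a minimal sketch, not lancet's or any other tool's implementation: the function name `trim_shared_prefix` is hypothetical, and it assumes each record has a single ALT allele and that only same-length substitutions should be trimmed.

```python
def trim_shared_prefix(pos, ref, alt):
    """Drop bases shared at the start of REF and ALT for same-length substitutions.

    Returns (pos, ref, alt) with POS shifted by the number of trimmed bases.
    Records where REF and ALT differ in length (indels) are returned unchanged,
    because VCF requires their anchor base.
    """
    if len(ref) != len(alt):
        return pos, ref, alt
    # Keep at least one base so the alleles never become empty.
    while len(ref) > 1 and ref[0] == alt[0]:
        pos, ref, alt = pos + 1, ref[1:], alt[1:]
    return pos, ref, alt


# The OR8A1 call from this issue, before and after adjustment.
print(trim_shared_prefix(124440144, "GGG", "GAA"))  # (124440145, 'GG', 'AA')
```

Applied to every caller's output before merging, this gives all MNPs the same POS/REF/ALT key regardless of whether the caller padded them.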

gnarzisi commented 4 years ago

Thank you for your feedback. Glad to hear that you are a fan of the tool!

There is no specific reason why we adopted this format, except that we were treating these multi-base mutations similarly to indels, for which the VCF format does require adding the preceding base. I agree with you that it would be more convenient to remove the leading base for those to simplify comparisons to other callers. We will add this feature request to the next release of the software.

I am a little bit confused about the annotation problem that you describe. Since the two representations give rise to the same final sequence, they should be identical for downstream analysis. Annotation software should be smart enough to use the actual sequence rather than just the genomic position, although the SIFT prediction software does not seem to do that in your case.
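
The point that both representations yield the same final sequence can be checked with a small Python sketch. The reference string below is made up for illustration (it is not the real sequence at any locus), and `apply_variant` is a hypothetical helper, not part of lancet or VEP.

```python
def apply_variant(reference, pos, ref, alt):
    """Apply one variant (1-based POS) to a reference string and return the result."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference at POS")
    return reference[:start] + alt + reference[start + len(ref):]


toy_reference = "ATCAG"  # made-up sequence for illustration only

# Padded representation (leading base included) vs. trimmed representation.
padded = apply_variant(toy_reference, 2, "TCA", "TGC")
trimmed = apply_variant(toy_reference, 3, "CA", "GC")

print(padded, trimmed, padded == trimmed)  # ATGCG ATGCG True
```

A tool that compares resulting sequences sees no difference between the two records, while a tool that keys on POS, REF and ALT sees two different variants.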

migbro commented 4 years ago

Hi @gnarzisi, that is true, smarter software should be able to account for that. The situation described above occurred while using Ensembl's Variant Effect Predictor (VEP). It is surprising that this would occur given how long that software has been around. The good news is that, while going through my consensus calling process, the indel normalization step ends up treating those MNPs like indels, and when they are left-aligned, that leading base is removed. I suppose then we could consider this more of a bug in VEP and, for individual users, a warning not to blindly rely on chromosome positions. Your logic makes sense, and I suspected as much given that MNPs are pretty much same-length insertions, so the same mechanism is probably employed. Maybe just a note in the docs/README would be good enough? Thanks for your response!

Update: I missed your comment that you were planning on implementing my suggestion in the next release. Sounds good! Feel free to close this issue now, or close it when the update is made. Thanks again!

gnarzisi commented 4 years ago

Glad to hear that it was an easy fix for you. Closing the ticket but feel free to re-open if needed.

nygenome / lancet

MNP representation in vcf output #44