schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

VCF file duplicates question #131

Closed ahishsujay closed 2 years ago

ahishsujay commented 2 years ago

Hi (again) Manish!

While I went ahead and used bcftools merge to "merge" the 40 VCFs I have, I was creating a file to count the frequency of these "merged" variants. For example, if I have 3 samples (sampleA, sampleB, sampleC) and variant1 is present in sampleA and sampleB but not in sampleC, I set the value of variant1 to 1 for sampleA and sampleB and to 0 for sampleC. I do the same for variant2, variant3 and so on... I hope that makes sense.
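The counting step described above can be sketched roughly as follows. This is only a minimal sketch, assuming the merged VCF has one column per sample after FORMAT, that GT is the first FORMAT key, and that a missing or all-reference genotype means "not present". The function name `presence_matrix` and the example records are made up for illustration and are not part of SyRI or bcftools.

```python
import re

def presence_matrix(vcf_lines):
    """Return (samples, rows); each row is (chrom, pos, ref, alt, [0/1 per sample])."""
    samples, rows = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        flags = []
        for sample_field in fields[9:]:
            gt = sample_field.split(":")[0]  # assumption: GT is the first FORMAT key
            alleles = re.split(r"[/|]", gt)
            # 1 if any called allele is non-reference, 0 for missing or reference-only
            flags.append(int(any(a not in (".", "0") for a in alleles)))
        rows.append((chrom, int(pos), ref, alt, flags))
    return samples, rows

# Invented records, for illustration only
example = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\tsampleC",
    "chr1\t100\tvariant1\tC\tT\t.\tPASS\t.\tGT\t1/1\t1/1\t./.",
    "chr1\t200\tvariant2\tG\tA\t.\tPASS\t.\tGT\t./.\t./.\t1/1",
]
samples, rows = presence_matrix(example)
for chrom, pos, ref, alt, flags in rows:
    print(chrom, pos, ref, alt, dict(zip(samples, flags)), "count =", sum(flags))
```

Summing each row's flags gives the per-variant frequency across samples.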

While doing this, I noticed that the SyRI VCF files contain multiple entries for the same position. For example, this is the .out output:

chr1 | 31259 | 31259 | C | T | chr1 | 3144 | 3144 | SNP68467 | TRANS2319 | SNP
chr1 | 31259 | 31259 | C | T | chr1 | 3145 | 3145 | SNP68468 | TRANS2319 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 3144 | 3144 | SNP69785 | SYN64 | SNP 

The reference position of the SNP is the same in all three entries, but the query positions differ. Since the VCF is based on the reference, it lists the same coordinates three times. Could I have some insight on this? Is this happening because the SNP is present at (slightly) different positions in the query coordinates and in different annotation blocks (TRANS and SYN respectively)? Would you recommend treating this as a single occurrence while I perform the above-mentioned downstream processing?

Thanks! Very much appreciated.

Ahish

mnshgl0110 commented 2 years ago

Hi Ahish,

SNPs that are part of different annotation blocks are reported separately, even when they are at the same genomic locus. That is why there are separate entries for SNP68467 and SNP69785. But SNP68468 is a bit unusual. Is it the case that TRANS2319 consists of >1 alignments with breakpoints at Ref:chr1:31259? Something like below could explain this:

[Image: potential neighboring alignments of TRANS2319, with overlap at Ref:chr1:31259.]

Checking the alignments would be the easiest way to understand what is happening exactly here.

Would you recommend treating this as a single occurrence while I perform the above mentioned downstream processing?

Yes, pragmatically it would be OK to consider this as one mutation. However, it is also possible that one reference (or query) coordinate is annotated as a SNP at two distal coordinates (for example, one SNP in a syntenic region and another SNP in an inter-chromosomal translocation) having different mutated alleles. To genotype such regions, you would need to decide whether the focus is only on local variation (within syntenic regions) or on all variation (distal SRs as well).

I do not have a simple answer on what the ideal strategy would be, but I hope this helps you in some way.

Best Manish
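For the downstream counting, the "treat it as one mutation" suggestion could look roughly like the sketch below. It groups records by reference locus and allele change, and separately flags loci where the same reference coordinate carries different mutated alleles (the caveat mentioned above). It assumes the column order of the records pasted in the first post (reference chr/start/end, reference and query sequence, query chr/start/end, ID, parent, type) and accepts either tab-separated lines or the " | " layout shown in this thread. The function names are hypothetical.

```python
import re
from collections import defaultdict

def collapse_by_ref_locus(out_lines):
    """Group SyRI-style records by (ref chr, start, end, ref allele, query allele)."""
    groups = defaultdict(list)
    for line in out_lines:
        line = line.strip()
        if not line:
            continue
        # accept tab-separated files and the " | " layout pasted in this thread
        f = re.split(r"\s*\|\s*|\t", line)
        key = (f[0], int(f[1]), int(f[2]), f[3], f[4])
        groups[key].append((f[8], f[9]))  # (variant ID, parent annotation block)
    return dict(groups)

def loci_with_conflicting_alleles(groups):
    """Return reference loci that are annotated with more than one mutated allele."""
    alts = defaultdict(set)
    for chrom, start, end, ref, alt in groups:
        alts[(chrom, start, end, ref)].add(alt)
    return {locus: a for locus, a in alts.items() if len(a) > 1}

# The three records from the first post in this thread
records = [
    "chr1 | 31259 | 31259 | C | T | chr1 | 3144 | 3144 | SNP68467 | TRANS2319 | SNP",
    "chr1 | 31259 | 31259 | C | T | chr1 | 3145 | 3145 | SNP68468 | TRANS2319 | SNP ",
    "chr1 | 31259 | 31259 | C | T | chr1 | 3144 | 3144 | SNP69785 | SYN64 | SNP ",
]
groups = collapse_by_ref_locus(records)
for key, members in groups.items():
    print(key, "reported", len(members), "times:", members)
print("conflicting loci:", loci_with_conflicting_alleles(groups))
```

Keeping the parent block of every collapsed record makes it possible to later restrict the analysis to syntenic regions only, if that turns out to be the better strategy.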

ahishsujay commented 2 years ago

Hi Manish,

Thanks for your quick reply. On further analysis, I realized I had omitted a few lines for this SNP and messed up the query co-ordinates while copy-pasting, sorry. Here's the complete output for this SNP:

chr1 | 31259 | 31259 | C | T | chr1 | 31449717 | 31449717 | SNP68467 | TRANS2319 | SNP
chr1 | 31259 | 31259 | C | T | chr1 | 31453833 | 31453833 | SNP68468 | TRANS2319 | SNP
chr1 | 31259 | 31259 | C | T | chr1 | 31457178 | 31457178 | SNP68469 | TRANS2319 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 31449717 | 31449717 | SNP69785 | SYN64 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 31453833 | 31453833 | SNP69786 | SYN64 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 31457178 | 31457178 | SNP69787 | SYN64 | SNP

Looking at this, my understanding is that the SNP is at the same genomic locus but is part of different annotation blocks, and is therefore reported separately. Is that all there is to it? If that's not the case, I can definitely take a look at the alignments of this region, and of other regions where this is happening, to make better sense of it.

Thanks! Ahish

mnshgl0110 commented 2 years ago

I would guess that the coordinates that you have commented above are incorrect as well :)

I would be very surprised if there are three SNPs at the same locus in a syntenic region.

ahishsujay commented 2 years ago

Oops. I just double-checked by looking at the VCF file and the .out file, and the above seems to be correct.

VCF file:

chr1    31259   SNP68467    C   T   .   PASS    END=31259;ChrB=chr1;StartB=31449717;EndB=31449717;Parent=TRANS2319;VarType=ShV;DupType=.
chr1    31259   SNP68468    C   T   .   PASS    END=31259;ChrB=chr1;StartB=31453833;EndB=31453833;Parent=TRANS2319;VarType=ShV;DupType=.
chr1    31259   SNP68469    C   T   .   PASS    END=31259;ChrB=chr1;StartB=31457178;EndB=31457178;Parent=TRANS2319;VarType=ShV;DupType=.
chr1    31259   SNP69785    C   T   .   PASS    END=31259;ChrB=chr1;StartB=31449717;EndB=31449717;Parent=SYN64;VarType=ShV;DupType=.
chr1    31259   SNP69786    C   T   .   PASS    END=31259;ChrB=chr1;StartB=31453833;EndB=31453833;Parent=SYN64;VarType=ShV;DupType=.
chr1    31259   SNP69787    C   T   .   PASS    END=31259;ChrB=chr1;StartB=31457178;EndB=31457178;Parent=SYN64;VarType=ShV;DupType=.

.out file:

chr1 | 31259 | 31259 | C | T | chr1 | 31449717 | 31449717 | SNP68467 | TRANS2319 | SNP
chr1 | 31259 | 31259 | C | T | chr1 | 31453833 | 31453833 | SNP68468 | TRANS2319 | SNP
chr1 | 31259 | 31259 | C | T | chr1 | 31457178 | 31457178 | SNP68469 | TRANS2319 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 31449717 | 31449717 | SNP69785 | SYN64 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 31453833 | 31453833 | SNP69786 | SYN64 | SNP 
chr1 | 31259 | 31259 | C | T | chr1 | 31457178 | 31457178 | SNP69787 | SYN64 | SNP

mnshgl0110 commented 2 years ago

Hmmm.. well, I am indeed surprised. For one, this means that there is a 31 Mb+ region at the beginning of the query genome that is not syntenic, implying that the genomes (or at least the assemblies) are very different. The SNP coordinates on the query are exactly the same, which is also weird.

Have you checked the alignments? Is it the case that the loci in the query genome are covered by different alignments?
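One way to check this could be to list every alignment that covers the reference position. The sketch below assumes the whole-genome alignments are available in PAF format with the reference as the target sequence (PAF columns 6 to 9); if the alignments are in SAM/BAM or another format, the parsing would differ. The helper name `alignments_covering` and the example records are invented for illustration.

```python
def alignments_covering(paf_lines, chrom, pos):
    """Return PAF alignments whose target interval contains 1-based position `pos`."""
    hits = []
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # PAF has 12 mandatory columns; skip anything shorter
        qname, qstart, qend, strand = f[0], int(f[2]), int(f[3]), f[4]
        tname, tstart, tend = f[5], int(f[7]), int(f[8])
        # PAF intervals are 0-based and half-open; VCF/SyRI positions are 1-based
        if tname == chrom and tstart < pos <= tend:
            hits.append((qname, qstart, qend, strand, tstart, tend))
    return hits

# Invented alignments, for illustration only
paf = [
    "chr1\t100000\t20000\t20500\t+\tchr1\t100000\t4500\t5000\t495\t500\t60",
    "chr1\t100000\t30000\t31001\t+\tchr1\t100000\t4999\t6000\t990\t1001\t60",
    "chr1\t100000\t40000\t42000\t+\tchr1\t100000\t5000\t7000\t1980\t2000\t60",
    "chr2\t100000\t100\t2100\t+\tchr2\t100000\t4000\t6000\t1990\t2000\t60",
]
for hit in alignments_covering(paf, "chr1", 5000):
    print(hit)
```

If more than one alignment is returned for a locus, the same reference base is covered several times, which would be consistent with several records at one reference coordinate.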

How many such cases are there? SNP number 68467 (SNP68467) is at Chr1:31259. Are all of the initial SNPs (SNP1..SNP68466) on chr1 only or other chromosomes?
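Counting such cases could be done along these lines. This is a sketch that assumes the 8-column record layout and the INFO keys (`Parent`, `StartB`) visible in the VCF lines pasted earlier in this thread, and that no field contains whitespace. `parse_info` and `repeated_loci` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def parse_info(info):
    """Turn 'K=V;K=V' into a dict (keys without '=' map to True)."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def repeated_loci(vcf_lines):
    """Return {(chrom, pos): [(ID, Parent, StartB), ...]} for loci with more than one record."""
    by_locus = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()  # whitespace split: assumes no spaces inside any field
        info = parse_info(f[7])
        by_locus[(f[0], int(f[1]))].append((f[2], info.get("Parent"), info.get("StartB")))
    return {locus: recs for locus, recs in by_locus.items() if len(recs) > 1}

# Records as pasted earlier in this thread
info_tpl = "END=31259;ChrB=chr1;StartB={sb};EndB={sb};Parent={p};VarType=ShV;DupType=."
vcf = [
    "\t".join(["chr1", "31259", vid, "C", "T", ".", "PASS", info_tpl.format(sb=sb, p=p)])
    for vid, sb, p in [
        ("SNP68467", 31449717, "TRANS2319"),
        ("SNP68468", 31453833, "TRANS2319"),
        ("SNP68469", 31457178, "TRANS2319"),
        ("SNP69785", 31449717, "SYN64"),
        ("SNP69786", 31453833, "SYN64"),
        ("SNP69787", 31457178, "SYN64"),
    ]
]
repeats = repeated_loci(vcf)
print("loci with more than one record:", len(repeats))
for locus, recs in repeats.items():
    print(locus, Counter(parent for _, parent, _ in recs))
```

A `Counter` over the first column of all records would similarly show which chromosomes the earlier SNP IDs fall on.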

My wild guess is that chr1:31259 is incomplete and should actually be something like chr1:31259XXX, so that the syntenic regions make more sense. This could be a bug. Could you please share the syri.out file? I would like to check it.

ahishsujay commented 2 years ago

Hi Manish,

Extremely sorry for getting back so late! Unfortunately I cannot share the syri.out file as it is sensitive information, I'm sorry. Also, you are spot on. I (again) made an error while copy-pasting and editing the block, so it's in fact chr1:31259XXX. Away from my laptop right now and don't remember the exact coordinates. I would need to check on how many such cases there are, and will follow up soon.