lgmgeo / liftoverSV

Lifts over a Structural Variation VCF file from one reference build to another.
1 stars 0 forks source link

SVs not lifted #1

Open tomgutman opened 1 month ago

tomgutman commented 1 month ago

Hi,

I tried using your tool which runs perfectly, but none of my SVs are lifted. I'm trying to lift SVs called with NanomonSV tool from nanopore data with T2T reference genome to hg38 to annotate them afterwards with annotSV.

Here is the output after running your tool

Any idea why the SVs are not lifted Over ?

Thanks in advance ! Tom

...unmapped SV (see WM99_CD19_CG_Tumor.nanomonsv.result.lifted.hg38.unmapped for details):
    - 106 SV with one position (start or end) lifted while the other doesn't
    - 0 SV with one position (start or end) mapped to a different chrom from the other
    - 0 SV with "lifted start" > "lifted end"
    - 0 SV with the distance between the two lifted positions that changes significantly (difference between both SVLENs > 5%)

Here are two examples of SVs from Nanomonsv tool :

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TUMOR   CONTROL
14      100136192       r_25    A       <DEL>   .       PASS    END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSSEQ=TATGAGTCTCCGTTTTTCCG        TR:VR   34:14   38:0
1       43651452        r_1_0   T       ]X:99416952]G   .       PASS    SVTYPE=BND;MATEID=r_1_1 TR:VR   37:4    41:1
lgmgeo commented 1 month ago

Thanks for testing liftoverSV.

Please, can you detail your request:

lgmgeo commented 1 month ago

Did you use a CHAIN_FILE without "chr" in the genomic coordinates?

tomgutman commented 1 month ago

Hi thanks for your quick answer

Here is my command line

export PATH=/mnt/beegfs/scratch/t_gutman/nanopore_data/tools:$PATH
LIFTOVERSV=/mnt/beegfs/scratch/t_gutman/nanopore_data/tools/liftoverSV
INPUT_FILE=/mnt/beegfs/scratch/t_gutman/nanopore_data/WM99_vs_T2T/SV_calling/WM99_CD19_CG_Tumor.nanomonsv.result.vcf
CHAIN_FILE=/mnt/beegfs/scratch/t_gutman/nanopore_data/database/chm13v2-grch38.chain
OUTPUT_FILE=/mnt/beegfs/scratch/t_gutman/nanopore_data/WM99_vs_T2T/SV_calling/WM99_CD19_CG_Tumor.nanomonsv.result.lifted.hg38.vcf

$LIFTOVERSV/bin/liftoverSV -I $INPUT_FILE -C $CHAIN_FILE -O $OUTPUT_FILE

The Chain file comes from this github repo : https://github.com/marbl/CHM13

and the chain file contain "chr" which explain I guess why no SV were lifted Over

chain 255 chr1 248387328 + 5618 12693 chr1 248956422 + 260873 267915 0 246 1 0 281 40 0

tomgutman commented 1 month ago

Using the following command awk '{gsub(/chr/,""); print}' chm13v2-grch38.chain >chm13v2-grch38.noChr.chain I created a new chain file and running liftOverSv I obtained the following :

...unmapped SV (see WM99_CD19_CG_Tumor.nanomonsv.result.lifted.hg38.unmapped for details):
    - 36 SV with one position (start or end) lifted while the other doesn't
    - 0 SV with one position (start or end) mapped to a different chrom from the other
    - 4 SV with "lifted start" > "lifted end"
    - 3 SV with the distance between the two lifted positions that changes significantly (difference between both SVLENs > 5%)
lgmgeo commented 1 month ago

the chain file contain "chr" which explain I guess why no SV were lifted Over

Absolutely.

Using the following command awk '{gsub(/chr/,""); print}...

43 SVs are not mapped. 63 SVs are mapped. => I don't know what results are expected with your input file. Does this sound correct to you?

### Regarding your first SV example:

image

END - POS = 100680226 - 100136192 = 544034 => Ok, corresponds to SVLEN

### Online liftover You can have a manual check with the online liftover from the broad institute: https://liftover.broadinstitute.org//#input=14%3A100136192&hg=t2t-to-hg38 image https://liftover.broadinstitute.org//#input=14%3A100680226&hg=t2t-to-hg38 image END - POS = 106359792 - 105864636 = 495156 => The distance between the two lifted positions changes significantly (difference between both SVLENs > 5%) => This SV is unmapped

I have no feedback on this 5% parameter. Perhaps this criterion is too strict? I will add an option so that this threshold can be set by the user

I am really interested in any feedback, recommendation for liftover.

tomgutman commented 1 month ago

i wasn't expecting so much differences between the two reference genome. I am looking into it

Regarding the difference between both SVLENs > 5% :

Here are the records with this flag :

14      100136192       r_25    A       <DEL>   .       PASS    END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSS
EQ=TATGAGTCTCCGTTTTTCCG        the distance between the two lifted positions changes significantly (svlen diff > 5%)
14      99858805        r_24    T       <DUP>   .       Too_low_VAF     END=100049324;SVTYPE=DUP;SVLEN=190519;SVINSLEN=6;
SVINSSEQ=GATTAG        the distance between the two lifted positions changes significantly (svlen diff > 5%)
3       51295313        r_42    T       <DUP>   .       PASS    END=104273107;SVTYPE=DUP;SVLEN=52977794;SVINSLEN=30;SVINS
SEQ=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     the distance between the two lifted positions changes significantly (svlen diff >
5%)

Chr14 is acrocentric and has seen a lot of changes between hg38 and T2T according to the main publication of T2T (Nurk, S. et al. (2022). The complete sequence of a human genome. Science.). The duplication of chromosome 3 seems to span the centromere which could also explain the difference > 5%

It would be nice indeed to have a user define parameter. And in my case a 10% cutoff will not filter theses SVs.

Regarding the 36 SV with one position (start or end) lifted while the other doesn't I looked at a few of them and indeed they seem to fall in newly sequenced regions (centromere, telomere...).

Regarding the 4 SV with "lifted start" > "lifted end"

22      9777730 d_191   A       <DEL>   .       PASS    END=9777919;SVTYPE=DEL;SVLEN=-189       lifted_start (12196812) > lifted_end (12196623)
16      27089218        d_101   A       <INS>   .       PASS    END=27089206;SVTYPE=INS;SVINSLEN=196;SVINSSEQ=TCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTAT      lifted_start (26812465) > lifted_end (26812453)
3       198617703       r_48    T       <DEL>   .       PASS    END=198620350;SVTYPE=DEL;SVLEN=-2647;SVINSLEN=29;SVINSSEQ
=ATTATTTAATTATTTAATAATTATAATTA lifted_start (195749160) > lifted_end (195746511)
3       198620723       r_49    G       <DEL>   .       PASS    END=198639737;SVTYPE=DEL;SVLEN=-19014   lifted_start (195
746138) > lifted_end (195727080)

Finally, I found that the lifted coordinates in the vcf doesn't match the ones found in the online liftover from the broad institute : Is it linked to different Chm13-T2T versions ? For example :

16      27089218        d_101   A       <INS>   .       PASS    END=27089206;SVTYPE=INS;SVINSLEN=196;SVINSSEQ=TCACATAGTTG
TATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATC
ACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTAT      lifted_start (26812465) > lifted_end (26812453)

Coordinates T2T :

Coordiantes hg38 lifted (written in the vcf file)

Coordinates hg38 lifted (from https://liftover.broadinstitute.org//#input=22%3A9777730&hg=t2t-to-hg38)

lgmgeo commented 1 month ago

Regarding the difference between both SVLENs > 5% :

It would be nice indeed to have a user define parameter. And in my case a 10% cutoff will not filter theses SVs.

Done with the liftoverSV version v0.1.2_beta See --PERCENT, P option

In your institute, do you have access to an hg38 BAM/CRAM and a T2T BAM/CRAM for the same sample? It would be nice to check the aligned reads (with IGV for example) for these SVs. And to see if such SV whose SVLEN has increased by 10% is well covered. If this is possible, it would be great to have such a feedback ;o)

Regarding the 4 SV with "lifted start" > "lifted end"

In my opinion, the difference should be due to a different version of the chain file used. What is the version of your chain file (https://github.com/marbl/CHM13)? For example, the chm13v2 is avalaible here: https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/liftOver/hg38-chm13v2.over.chain.gz

tomgutman commented 1 month ago

Hi,

I have indeed access to one sample aligned on hg38 and T2T. Below are two screenshot of IGV windows. I used hg38 as a reference genome here.

In this IGV windows there is 4 tracks (2 top are hg38, 2 bottom are T2T), WM99_CD3 is considered the normal sample and WM99_CD19 is the tumor sample.

On those two screenshots, which shows the two breakpoint of a SV deletion (VCF info below) we can clearly see a structural variant identified in T2T alignement but not in hg38 alignement. Moreover, this event is well covered and has high coverage. For me there is no doubt it's a real event. But I have a hard time understanding precisely why it is specific to T2T. Probably, as mentionned previously, Chr14 is acrocentric and has seen changes in its telomere part between hg38 and T2T

14 100136192 r_25 A DEL . PASS END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSS EQ=TATGAGTCTCCGTTTTTCCG the distance between the two lifted positions changes significantly (svlen diff > 5%)
Capture d’écran 2024-07-24 à 11 26 41 Capture d’écran 2024-07-24 à 11 25 41
tomgutman commented 1 month ago

Here is another example, this time with a duplication. We can see an increased of the coverage at the breakpoints

3 51295313 r_42 T <DUP> . PASS END=104273107;SVTYPE=DUP;SVLEN=52977794;SVINSLEN=30;SVINS SEQ=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA the distance between the two lifted positions changes significantly (svlen diff > 5%)
Capture d’écran 2024-07-24 à 11 40 12 Capture d’écran 2024-07-24 à 11 40 48
tomgutman commented 1 month ago

Finally, i used chm13v2-grch38.chain for the liftOver, which I found here : https://github.com/marbl/CHM13?tab=readme-ov-file

lgmgeo commented 1 month ago

Below are two screenshot of IGV windows. I used hg38 as a reference genome here.

Not sure to understand what you are showing (both on the same hg38 reference genome)

My idea was to look for example the following SV :

tomgutman commented 1 month ago

Hi, following up on this analysis, I'm having a hard time understanding the following variant. I'm wondering if maybe there is a bug with the liftover of the bracket notation

Before liftOverSV : 
10  59997128    r_9_0   A   A]7:83411098]   .   PASS    SVTYPE=BND;MATEID=r_9_1 TR:VR   47:3    41:0
7   83411098    r_9_1   T   T]10:59997128]  .   PASS    SVTYPE=BND;MATEID=r_9_0 TR:VR   47:3    41:0

After liftOverSV : 
10  59143169    r_9_0   A   A]7:83411098]   .   PASS    SVTYPE=BND;MATEID=r_9_1 TR:VR   47:3    41:0
7   82159848    r_9_1   T   T]10:59997128]  .   PASS    SVTYPE=BND;MATEID=r_9_0 TR:VR   47:3    41:0

The coordinates don't match between the two mates after liftOverSV.

What do you think ?

tomgutman commented 1 month ago

For the SV comparison between T2T and hg38 for the same sample :

T2T :

14 100136192 r_25 A DEL . PASS END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSS EQ=TATGAGTCTCCGTTTTTCCG the distance between the two lifted positions changes significantly (svlen diff > 5%)

global :

Capture d’écran 2024-07-31 à 10 34 45

BND left :

Capture d’écran 2024-07-31 à 10 57 16

BND right :

Capture d’écran 2024-07-31 à 10 58 05

hg38 :

14  105864636   r_25    A   <DEL>   .   PASS    END=106359792;SVTYPE=DEL;SVLEN=-495156;SVINSLEN=20;SVINSSEQ=TATGAGTCTCCGTTTTTCCG    TR:VR   34:14   38:0

global :

Capture d’écran 2024-07-31 à 11 37 25

BND left :

Capture d’écran 2024-07-31 à 11 37 46

BND right :

Capture d’écran 2024-07-31 à 11 38 02

For this SV, on chr14, the region is well covered in T2T and in hg38. To note we are in the VDJ region, which goes through a lot of recombination, moreover, my samples are B and T cells, so we can see some recombination events in the B cells

lgmgeo commented 1 month ago

I'm wondering if maybe there is a bug with the liftover of the bracket notation

Absolutely! I was aware of this, as stated at the end of the "Feature requested for future release" section: image

I developed this tool for an analysis in my lab where SVs were not described with square-bracketed notation. But if liftoverSV is used by other users, I can continue the development. It may take some time (I am out of my lab in August) but I will do my best to add this feature as soon as possible.

lgmgeo commented 1 month ago

For the SV comparison between T2T and hg38 for the same sample ...
For this SV, on chr14, the region is well covered in T2T and in hg38. To note we are in the VDJ region, which goes through a lot of recombination, moreover, my samples are B and T cells, so we can see some recombination events in the B cells

Excellent! Thank you for showing this specific example with so much details! I'm absolutely not an expert in cancer and your example in the VDJ region is really interresting.

This is a complex SV (not really a single DEL in my opinion).

Actually, I'm quite surprised with the BND coverage: CD3 (normal), BND left [0-25] CD19 (tumor), BND left [0-18] => looks like an homozygous DEL on the screenshot

CD3 (normal), BND right [0-17] CD19 (tumor), BND right [0-32] => looks like an heterozygous DUP on the screenshot

CD3 (normal), between BNDs [0-43] CD19 (tumor), between BNDs [0-32] => looks like an heterozygous deletion?? not sure, so complex. => We can see the V(D)J recombination (variable–diversity–joining rearrangement), with the multiple alleles. Really nice.

Finally, it looks like you may indeed use a cutoff of 10% for this complex SV.

tomgutman commented 1 month ago

Thanks for all this information,

Why are you suprised by the BND coverage ? You would expect it to be less covered ? For the BND right, how do you determine it is an heterozygous DUP ? I understand that it is a DUP by the bump in the coverage but not the heterozygous part. If it was homozygous DUP you would expect double coverage compared to the other side of the BND ? (

Also you have a very fine analysis of the global V(D)J SV. Would you be keen on explaining how you analyzed the global complex SV ?

Thanks again for your time and effort. It's really helpful !

lgmgeo commented 1 month ago

Hello Tom, Maybe it would be better to talk together directly by phone. I will contact you by email to try to define a time slot. Véronique