liguowang / CrossMap

CrossMap is a python program to lift over genome coordinates from one genome version to another.
https://crossmap.readthedocs.io/en/latest/
Other
64 stars 23 forks source link

MAF liftover resulting in same reference and alternate allele #68

Open skanwal opened 4 months ago

skanwal commented 4 months ago

Hello,

Thanks for this useful utility. I have data in MAF format (NCBI build 37). I am trying to lift it over to hg38 using the following Crossmap (v0.6.6) command:

CrossMap maf b37ToHg38.over.chain \\ 
PAAD_atlas.tmp.maf \\
/work/genomes/Hsapiens/hg38/seq/hg38.fa hg38 \\
/explore/liftover_maf/PAAD_atlas.liftover.maf \\
--chromid l

Liftover file was downloaded from https://github.com/broadinstitute/gatk/blob/083aac832cb64515fd0456008bf847dd22f6c234/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain

The command runs successfully with following output:

2024-02-26 10:29:50 [INFO]  Read the chain file "/g/data3/gx8/extras/liftover_chains/b37ToHg38.over.chain"
2024-02-26 10:29:51 [INFO]  Lifting over ...
2024-02-26 10:33:58 [INFO]  Total entries: 6630811
2024-02-26 10:33:58 [INFO]  Failed to map: 1372

However, after inspecting the output I have realised that the Reference and Tumor_Seq_Allele2 are both the same in the lifted over maf file. For example, the head of output looks like:

$ head PAAD_atlas.liftover.maf
#liftOver: Program=CrossMapv0.6.6, Time=February26,2024, ChainFile=/g/data3/gx8/extras/liftover_chains/b37ToHg38.over.chain, NewRefGenome=/g/data3/gx8/local/development/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
Hugo_Symbol sample_id   Hugo_Symbol NCBI_Build  Chromosome  Start_Position  End_Position    Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele2   Tumor_Sample_Barcode    HGVSp_Short aa_mutation
1   Avner-primary_tissue_subset FAM231B hg38    chr1    16539492    16539492    Missense_Mutation   SNP C   C   p010_tumor-52fccd-somatic.pcgr.vcf  p.R143C NA
2   Avner-primary_tissue_subset ZMYM4   hg38    chr1    35389029    35389029    Nonsense_Mutation   SNP G   G   p010_tumor-52fccd-somatic.pcgr.vcf  p.E795* NA
3   Avner-primary_tissue_subset COL8A2  hg38    chr1    36099236    36099236    Nonsense_Mutation   SNP G   G   p010_tumor-52fccd-somatic.pcgr.vcf  p.R149* NA
4   Avner-primary_tissue_subset PTGER3  hg38    chr1    70953763    70953763    Missense_Mutation   SNP T   T   p010_tumor-52fccd-somatic.pcgr.vcf  p.Q368H NA
5   Avner-primary_tissue_subset C1orf52 hg38    chr1    85259561    85259561    Missense_Mutation   SNP C   C   p010_tumor-52fccd-somatic.pcgr.vcf  p.E25K  NA
6   Avner-primary_tissue_subset AMY2A   hg38    chr1    103617550   103617550   Missense_Mutation   SNP T   T   p010_tumor-52fccd-somatic.pcgr.vcf  p.V37D  NA
7   Avner-primary_tissue_subset TNR hg38    chr1    175391305   175391305   Missense_Mutation   SNP G   G   p010_tumor-52fccd-somatic.pcgr.vcf  p.S497L NA
8   Avner-primary_tissue_subset LAMC2   hg38    chr1    183218424   183218424   Missense_Mutation   SNP G   G   p010_tumor-52fccd-somatic.pcgr.vcf  p.A147T NA

In comparison, the head of original (genome build 37) file is:

$ head PAAD_atlas.tmp.maf
Hugo_Symbol sample_id   Hugo_Symbol NCBI_Build  Chromosome  Start_Position  End_Position    Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele2   Tumor_Sample_Barcode    HGVSp_Short aa_mutation
1   Avner-primary_tissue_subset FAM231B 37  1   16865987    16865987    Missense_Mutation   SNP C   T   p010_tumor-52fccd-somatic.pcgr.vcf  p.R143C NA
2   Avner-primary_tissue_subset ZMYM4   37  1   35854630    35854630    Nonsense_Mutation   SNP G   T   p010_tumor-52fccd-somatic.pcgr.vcf  p.E795* NA
3   Avner-primary_tissue_subset COL8A2  37  1   36564837    36564837    Nonsense_Mutation   SNP G   A   p010_tumor-52fccd-somatic.pcgr.vcf  p.R149* NA
4   Avner-primary_tissue_subset PTGER3  37  1   71419446    71419446    Missense_Mutation   SNP T   G   p010_tumor-52fccd-somatic.pcgr.vcf  p.Q368H NA
5   Avner-primary_tissue_subset C1orf52 37  1   85725244    85725244    Missense_Mutation   SNP C   T   p010_tumor-52fccd-somatic.pcgr.vcf  p.E25K  NA
6   Avner-primary_tissue_subset AMY2A   37  1   104160172   104160172   Missense_Mutation   SNP T   A   p010_tumor-52fccd-somatic.pcgr.vcf  p.V37D  NA
7   Avner-primary_tissue_subset TNR 37  1   175360441   175360441   Missense_Mutation   SNP G   A   p010_tumor-52fccd-somatic.pcgr.vcf  p.S497L NA
8   Avner-primary_tissue_subset LAMC2   37  1   183187559   183187559   Missense_Mutation   SNP G   A   p010_tumor-52fccd-somatic.pcgr.vcf  p.A147T NA
9   Avner-primary_tissue_subset OBSCN   37  1   228434396   228434396   Missense_Mutation   SNP G   A   p010_tumor-52fccd-somatic.pcgr.vcf  p.A1401T    NA

It seems the program is updating both reference and alternate alleles. Can you please help me debug the issue? Thanks.

liguowang commented 4 months ago

Hello,

The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover. I checked, all the reference alleles in your "PAAD_atlas.liftover.maf" file are indeed matched to the sequence of hg38/GRCh38.

This probably indicates the somatic variants in your "PAAD_atlas.tmp.maf" file are NOT real mutations.

Hope this helps.

Liguo

skanwal commented 4 months ago

Thanks for the response, Liguo.

The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover.

That part makes sense. What I am unsure is why the Tumor_Seq_Allele2 has been updated? For example for row one (Hugo_Symbol - FAM231B) - Tumor_Seq_Allele2 has been updated to C as compared to T in my original file. Should Tumor_Seq_Allele2 still be T as it wouldn't have been updated as part of liftover?

I am finding it hard to understand how so many calls in my "PAAD_atlas.tmp.maf" are not real.

I have done MAF analysis on my v37 data and the SNV numbers are as below:

Screen Shot 2024-02-27 at 6 08 54 am

Carrying the same analysis on liftover (hg38) MAF, I get few to none SNV calls:

Screen Shot 2024-02-27 at 6 12 57 am

Which suggests I am losing 98% of variants after liftover?

liguowang commented 4 months ago

Oh, I know what happened!! It looks like a bug. I will update and release a new version soon. Thanks

liguowang commented 4 months ago

Thanks for the response, Liguo.

The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover.

That part makes sense. What I am unsure is why the Tumor_Seq_Allele2 has been updated? For example for row one (Hugo_Symbol - FAM231B) - Tumor_Seq_Allele2 has been updated to C as compared to T in my original file. Should Tumor_Seq_Allele2 still be T as it wouldn't have been updated as part of liftover?

I am finding it hard to understand how so many calls in my "PAAD_atlas.tmp.maf" are not real.

I have done MAF analysis on my v37 data and the SNV numbers are as below:

Screen Shot 2024-02-27 at 6 08 54 am

Carrying the same analysis on liftover (hg38) MAF, I get few to none SNV calls: Screen Shot 2024-02-27 at 6 12 57 am

Which suggests I am losing 98% of variants after liftover?

After read the MAF specificaion again, I think the original code is still correct. According to https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/, the Ref_Allele is in the 11th column, while your input file has the Ref_Allele in the 10th column, please double check.

skanwal commented 4 months ago

Many thanks @liguowang - updating order of columns in my original data has resolved the issue.

Can I ask another question - how is CrossMap handling liftover of insertion in MAF. For example an entry in v37 MAF was:

sample_id   Hugo_Symbol NCBI_Build  Chromosome  Start_Position  End_Position    Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele2   Tumor_Sample_Barcode    HGVSp_Short aa_mutation
KRAS-wt_Croagh_subset   LINC00383   37  13  69791655    69791656    5'Flank INS -   T   MON3__PRJ180359_MON3-T-somatic.pcgr_acmg.grch37.vcf     NA

After liftover to hg38 it becomes:

sample_id   Hugo_Symbol NCBI_Build  Chromosome  Start_Position  End_Position    Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele2   Tumor_Sample_Barcode    HGVSp_Short aa_mutation
KRAS-wt_Croagh_subset   LINC00383   hg38    chr13   69217523    69217524    5'Flank INS AT  T   MON3__PRJ180359_MON3-T-somatic.pcgr_acmg.grch37.vcf     NA

Can you please explain the rationale behind updating Reference_Allele from - to AT in this case or in general changing from - to actual nucleotides on that position for INS?

liguowang commented 4 months ago

"-/T" is NOT the standard way to represent an insertion (at least according to VCF's specification; maybe MAF has its own rule, which I am not sure).

"AT/T" indicates an "A" was inserted into the REF genome.

skanwal commented 4 months ago

"-/T" is NOT the standard way to represent an insertion (at least according to VCF's specification; maybe MAF has its own rule, which I am not sure).

Indeed MAF has this rule to represent insertion using -. From https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/:

11 - Reference_Allele | The plus strand reference allele at this position. Includes the deleted sequence for a deletion or "-" for an insertion

"AT/T" indicates an "A" was inserted into the REF genome.

According to VCF spec, this would indicate a deletion i.e. A from the reference was deleted in the alternate allele.

Based on this, would it make sense to stick to MAF spec and indicate INSERTIONS using - in the Reference_Allele column in the output of CrossMap maf module?