plagnollab / DNASeq_pipeline

Pipeline in place at the UGI for DNA level analysis

reformatting of VEP output for R and Excel #17

Closed pontikos closed 9 years ago

pontikos commented 9 years ago

We should explore whether it might be better to use:

  • the --pick option, to give one effect per variant;
  • the --fields option, to restrict the output to a chosen comma-separated list of fields.

A script to perform allele specific analyses (e.g. G>A:X, G>T:Y, etc.) will still be required afterwards, but this way some of the other tasks are taken over by the VEP.

pontikos commented 9 years ago

We have decided with @APLevine that we will keep all transcripts on one line but remove the information which is not relevant. http://www.ensembl.org/info/docs/tools/vep/vep_formats.html

There is also the --most_severe option, which only keeps the most severe consequence per variant (transcript-specific columns are then left blank). http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html

pontikos commented 9 years ago

Please see example output here: /cluster/project8/IBDAJE/Annotation/chr22-esp-exac.vcfout

## ENSEMBL VARIANT EFFECT PREDICTOR v78
## Output produced at 2015-01-26 11:34:42
## Connected to homo_sapiens_core_78_38 on ensembldb.ensembl.org
## Using cache in /cluster/project8/vyp/AdamLevine/software/ensembl/cache//homo_sapiens/78_GRCh38
## Using API version 78, DB version 78
## sift version sift5.0.2
## polyphen version 2.2.2
## Extra column keys:
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## SYMBOL : Gene symbol (e.g. HGNC)
## SYMBOL_SOURCE : Source of gene symbol
## HGNC_ID : Stable identifer of HGNC gene symbol
## CANONICAL : Indicates if transcript is canonical for this gene
## SIFT : SIFT prediction and/or score
## PolyPhen : PolyPhen prediction and/or score
## GMAF : Minor allele and frequency of existing variant in 1000 Genomes Phase 1 combined population
## AFR_MAF : Frequency of existing variant in 1000 Genomes Phase 1 combined African population
## AMR_MAF : Frequency of existing variant in 1000 Genomes Phase 1 combined American population
## ASN_MAF : Frequency of existing variant in 1000 Genomes Phase 1 combined Asian population
## EUR_MAF : Frequency of existing variant in 1000 Genomes Phase 1 combined European population
## AA_MAF : Frequency of existing variant in NHLBI-ESP African American population
## EA_MAF : Frequency of existing variant in NHLBI-ESP European American population
## CLIN_SIG : Clinical significance of variant from dbSNP
## SOMATIC : Somatic status of existing variant
## CAROL        : Combined Annotation scoRing toOL prediction
## Condel       : Consensus deleteriousness score for an amino acid substitution based on SIFT and PolyPhen-2
## EXAC_AFR : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_AFR.vcf.gz (exact)
## EXAC_AMR : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_AMR.vcf.gz (exact)
## EXAC_Adj : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_Adj.vcf.gz (exact)
## EXAC_EAS : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_EAS.vcf.gz (exact)
## EXAC_FIN : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_FIN.vcf.gz (exact)
## EXAC_NFE : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_NFE.vcf.gz (exact)
## EXAC_OTH : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_OTH.vcf.gz (exact)
## EXAC_SAS : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/ExAC/0.3/chr22_SAS.vcf.gz (exact)
## 1KG_EUR : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/1kg/chr22_EUR.vcf.gz (exact)
## 1KG_AFR : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/1kg/chr22_AFR.vcf.gz (exact)
## 1KG_AMR : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/1kg/chr22_AMR.vcf.gz (exact)
## 1KG_ASN : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/1kg/chr22_ASN.vcf.gz (exact)
## ESP_EA : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/esp/chr22_EA.vcf.gz (exact)
## ESP_AA : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/esp/chr22_AA.vcf.gz (exact)
## UCLEX : /cluster/project8/IBDAJE/VEP_custom_annotations/hg38_noAlt/UCLex/chr22.vcf.gz (exact)
#Uploaded_variation     Location        Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids  Codons  Existing_variation      Extra
22_11065978_C/T 22:11065978     T       ENSG00000280363 ENST00000623473 Transcript      missense_variant        50      50      17      R/Q     cGg/cAg -   STRAND=-1;SYMBOL=CU104787.1;SYMBOL_SOURCE=Clone_based_ensembl_gene;CANONICAL=YES;PolyPhen=unknown(0)
22_11065979_G/A 22:11065979     A       ENSG00000280363 ENST00000623473 Transcript      missense_variant        49      49      17      R/W     Cgg/Tgg -   STRAND=-1;SYMBOL=CU104787.1;SYMBOL_SOURCE=Clone_based_ensembl_gene;CANONICAL=YES;PolyPhen=unknown(0)
22_11065984_C/T 22:11065984     T       ENSG00000280363 ENST00000623473 Transcript      missense_variant        44      44      15      R/Q     cGa/cAa -   STRAND=-1;SYMBOL=CU104787.1;SYMBOL_SOURCE=Clone_based_ensembl_gene;CANONICAL=YES;PolyPhen=unknown(0)
22_11065994_C/T 22:11065994     T       ENSG00000280363 ENST00000623473 Transcript      missense_variant        34      34      12      E/K     Gag/Aag -   STRAND=-1;SYMBOL=CU104787.1;SYMBOL_SOURCE=Clone_based_ensembl_gene;CANONICAL=YES;PolyPhen=unknown(0)
22_11065995_G/A 22:11065995     A       ENSG00000280363 ENST00000623473 Transcript      synonymous_variant      33      33      11      T       acC/acT -   STRAND=-1;SYMBOL=CU104787.1;SYMBOL_SOURCE=Clone_based_ensembl_gene;CANONICAL=YES
22_11066503_G/A 22:11066503     A       ENSG00000279973 ENST00000624155 Transcript      initiator_codon_variant 3       3       1       M/I     atG/atA -   Condel=neutral(0.450);CAROL=Deleterious(0.999);STRAND=1;SYMBOL=BAGE5;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:15732;CANONICAL=YES;SIFT=deleterious(0);PolyPhen=benign(0.072)
22_11068023_A/G 22:11068023     G       ENSG00000279973 ENST00000624155 Transcript      synonymous_variant      54      54      18      R       agA/agG -   STRAND=1;SYMBOL=BAGE5;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:15732;CANONICAL=YES

I am going to work on the formatting now.
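
For reference, a minimal sketch of how the Extra column in the output above could be unpacked into key/value pairs. The function name `parse_extra` is made up for illustration and is not part of processVEP.py; it assumes the `KEY=VALUE;KEY=VALUE` layout shown in the example rows.

```python
def parse_extra(extra):
    """Turn a VEP 'Extra' column such as 'STRAND=-1;SYMBOL=BAGE5' into a dict.

    Hypothetical helper: assumes ';' separates pairs and the first '='
    separates key from value, as in the example rows above.
    """
    fields = {}
    if extra in ("", "-"):
        return fields
    for item in extra.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


example = "STRAND=1;SYMBOL=BAGE5;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:15732;CANONICAL=YES"
extra = parse_extra(example)
print(extra["SYMBOL"], extra["HGNC_ID"])
```

Keys that are absent for a given row (e.g. SIFT on a synonymous variant) are simply missing from the dict, so `extra.get("SIFT")` is probably the safer way to read them.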

pontikos commented 9 years ago

@APLevine please have a look at /cluster/project8/IBDAJE/Annotation/GRCh37/processed-chr22-esp-exac.vcfout

I still need to split multiple alt alleles onto separate lines.

APLevine commented 9 years ago

Nikolas,

That looks good.

I think it is easier if you split the multiple alternate alleles before passing the data to VEP.

It seems that some fields are missing. According to the VEP, for each output you should have the following in addition to the custom annotation: "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|CANONICAL|SIFT|PolyPhen|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|CAROL|Condel"

Is there a reason why these are not all in the output?

You also have not included the genotype data. That is okay for now.

Adam
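
A minimal sketch of how a CSQ string could be mapped onto the field list quoted above, which makes it easy to check which fields are present. The names `CSQ_FIELDS` and `parse_csq` are hypothetical, and the sketch assumes that ',' separates entries and '|' separates fields, with no escaped separators inside values.

```python
# Field order as quoted in the comment above (assumed to match the CSQ header).
CSQ_FIELDS = (
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|"
    "Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|"
    "SYMBOL|SYMBOL_SOURCE|HGNC_ID|CANONICAL|SIFT|PolyPhen|GMAF|AFR_MAF|AMR_MAF|"
    "ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|CAROL|Condel"
).split("|")


def parse_csq(csq):
    """Return one dict per CSQ entry (one entry per allele/transcript pair)."""
    records = []
    for entry in csq.split(","):
        values = entry.split("|")
        # zip() stops at the shorter sequence, so short entries do not raise.
        records.append(dict(zip(CSQ_FIELDS, values)))
    return records


csq = (
    "G|ENSG00000198033|ENST00000400113|Transcript|synonymous_variant|237|132|44|"
    "G|ggT/ggC|rs36215075||-1|TUBA3C|HGNC|HGNC:12408|YES|||G:0.0303"
)
record = parse_csq(csq)[0]
print(record["SYMBOL"], record["CANONICAL"], record["GMAF"])
```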

pontikos commented 9 years ago

What happens when the CSQ allele is "-"?

22 19954916 . AT A,ATT 16695.15 . BaseQRankSum=-3.450e-01;ClippingRankSum=0.081;DP=11501;FS=0.746;InbreedingCoeff=-0.2665;MLEAC=56,39;MLEAF=0.133,0.092;MQ=70.00;MQ0=0;MQRankSum=0.111;QD=3.87;ReadPosRankSum=0.234;CSQ=-|ENSG00000093010|ENST00000428707|Transcript|frameshift_variant|424|426|142|H/X|caT/ca|||1|COMT|HGNC|2228||||||||||||||,TT|ENSG00000093010|ENST00000428707|Transcript|frameshift_variant|424|426|142|H/HX|caT/caTT|||1|COMT|HGNC|2228||||||||||||||

pontikos commented 9 years ago

Samples have now been added.

What functionality is still missing? I still need to calculate the freq within the samples?

APLevine commented 9 years ago

Yes, perhaps that should be an optional additional function or could be done by another script that is used afterwards. Have a look at the groups part of my script.

Adam

APLevine commented 9 years ago

That is the VEP recoding the indels.

Rather than AT>A it is describing it as T>-. Similarly for the other option rather than AT>ATT it is describing it as T>TT.

Adam
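
A minimal sketch of the recoding described here, so that the CSQ allele can be matched back to the VCF ALT allele. This only approximates the behaviour seen in this thread (strip the shared first base when the site is an indel) and is not taken from the VEP source; `vep_style_alleles` is a made-up name.

```python
def vep_style_alleles(ref, alts):
    """Recode REF/ALT the way the CSQ allele appears in this thread.

    Assumption: when the record contains an indel and every allele starts
    with the same base, that base is dropped; an allele that becomes empty
    is written as '-'.
    """
    alleles = [ref] + list(alts)
    is_indel = any(len(a) != len(ref) for a in alts)
    if is_indel and all(a[:1] == ref[:1] for a in alleles):
        alleles = [a[1:] or "-" for a in alleles]
    return alleles[0], alleles[1:]


print(vep_style_alleles("AT", ["A", "ATT"]))  # ('T', ['-', 'TT'])
print(vep_style_alleles("C", ["T"]))          # ('C', ['T'])
```

Building a lookup from the recoded allele to the original ALT would then let the script attach each CSQ entry to the right alternate allele.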

pontikos commented 9 years ago

OK, so for indels we don't have any frequency information from ExAC, correct?

pontikos commented 9 years ago

Having spoken to @vplagnol yesterday, we decided we would like 3 files to be output from the VEP VCF output:

1) Variants and their annotations, including their frequencies. This should probably also include the frequencies in the samples we have annotated, for filtering.
2) Genotypes of individuals at those variants, recoded as 0, 1 or 2 for the number of alternative alleles at this locus. If the genotype is missing at this locus, i.e. GT="./.", it is recoded as NA. This file should be loadable into R.
3) The allele depths and the total depths; again, this might be useful for filtering.

All 3 files are indexed by a variant id (VID) which looks like: chrom_position_ref_alt. Recall that there is only a single variant at each position after splitting (#26).
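
A minimal sketch of the VID and of the 0/1/2/NA recoding, assuming one alternate allele per line (i.e. after splitting) and standard VCF GT strings. The function names are illustrative only and are not the ones used in processVEP.py.

```python
def variant_id(chrom, pos, ref, alt):
    """Build the VID as chrom_position_ref_alt."""
    return "_".join([str(chrom), str(pos), ref, alt])


def recode_genotype(gt):
    """Number of ALT alleles (0, 1 or 2) for a biallelic GT, or None if missing.

    Assumes the GT alleles are '0', '1' or '.', separated by '/' or '|'.
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for a in alleles if a != "0")


def to_r(value):
    """Write missing values as NA so the file loads cleanly into R."""
    return "NA" if value is None else str(value)


vid = variant_id("22", 11065978, "C", "T")
sample_fields = ["0/1:20,19:39", "1/1:0,30:30", "0/0:25,0:25", "./."]
row = [to_r(recode_genotype(s.split(":")[0])) for s in sample_fields]
print(vid, ",".join(row))
```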

pontikos commented 9 years ago

@APLevine in your Python script: https://github.com/vplagnol/pipelines/blob/master/annotation/extract_VEP.py#L20 You define the following columns in the output:

count_labels = ["WT/HET/HOM/MISS","AF","MF","MISS-F","ALLELE-AF","ALLELE-MISS","ALLELE-ALT","ALLELE_TOTAL","ALLELE_MF"]

Which ones do you think are important for me to include in my script? https://github.com/vplagnol/pipelines/blob/master/annotation/processVEP.py

Some example output of the three files generated can be found here:

/cluster/project8/IBDAJE/Annotation/GRCh37/sample_genotypes.csv
/cluster/project8/IBDAJE/Annotation/GRCh37/variant_annotations.tab
/cluster/project8/IBDAJE/Annotation/GRCh37/sample_genotypes_quality.tab

APLevine commented 9 years ago

Probably WT/HET/HOM/MISS and AF.

Adam
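
A minimal sketch of these two summaries computed from genotypes already recoded as 0/1/2 with None for missing. It assumes that AF means the alternate allele frequency among called genotypes, which may not match the definition used in extract_VEP.py; `genotype_summary` is a made-up name.

```python
def genotype_summary(genotypes):
    """Return the WT/HET/HOM/MISS counts and the ALT allele frequency.

    genotypes: list of 0, 1, 2 or None (missing).
    AF is assumed to be (HET + 2*HOM) / (2 * number of called genotypes).
    """
    called = [g for g in genotypes if g is not None]
    wt, het, hom = called.count(0), called.count(1), called.count(2)
    miss = len(genotypes) - len(called)
    af = (het + 2 * hom) / (2 * len(called)) if called else None
    return {"WT/HET/HOM/MISS": f"{wt}/{het}/{hom}/{miss}", "AF": af}


summary = genotype_summary([0, 1, 1, 2, None])
print(summary)
```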

pontikos commented 9 years ago

Issues raised by @vplagnol

1- Looking at chr13 for example, I see that min(which(data$EA_MAF != ".")) returns 59500, so the first 60K positions have essentially no annotations. Any idea why? Some centromere thing?

Good point, no idea, I'm going to look into it but this may well be the case.

2- Also the 1KG positions look like G:0.058605,G:0.058605. It looks like it is repeated, which must be a bug of some sort. I think what we need is:

  • make it a number
  • but make sure that the allele matches the alternate allele listed for that variant

We'll want to filter on that number, hence the need to have it cleaned up.

Yes, I expect there to be redundancy, as this is part of the CSQ field, which is allele specific. For example, if there are two transcripts, then all non-transcript-specific fields are repeated:

Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|CANONICAL|SIFT|PolyPhen|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|CAROL|Condel
G|ENSG00000198033|ENST00000400113|Transcript|synonymous_variant|237|132|44|G|ggT/ggC|rs36215075||-1|TUBA3C|HGNC|HGNC:12408|YES|||G:0.0303|G:0.01|G:0.05|G:0.01|G:0.05|G:0.016568|G:0.058605||||
G|ENSG00000198033|ENST00000618094|Transcript|synonymous_variant|180|132|44|G|ggT/ggC|rs36215075||-1|TUBA3C|HGNC|HGNC:12408||||G:0.0303|G:0.01|G:0.05|G:0.01|G:0.05|G:0.016568|G:0.058605||||

When I split by column, I don't eliminate the redundancy but maybe I should? I will do it at least for the MAFs.
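
A minimal sketch of how the repeated MAF values could be collapsed into a single number while checking the allele, as requested in point 2. The helper name `collapse_maf` and the "." missing marker are assumptions for illustration.

```python
def collapse_maf(raw, alt):
    """Collapse e.g. 'G:0.058605,G:0.058605' into 0.058605 for ALT allele 'G'.

    Returns None when there is no value, when the reported allele does not
    match the ALT allele, or when the repeated values disagree.
    """
    values = set()
    for item in raw.split(","):
        if item in ("", "."):
            continue
        allele, _, freq = item.partition(":")
        if allele == alt:
            values.add(float(freq))
    if len(values) == 1:
        return values.pop()
    return None


print(collapse_maf("G:0.058605,G:0.058605", "G"))
print(collapse_maf("T:0.4357,T:0.4357", "C"))
```

Returning None on an allele mismatch keeps the column numeric for filtering; whether a mismatch should instead be reported as 1 minus the frequency is a separate decision this sketch does not make.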

pontikos commented 9 years ago

Here's another strange one with '&' in the MAF fields:

Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|CANONICAL|SIFT|PolyPhen|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|CAROL|Condel
C|ENSG00000156413|ENST00000286955|Transcript|synonymous_variant|1908|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||
C|ENSG00000156413|ENST00000527106|Transcript|synonymous_variant|1124|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||
C|ENSG00000156413|ENST00000318336|Transcript|synonymous_variant|2050|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017|YES|||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||
C|ENSG00000156413|ENST00000524754|Transcript|synonymous_variant|1495|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||
C|ENSG00000156413|ENST00000592563|Transcript|synonymous_variant|855|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||

The original line in the VEP output:

chr19   5831713 .       T       C       6003.89 .       BaseQRankSum=-2.379e+00;ClippingRankSum=0.113;DP=339;FS=5.923;InbreedingCoeff=-0.1748;MLEAC=13;MLEAF=0.542;MQ=69.58;MQ0=0;MQRankSum=-4.890e-01;QD=20.22;ReadPosRankSum=0.180;CSQ=C|ENSG00000156413|ENST00000286955|Transcript|synonymous_variant|1908|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||,C|ENSG00000156413|ENST00000527106|Transcript|synonymous_variant|1124|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||,C|ENSG00000156413|ENST00000318336|Transcript|synonymous_variant|2050|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017|YES|||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||,C|ENSG00000156413|ENST00000524754|Transcript|synonymous_variant|1495|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.52|C:0.0017&C:0.62|C:0.02&C:0.39||||0&0&1&1||,C|ENSG00000156413|ENST00000592563|Transcript|synonymous_variant|855|855|285|P|ccA/ccG|rs13346240&rs112313064&COSM3766452&COSM3766451||-1|FUT6|HGNC|HGNC:4017||||T:0.4357|C:0.02&C:0.81|C:0.0028&C:0.0/1:20,19:39    0/1:10,12:22    0/1:21,14:35    0/1:11,31:42

No idea how to interpret this... @vplagnol @APLevine ? Shall I ask VEP developers?
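
While this is unresolved, a minimal sketch of how such fields could at least be split and flagged rather than silently mis-parsed. It assumes that '&' joins several values inside one CSQ field, as it does in Existing_variation above; which value belongs to which existing variant is not established here. `split_multi_maf` and `is_ambiguous` are made-up names.

```python
def split_multi_maf(field):
    """Split 'C:0.02&C:0.81' into [('C', 0.02), ('C', 0.81)]."""
    pairs = []
    for item in field.split("&"):
        if not item:
            continue
        allele, _, freq = item.partition(":")
        pairs.append((allele, float(freq)))
    return pairs


def is_ambiguous(field, alt):
    """True when more than one distinct frequency is reported for the ALT allele."""
    freqs = {freq for allele, freq in split_multi_maf(field) if allele == alt}
    return len(freqs) > 1


afr = split_multi_maf("C:0.02&C:0.81")
print(afr, is_ambiguous("C:0.02&C:0.81", "C"))
```

Flagged variants could be written to a separate file for manual review instead of being given an arbitrary frequency.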

pontikos commented 9 years ago

I've run that variant using the webtool: http://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?db=core;tl=T6pK51e9LuJlbJfH-577229 It still doesn't shed any light on why there are two very different AFs returned.

APLevine commented 9 years ago

Are these b38 data? I assume they must be because the chromosome is prefixed with "chr". Remember if you want to use the webtool with b37 data you need to use http://grch37.ensembl.org/Homo_sapiens/Tools/VEP.

Are the frequencies with & from custom annotations or the VEP?

APLevine commented 9 years ago

When I split by column, I don't eliminate the redundancy but maybe I should? I will do it at least for the MAFs.

I know you kept it so that if there are two transcripts you would get synonymous_variant,synonymous_variant as it is easier to separate out the transcript specific effects. However, I think this redundancy should be removed. It just takes up extra space.

pontikos commented 9 years ago

I'm using b38.

I think it's because we have two SNPs in the output: http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=19:5831213-5832213;tl=T6pK51e9LuJlbJfH-577229;v=rs13346240;vdb=variation;vf=9502280 and http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=19:5831213-5832213;source=dbSNP;tl=T6pK51e9LuJlbJfH-577229;v=rs112313064;vdb=variation;vf=24908425

However, the freq of the second matches but I don't see any freq info for the first.

APLevine commented 9 years ago

Hmm, unless VP can shed any light on that I suggest you email the ensembl dev team.

APLevine commented 9 years ago

I see that you have asked them already...

vplagnol commented 9 years ago

Don't expect perfection, guys. There will always be small issues on a genome-wide basis that complicate things a bit. That should not prevent us from getting it working on 99% of the output.

A question I asked Niko: can we make sure that we have 1KG data on almost all variants? My intuition is that the vast majority of what we find should be in 1KG and the liftover done by ensembl should be good. It covers intergenic regions so that is quite important as we transition to WGS.

pontikos commented 9 years ago

OK, once the scripts are working I will tabulate the number of variants which have 1KG annotation (the AFR_MAF field) in the annotations.tab file.
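
A minimal sketch of that tabulation, assuming a tab-delimited file with a header row, an AFR_MAF column, and "." or an empty string for missing values. The VID column and the demo rows are invented for illustration.

```python
import csv
import io


def count_annotated(handle, column="AFR_MAF"):
    """Return (number of rows with a value in `column`, total number of rows)."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = annotated = 0
    for row in reader:
        total += 1
        if row.get(column) not in (None, "", "."):
            annotated += 1
    return annotated, total


# Invented demo data standing in for the real annotations file.
demo = io.StringIO("VID\tAFR_MAF\n13_1_A_G\t0.01\n13_2_C_T\t.\n13_3_G_A\t\n")
annotated, total = count_annotated(demo)
print(f"{annotated}/{total} variants have {'AFR_MAF'} annotation")
```

For the real file, the handle would come from `open(path, newline="")` rather than the in-memory demo.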
