precisely / bioinformatics

Compare coverage for genes and SNPs in Elizabeth's research on both ThermoFisher PMRA (Akesogen DTC) and Illumina GSA (23andMe v5) #32

Closed taltman closed 6 years ago

taltman commented 6 years ago

@aneilbaboo commented on Tue Mar 06 2018

ME/CFS curated genes


@taltman commented on Thu Mar 22 2018

@aneilbaboo, I think this is more of a bioinformatic task, now that I've created dependencies that separate out the curation work from the data processing work.


@aneilbaboo commented on Thu Mar 22 2018

@taltman - Please feel free to move this over to bioinformatics if you like.

There's a [Move Issue] button over on the right-hand side of this page.

taltman commented 6 years ago

Using data scraped out of Elizabeth's working document:

https://github.com/precisely/gene-panel-curation/blob/master/doc/hupf/gene_system_curation_3-27.xlsx

... I've computed that, out of the 76 experimentally-validated ME/CFS variants curated so far, only 15 are represented directly on the Affy chip, while the 23andMe arrays directly cover between 30 and 50, depending on the array version.

Direct measurement is preferable to imputation, but on the Affy chip, and on some 23andMe array versions, the majority of ME/CFS variants would need to be imputed. So the next question is how well imputation covers the remainder of our curated ME/CFS variants, and at what quality.

taltman commented 6 years ago

@aneilbaboo, forgot to tag you in the above comment.

taltman commented 6 years ago

@aneilbaboo Here's an update on my comparison:

I've obtained a sample ThermoFisher PMRA VCF file. I am working with Ricky at Akesogen to figure out issues with it (severely malformed header, odd chromosome labels, etc.).

I've been able to impute a test file of chromosome 22 loci against the 1k Genomes Project genotyping dataset of n>2,000 fully-sequenced individuals using the Beagle imputation software. It will impute all loci not found in a target VCF file (e.g., a VCF derived from 23andMe or Affy PMRA data), so all ME/CFS-associated variants not directly measured will have an estimate with an associated probability score.
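
For reference, a minimal sketch of how a single-chromosome Beagle run like this could be scripted. The jar name and file paths are placeholders, not the files used here. `gt=`, `ref=`, `out=`, and `chrom=` are Beagle's key=value style arguments as I understand them, but they should be checked against the documentation for whichever Beagle version is installed.

```python
from typing import List, Optional


def beagle_command(target_vcf: str,
                   reference_vcf: str,
                   out_prefix: str,
                   chrom: Optional[str] = None,
                   jar: str = "beagle.jar") -> List[str]:
    """Build the argument list for one Beagle imputation run.

    All paths and the jar name are hypothetical placeholders.
    """
    cmd = [
        "java", "-jar", jar,
        f"gt={target_vcf}",      # target genotypes (e.g. array-derived VCF)
        f"ref={reference_vcf}",  # phased reference panel (e.g. 1k Genomes)
        f"out={out_prefix}",     # prefix for the output files
    ]
    if chrom is not None:
        cmd.append(f"chrom={chrom}")  # restrict the run to one chromosome
    return cmd


# Example: print the command for chromosome 22 without running it.
print(" ".join(beagle_command("target.vcf.gz", "ref.chr22.vcf.gz",
                              "imputed.chr22", chrom="22")))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.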

Running imputation across all chromosomes at once will take a bit more work and computational horsepower. Once that is done, I can also assess the imputation probabilities for those of the 76 loci that Elizabeth has identified (see above) that are not found on either the 23andMe array(s) or the PMRA. The probabilities will help us figure out whether the 23andMe loci set really does a better job of covering all 76 loci of interest to us.

taltman commented 6 years ago

@aneilbaboo:

Using the dbSNP rsids from the ME/CFS Variant curation sheet (thanks Elizabeth!): https://docs.google.com/spreadsheets/d/1DZ1bf2Ws4GSfDyS4EoCntoNNxoy2GxWMRgsuw4ylfac/

We have 135 unique ME/CFS rsids. Out of that set of 135, the Affy array measures 23 of them. Out of the set of 135, three different 23&Me datasets cover 53, 54, and 79 of them, respectively.
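
The overlap counts above come down to a set intersection between the curated rsids and the rsids present in each array's VCF. A minimal sketch of that comparison, assuming the array data is a standard tab-delimited VCF with rsids in the ID column (the function names are illustrative, not the script actually used):

```python
def rsids_from_vcf_lines(lines):
    """Collect rsids from the ID column (3rd field) of VCF data lines."""
    rsids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line
        # The ID column may hold several identifiers separated by ';'
        for identifier in fields[2].split(";"):
            identifier = identifier.strip().lower()
            if identifier.startswith("rs"):
                rsids.add(identifier)
    return rsids


def direct_coverage(curated_rsids, array_rsids):
    """Return (number of curated rsids measured directly, number curated)."""
    curated = {r.strip().lower() for r in curated_rsids if r.strip()}
    measured = {r.strip().lower() for r in array_rsids if r.strip()}
    return len(curated & measured), len(curated)
```

Calling `direct_coverage(curated, rsids_from_vcf_lines(open(path)))` once per array file gives one pair of counts per platform.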

This criterion alone points to using the Illumina array for the best direct measurement of ME/CFS-relevant SNPs. Imputation analysis is coming soon.

taltman commented 6 years ago

@aneilbaboo Here are some stats to keep in mind during this discussion:

Number of SNPs in example PMRA VCF file (i.e., direct measurements): 872,625

Number of SNPs in three example 23&Me VCF files (i.e., direct measurements): 594,936, 597,878, and 959,623
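
For anyone wanting to reproduce counts like these, here is a rough sketch of one way to count SNP records in a VCF. It is an assumption about the counting rule, not necessarily how the figures above were produced: a record counts as a SNP when REF is a single base and every ALT is a single base, with "." (no alternate allele reported) also accepted.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip-compressed VCF for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_snps(lines):
    """Count data lines whose REF and ALT alleles are all single bases."""
    bases = {"A", "C", "G", "T"}
    allowed_alts = bases | {"."}  # '.' means no alternate allele reported
    count = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line
        ref = fields[3].upper()
        alts = fields[4].upper().split(",")
        if ref in bases and all(alt in allowed_alts for alt in alts):
            count += 1
    return count
```

With a real file this would be used as `with open_vcf(path) as handle: print(count_snps(handle))`.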

Number of SNPs in latest release of the 1k Genomes Project (as per publication; what we'd impute if not directly measured): 84.7 million

taltman commented 6 years ago

To get a sense of the imputation quality we can expect, I tried running the VCF files from PMRA and 23&Me through Beagle, a popular open-source imputation program that is unencumbered by restrictive licensing.

Analyzing one chromosome at a time, I was able to get the 23&Me example VCF files to work, imputing all values not directly measured by the 23&Me array. I was unable to get the PMRA data to work, as there were large intervals along the chromosome without enough SNPs in the PMRA VCF file to allow for the imputation computation to proceed. I am still digging into why this is the case.
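
One way to dig into the sparse-interval problem is to list the largest gaps between adjacent SNP positions on a chromosome. The sketch below is only a diagnostic: the gap threshold is a hypothetical value to tune, and it does not reproduce whatever check Beagle applies internally.

```python
def large_gaps(positions, max_gap_bp):
    """Return (start, end) pairs of adjacent positions more than max_gap_bp apart."""
    ordered = sorted(set(positions))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_bp]


def positions_for_chrom(lines, chrom):
    """Extract integer POS values for one chromosome from VCF data lines."""
    positions = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0] == chrom and fields[1].isdigit():
            positions.append(int(fields[1]))
    return positions
```

Note that chromosome labels have to match the file exactly (for example "13" versus "chr13"), which matters given the odd chromosome labels in the PMRA file.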

Of the 135 SNPs that Elizabeth has curated for ME/CFS, the most common chromosome (the "mode") is chromosome 13, with 35 SNPs. I used that chromosome for testing the imputation. I was able to impute all 1k Genomes SNPs for chromosome 13 that were not directly measured. All of the predictions for the 35 SNPs not directly measured had a probability of 100%, except for two with 98% and 96%.
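
As an illustration of how per-SNP probabilities like these could be pulled out of an imputed VCF, here is a small sketch. It assumes the output carries a per-sample `GP` (genotype probability) FORMAT field; whether Beagle writes that field depends on the version and options used, so treat this as a sketch rather than the exact procedure behind the numbers above.

```python
def max_genotype_probability(vcf_line, sample_index=0):
    """Return the highest genotype probability for one sample, or None.

    Assumes a tab-delimited VCF data line with a GP entry in FORMAT.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    sample_column = 9 + sample_index  # first sample is the 10th column
    if len(fields) <= sample_column:
        return None  # no such sample column
    keys = fields[8].split(":")
    if "GP" not in keys:
        return None  # GP was not written for this record
    values = fields[sample_column].split(":")
    gp_position = keys.index("GP")
    if gp_position >= len(values):
        return None  # trailing fields were dropped for this sample
    try:
        return max(float(p) for p in values[gp_position].split(","))
    except ValueError:
        return None  # missing value such as '.'
```

Filtering the imputed file down to the curated rsids and applying this per line yields one probability per SNP.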

Since it is unlikely that the PMRA chip will do significantly better in terms of imputation quality, and the PMRA chip directly measures fewer than half as many ME/CFS-relevant SNPs as any of the 23&Me data files (23 versus 53 to 79), I'd recommend going with the Illumina-based arrays for the specific case of ME/CFS analysis. In the long run, the PMRA chip might be better if we have to consider more disease areas.

I'm waiting to get some verified 23&Me "Version 5" datasets, to make sure that the above analysis holds true for their latest array type.