Add tagging allele information for wildtype alleles to haplotype spreadsheet

The ontology currently has a significant shortcoming: the most common alleles 
(the wildtype alleles, usually denoted by *1 and listed in the first row of the 
haplotype definition spreadsheet) have no tagging SNPs, so they are never 
inferred from patient data. This needs to be changed.

I suggest doing the following:
1) Generate a list of SNPs tested by 23andMe V2 for each gene. This list can be 
generated by running the following SPARQL query over the ontology:

PREFIX cds: <http://www.genomic-cds.org/ont/genomic-cds.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?allele ?polymorphism
WHERE {
  ?polymorphism rdfs:subClassOf cds:polymorphism .
  ?polymorphism cds:can_be_tested_with cds:23andMe_v2 .
  ?polymorphism cds:relevant_for ?allele .
}
ORDER BY ?allele
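
To turn the query output into one list per allele, a small Python sketch like 
the one below might help. It assumes the results were exported as CSV with the 
header row "allele,polymorphism" (the variable names from the SELECT clause). 
The function name is made up for illustration and is not part of the existing 
scripts.

```python
import csv
import io
from collections import defaultdict


def group_snps_by_allele(csv_text):
    """Map each allele to the sorted list of SNPs returned for it.

    Assumes csv_text is a CSV export of the query results whose header
    row is "allele,polymorphism" (an assumption about the export format).
    """
    grouped = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["allele"]].add(row["polymorphism"])
    # Sort so that the output is stable between runs.
    return {allele: sorted(snps) for allele, snps in grouped.items()}
```

Calling group_snps_by_allele() on the exported text returns a dictionary keyed 
by allele, which can then be used as the lookup for step 2.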

2) Modify the haplotype definition spreadsheet. For each wildtype allele (in 
the first row), turn all SNPs that are tested by 23andMe V2 into tagging SNPs 
by adding " [tag]" to the respective cells.

3) After doing this, run the script for generating the ontology and check for 
inconsistencies (caused by overlapping allele/haplotype definitions introduced 
in step 2). If there are any, try to fix them by turning more SNPs in the 
wildtype allele into tagging SNPs (*only* modify wildtype allele definitions; 
keep all other rows untouched).
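
Step 2 could be scripted rather than done by hand. The Python sketch below is 
only an illustration under assumptions: it treats the spreadsheet as a header 
row of SNP identifiers plus one row of cells for the wildtype allele, which may 
not match the real layout of the haplotype definition spreadsheet. The function 
name and the sample identifiers in the usage note are hypothetical.

```python
TAG = " [tag]"


def tag_wildtype_row(header, wildtype_row, tested_snps):
    """Return a copy of the wildtype row with tested SNP cells tagged.

    header       -- column names, assumed to be SNP identifiers
    wildtype_row -- cell values of the wildtype (first) row
    tested_snps  -- set of SNP identifiers tested by 23andMe V2
    """
    tagged = []
    for column_name, cell in zip(header, wildtype_row):
        # Skip empty cells and cells that already carry the marker.
        if column_name in tested_snps and cell and not cell.endswith(TAG):
            cell += TAG
        tagged.append(cell)
    return tagged
```

For example, with header ["allele", "snp_1", "snp_2"], wildtype row 
["*1", "A", "G"] and tested SNPs {"snp_1"}, only the "snp_1" cell gets the 
" [tag]" suffix. All other rows are left alone, in line with the constraint in 
step 3.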

After doing this, I expect that far more alleles and CDS rules will match for 
each patient.

Original issue reported on code.google.com by matthias...@gmail.com on 11 Aug 2013 at 11:30
