monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add coriell catalog #47

Closed nlwashington closed 9 years ago

nlwashington commented 9 years ago

add the Coriell catalog, either directly from FTP (see disco for connection details), or via the NIF REST services for nif-0000-00182. we might consider adding an NLP step to acquire the phenotype details.

the coriell catalog contains various cell lines that come from patients afflicted with certain diseases, and from their families. sometimes the lines come from unaffected family members, often to be used as controls.

i believe they may also host mouse cell lines, where they are models of disease, and not actually afflicted with the human disease, but i don't know if we have access to those yet.

for example: https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=GM08859&PgId=166

bryanlaraway commented 9 years ago

Assigning this to myself for tracking.

nlwashington commented 9 years ago

@cmungall and @mbrush , should the cell lines in this set be some/any/all of:

  1. line derives_from patient_with_disease, where patient_with_disease is some instance of disease (for which we usually have identifiers)?
  2. line is_model_of disease, where is_model_of needs a RO term?

i don't think it's quite right to say line has_phenotype disease, because a line doesn't necessarily present with all attributes of a disease (and may not show any of the attributes, if it's taken from an unaffected tissue from an individual with the disease.)
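To make the two options concrete, here is a minimal sketch in plain Python tuples. The relation names (`derives_from`, `is_model_of`) come from the list above; the function name, the `affected` flag, and the choice to skip the disease assertions for unaffected donors are assumptions for illustration, not dipper code.

```python
def line_triples(line_id, patient_id, disease_id, affected=True):
    """Return (subject, predicate, object) tuples for one cell line.

    Hypothetical helper: it combines option 1 and option 2 and never
    asserts `line has_phenotype disease`.
    """
    # Every line derives from a donor, affected or not.
    triples = [(line_id, "derives_from", patient_id)]
    if affected:
        # Option 1: the patient (not the line) is an instance of the disease.
        triples.append((patient_id, "rdf:type", disease_id))
        # Option 2: the line is a model of the disease.
        triples.append((line_id, "is_model_of", disease_id))
    return triples
```

With `affected=False` (for example a control line from an unaffected family member), only the `derives_from` triple is emitted.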

nlwashington commented 9 years ago

here's some example triples for this data?

    :NIGMSrepository a CLO_0000008 #repository  ?
        label : NIGMS Human Genetic Cell Repository
        foaf:page https://catalog.coriell.org/0/sections/collections/NIGMS/?SsId=8
    line_id a CL_0000057,  #fibroblast line 
        derives_from patient_id, uberon:Fibroblast
        part_of :NIGMSrepository
        #we also have the age_at_sampling type of property
    patient_id a foaf:person, proband, OMIM:disease_id
        label: "fibroblast from patient 12345 with disease X"
        member_of family_id  #what is the right thing here?
        SIO:race EFO:caucasian  #subclass of EFO:0001799
        in_taxon NCBITaxon:9606
        dc:description Literal(remark)
        GENO:has_genotype genotype_id
    family_id a owl:NamedIndividual
        foaf:page "https://catalog.coriell.org/0/Sections/BrowseCatalog/FamilyTypeSubDetail.aspx?PgId=402&fam=2104&coll=GM"
    genotype_id a intrinsic_genotype
        GENO:has_variant_part allelic_variant_id
        #we don't necessarily know much about the genotype, other than the allelic variant.
        #(also there's the sex here)
    pub_id mentions line_id

for the allelic id, try to use the dbsnp id, but sameAs the omim id.
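A small sketch of that identifier rule, assuming we fall back to the OMIM id when no dbSNP id is available. The function name and the tuple representation are hypothetical; only the "prefer dbSNP, sameAs OMIM" rule comes from the comment above.

```python
def allele_node(omim_variant_id, dbsnp_id=None):
    """Pick the node id for an allelic variant and any sameAs links.

    Returns (node_id, triples). Assumption: when there is no dbSNP id,
    the OMIM variant id is used directly and no sameAs triple is needed.
    """
    if dbsnp_id is None:
        return omim_variant_id, []
    # Prefer the dbSNP id as the node, and tie the OMIM id to it.
    return dbsnp_id, [(dbsnp_id, "owl:sameAs", omim_variant_id)]
```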

@cmungall is http://bioportal.bioontology.org/ontologies/FHHO the right thing to use here for relations between patients and family members, if we should be capturing it? also, what is the proper "age" or "stage" relationship to use here? it's to reflect the age at which the sample was taken from the patient.

cmungall commented 9 years ago

Example from CLO: http://purl.obolibrary.org/obo/CLO_0025179

this could be simplified

I think FHHO is fine for now.

nlwashington commented 9 years ago

great, we can use those patterns too. :)

nlwashington commented 9 years ago

is model for: http://www.ebi.ac.uk/cellline#is_model_for

cmungall commented 9 years ago

@jamesmalone is this still used in EFO? Looks like it may have been a placeholder. OK with adding this to RO?

jamesmalone commented 9 years ago

Absolutely.

cmungall commented 9 years ago

@jamesmalone and of course we'd probably want to coordinate with you as you have java for converting coriell into EFO I imagine. Our stub code is here: https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/Coriell.py

mellybelly commented 9 years ago

@mbrush can you comment on this? What is the relationship between Coriell and CLO at this point? We did some alignment between CLO and CL that would be good to maintain. Also there are a number of properties for relating cell lines in ERO to consider (model_of, derivation, etc.). Would be good to share pipeline with @jamesmalone but best if we can retain some of the alignment.

bryanlaraway commented 9 years ago

@cmungall @mbrush Need assistance on the mapping of cell types. My latest commit includes the mapping of cell types, and I have five that are uncertain or unknown. Coriell provides a help file for definitions.

'Amniotic fluid-derived cell line': No exact match, but assuming amniocyte, CL:0002323: "A cell of a fetus which is suspended in the amniotic fluid."

'Chorionic villus-derived cell line': No Match. Coriell defines as "Chorionic villus cultures are established from the mesenchyme core cells of the villi after first removing the trophoblast layers by dissection followed by enzymatic dissociation of the core."

'Erythroleukemic cell line': No Match. Coriell defines as "Abnormal precursor (virally transformed) of mouse erythrocytes that can be grown in culture and induced to differentiate by treatment with, for example, DMSO."

'Microcell hybrid': No Match. Coriell defines as "A hybrid cell produced by the fusion of a micro cell with the cell of another species. Microcells contain only a portion of the genome and cytoplasm of the cell from which they are derived. Microcells are produced by colcemid treatment to promote nuclear fragmentation into micronuclei followed by cytochalasin B treatment to extrude these micronuclei which are finally sheared from the cell by centrifugal force during centrifugation. Consequently each microcell contains only one or a few human chromosomes. The subset of microcell hybrids with a chromosome that carries a selectable marker may then be isolated."

'Tumor-derived cell line': No Match. "Cells isolated from a mass of neoplastic cells, i.e., a growth formed by abnormal cellular proliferation." Assuming 'Oncocyte'? CL:0002198

Edit: Found another entry: Adipose stromal cell: Mapped to mesenchymal stem cell of adipose (CL:0002570). Correct?
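For reference, the mapping as it stands in this comment can be written as a lookup table. This is a sketch only: the dictionary and function names are hypothetical, the CL ids are the ones quoted above (two of them still marked as assumptions), and `None` stands for "No Match".

```python
# Coriell cell-type label -> CL id; None means no match found yet.
CELL_TYPE_TO_CL = {
    "Amniotic fluid-derived cell line": "CL:0002323",  # amniocyte (assumed, not exact)
    "Tumor-derived cell line": "CL:0002198",           # oncocyte (assumed)
    "Adipose stromal cell": "CL:0002570",              # mesenchymal stem cell of adipose (to confirm)
    "Chorionic villus-derived cell line": None,
    "Erythroleukemic cell line": None,
    "Microcell hybrid": None,
}


def unmapped_cell_types(mapping):
    """Return the labels that still need a CL term, sorted for stable output."""
    return sorted(label for label, cl_id in mapping.items() if cl_id is None)
```

Calling `unmapped_cell_types(CELL_TYPE_TO_CL)` lists the three types that still need a term or a term request.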

bryanlaraway commented 9 years ago

@cmungall @mbrush @mellybelly @nlwashington Working on mapping race/ethnicity of patients. Don't see all of the ones I need in EFO. Should I submit term requests to EFO, or is there another ontology that might be used?

cmungall commented 9 years ago

yes, coordinate with efo and we should think carefully about how this manifests in the UI

jamesmalone commented 9 years ago

As FYI @daniwelter started building out an ethnicity ontology separate from EFO which we haven't imported yet as it's not public. It lacks some textual definitions but does have good axiomatisation defining populations and so on. The countries are all just minted in here but that was because at the time we could not open Gaz in anything. Now that it loads into OLS, we would swap those URIs out for Gaz URIs. Ontology is here: https://github.com/daniwelter/ethnonto

bryanlaraway commented 9 years ago

Placed a draft concept map for Coriell in the DropBox: LAMHDI-Project/Data modeling/Dipper Concept Maps/Coriell.cmap. Need to update the map with annotations for the different mapping functions (race, cell type, etc.) that I have created.

nlwashington commented 9 years ago

see related ticket for properly modeling age: https://github.com/monarch-initiative/dipper/issues/78

nlwashington commented 9 years ago

the variants that are part of the genotype are mapped to OMIM variant ids, like: OMIM:607840.0014.

these are instances of some variation, and not classes. we currently map any OMIM prefix to the ontology purl. @cmungall should these variants go there, or to the omim entry page, like: http://omim.org/entry/607840#0014
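If the answer is the entry page, the rewrite is mechanical. A minimal sketch, assuming variant ids always have the `OMIM:<entry>.<variant>` shape shown above; the function name is hypothetical and this is not the current dipper curie mapping.

```python
def omim_variant_url(curie):
    """Turn an OMIM variant CURIE into the anchored entry-page URL."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "OMIM" or "." not in local_id:
        raise ValueError("not an OMIM variant id: {!r}".format(curie))
    # "607840.0014" -> entry "607840", variant "0014"
    entry, _, variant = local_id.partition(".")
    return "http://omim.org/entry/{}#{}".format(entry, variant)
```

For the id above, `omim_variant_url("OMIM:607840.0014")` gives `http://omim.org/entry/607840#0014`. Plain entry ids without a variant suffix are rejected so they can keep going to the ontology purl.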

nlwashington commented 9 years ago

this has been moved onto bamboo and is ready for testing in scigraph.

still to do: hook up the ftp. the bamboo job currently runs off static files.

nlwashington commented 9 years ago

this now pulls from the coriell sftp site.

nlwashington commented 9 years ago

coriell "families" have an internal identifier, but it can't be resolved on their site. (these are groupings for individuals/people that literally are in the same family.) @cmungall or @mellybelly do you have a preference for a "monarch" identifier, or just an anonymous (BNode) id?
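Either choice can be derived from the internal family number, so the decision stays a one-line switch. A sketch under assumptions: the `MONARCH:` prefix and the `coriell_family_` naming are made up for illustration, and `2104` is the family number from the example URL earlier in this thread.

```python
def family_node(family_id, use_bnode=True):
    """Build a node id for a Coriell family from its internal identifier.

    Both id schemes here are hypothetical placeholders.
    """
    local = "coriell_family_{}".format(family_id)
    # "_:" marks a blank node label; the alternative is a minted CURIE.
    return "_:" + local if use_bnode else "MONARCH:" + local
```

Because the label is derived from the Coriell number, repeated runs produce the same node for the same family.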

nlwashington commented 9 years ago

@mbrush close if satisfied, or reassign after review.

nlwashington commented 9 years ago

@mbrush some records like https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=ND00372 will say that an individual tested wildtype for a certain variation. should we capture this in a genotype like has_reference_part variant_id ?

mbrush commented 9 years ago

Modeling the asserted absence of a variation is an interesting problem - rife with practical and philosophical implications. I don't think it is right to say a genotype has_reference_part some variant id.

One solution is to say has_reference_part [IRI of the gene that is not variant] - since we are conceptualizing punned gene IRIs in our data as representing the idea of the canonical gene. My concern here is that the asserted absence of a specific variant does not mean that the gene as a whole is canonical/wt.

Another solution is to create some property to indicate the absence of a specific variant (e.g. lacks_variant). But this is ontologically controversial and could have reasoning implications (if I recall the debate over lacks_parts properties from anatomical ontologies).

mbrush commented 9 years ago

Another issue to fix in the coriell dataset is that the affected gene is not currently modeled. This is usually done from the variant locus using the 'is_sequence_variant_of' property - but this variant locus node is not represented in this dataset due to ambiguity in cases of dual mutations as to whether they are cis or trans.

We could instead link to the gene from the sequence alteration node, which is represented in the dataset. Here we could (1) create a new property that links a sequence alteration to the gene in which it is found, or (2) use the existing is_subsequence_of property, or (3) generalize the existing is_sequence_variant_of relation to cover links from alterations to the affected gene.

Alternatively, we could create a bnode for the variant locus just for the purpose of linking through to the gene (but which isn't linked to the sequence alterations in cases of dual mutations because the cis/trans nature of these is not clear).
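A rough sketch of the bnode alternative, using plain tuples. The function name and the bnode naming are hypothetical, and reusing `has_variant_part` for the locus-to-alteration link is an assumption; only `is_sequence_variant_of` and the "skip the link for dual mutations" rule come from the discussion.

```python
def variant_locus_triples(genotype_id, gene_id, alteration_ids):
    """Link a genotype to its affected gene through a blank variant-locus node."""
    # Blank node label derived from the genotype id (":" is not allowed in it).
    locus = "_:variant_locus_" + genotype_id.replace(":", "_")
    triples = [
        (genotype_id, "has_variant_part", locus),
        (locus, "is_sequence_variant_of", gene_id),
    ]
    if len(alteration_ids) == 1:
        # Only one alteration, so no cis/trans ambiguity: safe to attach it.
        triples.append((locus, "has_variant_part", alteration_ids[0]))
    return triples
```

With two or more alterations the locus still reaches the gene, but the alterations are left unattached until the cis/trans question is settled.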

We will revisit this issue as the coriell dataset gets fleshed out.