statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

run tripal curator for biomaterials #334

Open bradfordcondon opened 6 years ago

bradfordcondon commented 6 years ago

Pilot project by @bradfordcondon

on dev...

Once this is done, we can discuss converting all properties on live.

bradfordcondon commented 6 years ago

@mestato see https://github.com/statonlab/tripal_curator/blob/master/docs/Edit_by_CV.md

There are only 11 other terms, so I can just go ahead and change them if you want. Although maybe you should try one to give me feedback on the tool?

I converted biomaterials using tripal:temperature to PATO's temperature term:

http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0000146

N.B. I edited on the DEV SITE, so: https://hardwoods.ag.utk.edu/admin/tripal/extension/tripal_curator

bradfordcondon commented 6 years ago

Terms still using the "biomaterial property" CV are listed at https://hardwoods.ag.utk.edu/admin/tripal/extension/tripal_curator/CV_usage/42

need to pick PATO terms for ...

| original term | target PATO term | URL | comments |
|---|---|---|---|
| temperature | temperature | http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0000146 | |
| treatment | treatment (with EFO) | https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000727 | |
| age | age | http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0000011 | |
| cultivar | strain (note cultivar is in EFO) | http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0001034 | not a PATO for variety or cultivar |
| tissue | tissue (SIO) | https://www.ebi.ac.uk/ols/ontologies/sio/terms?iri=http%3A%2F%2Fsemanticscience.org%2Fresource%2FSIO_010002 | |
| label | bad term, remove? otherwise: SIO label | https://www.ebi.ac.uk/ols/ontologies/sio/terms?iri=http%3A%2F%2Fsemanticscience.org%2Fresource%2FSIO_000179 | |
| collected_by | | | |
| investigation_type | | | set to eukaryote, almost certainly a bug from the faulty loader |
| dev_stage | developmental stage (EFO) | https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000399 | |
| geographic location | Geographic location EDAM(DATA) | https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_3720 | |
| vector | | | |
| geo_loc_name | | | |

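
For tracking purposes, the term list above could be captured as a small lookup structure. This is only a sketch: the name `PROPOSED_MAPPING` and the helper `unmapped_terms` are hypothetical, the CURIE-style IDs are derived from the URLs listed above, and none of this is part of tripal_curator itself.

```python
# Proposed legacy-term -> ontology-term mapping, transcribed from the list above.
# None marks a term for which no target has been chosen yet.
PROPOSED_MAPPING = {
    "temperature": ("PATO", "PATO:0000146"),
    "treatment": ("EFO", "EFO:0000727"),
    "age": ("PATO", "PATO:0000011"),
    "cultivar": ("PATO", "PATO:0001034"),  # "strain"; no PATO term for cultivar
    "tissue": ("SIO", "SIO:010002"),
    "label": ("SIO", "SIO:000179"),  # possibly a bad term to be removed instead
    "collected_by": None,
    "investigation_type": None,  # suspected loader bug, not a real property
    "dev_stage": ("EFO", "EFO:0000399"),
    "geographic location": ("EDAM", "data:3720"),
    "vector": None,
    "geo_loc_name": None,
}


def unmapped_terms(mapping):
    """Return the legacy terms that still need a target ontology term."""
    return sorted(term for term, target in mapping.items() if target is None)


print(unmapped_terms(PROPOSED_MAPPING))
```

Keeping the undecided terms as explicit `None` entries makes it easy to see what is left to pick.
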
bradfordcondon commented 6 years ago

/remind me in 24 hours

Oops! I used the wrong ontology

https://hardwoodgenomics.org/cv/lookup/PO THIS is the plant trait ontology

reminders[bot] commented 6 years ago

@bradfordcondon set a reminder for Aug 15th 2018

bradfordcondon commented 6 years ago

temperature

https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000146

TO's temperature = PATO's temperature
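
One way to double-check that two lookup pages point at the same term is to pull the `iri` query parameter out of each URL and compare. A minimal sketch using only the Python standard library; the helper name `curie_from_lookup_url` is made up for illustration, and it assumes the IRI ends in a `PREFIX_localid` segment, as the ones in this thread do.

```python
from urllib.parse import urlparse, parse_qs


def curie_from_lookup_url(url):
    """Extract a CURIE-style ID (e.g. 'PATO:0000146') from a lookup URL's iri parameter."""
    query = parse_qs(urlparse(url).query)  # parse_qs also decodes %3A, %2F, ...
    iri = query["iri"][0]
    local_id = iri.rstrip("/").rsplit("/", 1)[-1]  # e.g. 'PATO_0000146'
    prefix, _, number = local_id.rpartition("_")
    return f"{prefix}:{number}"


# URLs copied from this thread: the TO lookup on OLS and the PATO lookup on Ontobee.
ols_to = "https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000146"
ontobee_pato = "http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0000146"

print(curie_from_lookup_url(ols_to))
print(curie_from_lookup_url(ols_to) == curie_from_lookup_url(ontobee_pato))
```
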

A physical quality of the thermal energy of a system. [ PATOC:GVG ]

treatment

https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPECO_0007359

plant experimental condition PECO:0007359

age

http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0000011

cultivar

no word for cultivar or variety. need to browse.

tissue

I know Meg found one, but I can't find it....

label

collected_by

dev_stage

https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000261

maturity?

A quality of a single physical entity which is held by a bearer when the latter exhibits complete growth, differentiation, or development. [ Merriam-Webster:Merriam-Webster ]

position

https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000140

A spatial quality inhering in a bearer by virtue of the bearer's spatial location relative to other objects in the vicinity. [ PATOC:GVG ]

the TO uses this PATO term.

I like the EDAM geographic location more.

reminders[bot] commented 6 years ago

:wave: @bradfordcondon,

bradfordcondon commented 6 years ago

This is on hold until I load the plant trait ontology in satisfactorily. We may decide not to use terms from this ontology if they won't look nice in a browser.

bradfordcondon commented 5 years ago

Hmm, I THINK the loaders are all nice and fixed now, so we could reapproach this.

That said, as part of https://github.com/NAL-i5K/tripal_eutils we were looking at this... we should rethink mapping the properties. As we've talked about on that project, it would be ideal if NCBI were the ones to map their properties to ontology terms. Will they?

mestato commented 5 years ago

A few resources specifically for biosample records:

  1. NCBI kind of has their own controlled vocab:

  2. Updates to BioSamples database at European Bioinformatics Institute

    • https://academic.oup.com/nar/article/42/D1/D50/1048301
    • relevant items I found while skimming:
    • "The EBI’s BioSamples database is developed in parallel with the NCBI’s BioSamples database"
    • "experimental factor ontology (EFO)"
    • "The ENCODE (12) data coordination centre is working with BioSamples database to ensure their existing sample records are updated and annotated with ontology terms"
    • "EBI Resource Description Framework (RDF) platform (http://www.ebi.ac.uk/rdf) RDF is now available for the BioSamples database content. The schema is derived from the SampleTab format, supported by integration with existing ontologies such as the Ontology of Biomedical Investigations (13) and EFO. Data are made available as RDF and also for query via a SPARQL endpoint for which example queries are documented."
  3. No idea if this is helpful but at least we aren't the only people who've noticed this problem: Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies https://arxiv.org/abs/1708.01286

bradfordcondon commented 5 years ago

Thanks for the thoughtful reply

https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/

I actually do use that: the eutils module imports it into a "ncbi biosample" CV.

However, there is no versioning and there are no relationships (this is a CV, not an ontology, right?).

Ideally, NCBI would provide an OBO file that has all these terms and perhaps says that each term is_a some term in the EFO or some other ontology. Perhaps that's what EBI's mapping does; I'll have a look.
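
To make the wish concrete, here is a sketch of what one such OBO stanza might look like, generated from Python. Everything NCBI-specific here is an assumption: no such file exists as far as this thread knows, the `NCBI_BIOSAMPLE:` ID prefix is invented, and the `dev_stage` to EFO "developmental stage" parent is just the candidate mapping proposed earlier in this issue.

```python
def obo_stanza(term_id, name, parent_id, parent_name):
    """Render a minimal OBO [Term] stanza with a single is_a parent."""
    return "\n".join([
        "[Term]",
        f"id: {term_id}",
        f"name: {name}",
        # Text after '!' is a human-readable comment in OBO flat-file syntax.
        f"is_a: {parent_id} ! {parent_name}",
    ])


# Hypothetical ID prefix; the parent is the EFO candidate from earlier in the thread.
print(obo_stanza("NCBI_BIOSAMPLE:dev_stage", "dev_stage",
                 "EFO:0000399", "developmental stage"))
```
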

from https://arxiv.org/abs/1708.01286

The BioSample metadata field names and their values are not
standardized or controlled—15% of the metadata fields use field names not specified in the BioSample data dictionary. Only 9 out of 452 BioSample-specified
fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones, as even simple binary or
numeric fields are often populated with inadequate values of different data types
(e.g., only 27% of Boolean values are valid)

yes! exactly.

bradfordcondon commented 5 years ago

ok, so EBI biosample attributes look like this:

https://www.ebi.ac.uk/biosamples/samples/SAMN02953603

  <Property class="sex" characteristic="false" comment="false" type="STRING">
    <QualifiedValue>
      <Value>female</Value>
      <TermSourceREF><Name/>
        <TermSourceID>http://purl.obolibrary.org/obo/PATO_0000383</TermSourceID>
      </TermSourceREF>
    </QualifiedValue>
  </Property>
  <Property class="sub species" characteristic="false" comment="false" type="STRING">
    <QualifiedValue>
      <Value>familiaris</Value>
    </QualifiedValue>
  </Property>

here it is in genbank:

  <Attributes>
    <Attribute attribute_name="sex" harmonized_name="sex" display_name="sex">female</Attribute>
    <Attribute attribute_name="sub-species" harmonized_name="sub_species" display_name="sub species">familiaris</Attribute>
    <Attribute attribute_name="breed" harmonized_name="breed" display_name="breed">boxer</Attribute>
  </Attributes>

So you get this ontologyTerms key, if applicable. Pretty cool, honestly. Will NCBI ever support something like this? I don't know. If they did adopt this, it would break our importers because it changes the XML structure (yayyyy).

Edit: also, the term appears to be for the VALUE, not for the TYPE. http://www.ontobee.org/ontology/PATO?iri=http://purl.obolibrary.org/obo/PATO_0000383 is linked to FEMALE, not sex. You can also link a term to the type. So that's pretty awesome.
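
A rough sketch of how the two records compare when parsed, using Python's standard `xml.etree.ElementTree`. The snippets are abbreviated from the ones pasted above; the `<Sample>` wrapper element and the function names are assumptions added so the fragment parses on its own, not the real EBI document root or any eutils code.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the EBI snippet above; <Sample> is an assumed wrapper element.
EBI_XML = """
<Sample>
  <Property class="sex" characteristic="false" comment="false" type="STRING">
    <QualifiedValue>
      <Value>female</Value>
      <TermSourceREF><Name/>
        <TermSourceID>http://purl.obolibrary.org/obo/PATO_0000383</TermSourceID>
      </TermSourceREF>
    </QualifiedValue>
  </Property>
  <Property class="sub species" characteristic="false" comment="false" type="STRING">
    <QualifiedValue>
      <Value>familiaris</Value>
    </QualifiedValue>
  </Property>
</Sample>
"""

NCBI_XML = """
<Attributes>
  <Attribute attribute_name="sex" harmonized_name="sex" display_name="sex">female</Attribute>
  <Attribute attribute_name="sub-species" harmonized_name="sub_species" display_name="sub species">familiaris</Attribute>
  <Attribute attribute_name="breed" harmonized_name="breed" display_name="breed">boxer</Attribute>
</Attributes>
"""


def ebi_properties(xml_text):
    """Return (type, value, value_term_iri) tuples; the IRI is None when untagged."""
    root = ET.fromstring(xml_text)
    rows = []
    for prop in root.iter("Property"):
        value = prop.findtext("QualifiedValue/Value")
        term = prop.findtext("QualifiedValue/TermSourceREF/TermSourceID")
        rows.append((prop.get("class"), value, term))
    return rows


def ncbi_attributes(xml_text):
    """Return {harmonized_name: value}; the NCBI record carries no ontology IRIs."""
    root = ET.fromstring(xml_text)
    return {a.get("harmonized_name"): (a.text or "").strip()
            for a in root.iter("Attribute")}


for row in ebi_properties(EBI_XML):
    print(row)
print(ncbi_attributes(NCBI_XML))
```

In this sample only the "female" value carries a term IRI, which matches the observation that the tag is attached to the value rather than to the "sex" type.
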

I'm left not knowing a) who decides how these terms are mapped, and b) whether NCBI cares.

Honestly, the ontology tagging looks totally unstructured; check out the user guide: https://www.ebi.ac.uk/biosamples/docs/cookbook/curate_sample.html

mestato commented 5 years ago

Wow, that's surprisingly not that helpful. I mean, it's helpful if someone has already selected a term for the value, but how often is that the case? It's not required, right?

Type terms seem like the logical place to start, especially for unifying information across databases. The NCBI non-hierarchical vocab for biosample has 615 terms by my count. I wonder if we divided them up among everyone in the lab how long it would take to assign terms.... and how reasonable the results would be. And then if anyone would actually use them other than us.

bradfordcondon commented 5 years ago

The NCBI non-hierarchical vocab for biosample has 615 terms by my count.

I think you could start by picking a single "package". I think you can get away with the base package, and it's only ~18 terms. That's what (I think) the majority of end users end up using. It's why the example data I built eutils with only uses the same set of 20 terms.

I think that offering suggested mappings for the base terms would be a nice contribution for the Chado mapping paper too....

bradfordcondon commented 5 years ago

Here is my table for the vanilla plant submission terms for NCBI.

https://docs.google.com/spreadsheets/d/1uO2Pu4Kh_pcyHfbeAGr72zZp9JWtcD667lw9LP3B8Is/edit#gid=0

I am trying to limit myself to, in order of preference, the SIO/EFO and PO/TO ontologies. If there's not a direct match, it's not easy work.

I think I already did some work on this in another issue, so I'm going to pause before I do too much more. Basically, doing this is a total PIA. Some of the terms are too ambiguous. Perhaps looking through synonyms might help...