statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

load the plant trait ontology PTO #155

Closed bradfordcondon closed 5 years ago

bradfordcondon commented 6 years ago

OBO file available here

bradfordcondon commented 6 years ago

https://github.com/tripal/tripal/issues/120

let's try following Sofia's suggestion

bradfordcondon commented 6 years ago

Note the "imports" statements at the top. I'd bet those get ignored. Those are also .OWL files. So unless we want to load the whole ontologies into our db (CHEBI, PO, RO), we would want to load the subset .OWL files.

Core is working on an .OWL loader. Maybe this is a "wait and see" situation until we move farther along in the CV project

bradfordcondon commented 6 years ago

New plan:

One could argue we want the full ontologies on our site. Maybe. But I don't want them all on my dev, if I can expect each one to take ~2 days to load in....

bradfordcondon commented 6 years ago

subontologies are available here https://github.com/bradfordcondon/plant_trait_ontology_import_obos

bradfordcondon commented 6 years ago

Issue: the typedefs don't have names.


[Typedef]
id: decreased_in_magnitude_relative_to
domain: PATO:0000001 ! quality
range: PATO:0000001 ! quality
is_transitive: true
is_a: different_in_magnitude_relative_to

I think they should have names, and the names can be equal to the id.

I would want to automate this fix for some of the ontologies....
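A minimal sketch of how that fix could be automated, assuming stanzas start with a bracketed header such as `[Typedef]` and that a missing name should simply copy the id. The function name `add_missing_typedef_names` is hypothetical and not part of any loader.

```python
def add_missing_typedef_names(obo_text: str) -> str:
    """Give every [Typedef] stanza without a name: tag a name equal to its id."""
    out = []
    stanza = []

    def flush():
        # Only touch [Typedef] stanzas that have no name: line yet.
        if stanza and stanza[0].strip() == "[Typedef]":
            if not any(line.startswith("name:") for line in stanza):
                for i, line in enumerate(stanza):
                    if line.startswith("id:"):
                        # Drop any trailing "! comment" from the id value.
                        term_id = line[len("id:"):].split("!")[0].strip()
                        stanza.insert(i + 1, f"name: {term_id}")
                        break
        out.extend(stanza)
        stanza.clear()

    for line in obo_text.splitlines():
        if line.startswith("[") and line.rstrip().endswith("]"):
            flush()  # a new stanza header closes the previous stanza
        stanza.append(line)
    flush()
    return "\n".join(out) + "\n"
```

Running it twice gives the same result as running it once, so it should be safe to apply to an ontology file that has already been partly fixed.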

bradfordcondon commented 6 years ago

OK - I've created a "solitaire" Plant trait ontology.

It has all the cross-references to other ontologies (CHEBI, PATO, GO, RO, PECO) removed. It also has the nameless terms (which I believe are also cross-references added "after the fact") removed.

The important thing is, it loads. And, it loads in 1 minute 50 seconds.

This is a HWG decision. Do you want to load the full ontologies for CHEBI, PATO, GO, RO, and PECO? If so, you can load them then load the full PTO. If not, load this solitaire OBO. For a developer site, obviously loading solitaire is the choice.

bradfordcondon commented 6 years ago

here's the core discussion. https://github.com/tripal/tripal/issues/120

bradfordcondon commented 6 years ago

We will load in my miniature prerequisites.

bradfordcondon commented 6 years ago

Loaded dev.

Note: the URL needs to be the raw one: https://raw.githubusercontent.com/bradfordcondon/plant_trait_ontology_import_obos/master/pto_simple.obo
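As a rough sketch, a regular GitHub file URL can be rewritten into its raw form like this. The helper name `to_raw_github_url` is made up for illustration, and it assumes the usual `<owner>/<repo>/blob/<branch>/<path>` layout with a branch name that contains no slashes.

```python
from urllib.parse import urlsplit


def to_raw_github_url(url: str) -> str:
    """Turn a github.com "blob" URL into its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    if parts.netloc == "raw.githubusercontent.com":
        return url  # already raw, nothing to do
    if parts.netloc != "github.com":
        raise ValueError(f"not a GitHub URL: {url}")
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<branch>/<path...>
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file URL: {url}")
    owner, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, branch, *path])
```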

bradfordcondon commented 6 years ago

Can't submit jobs to the OBO loader: it says there is already a job in the queue.

bradfordcondon commented 6 years ago

Oddly, the Tripal jobs queue had to be truncated first (`TRUNCATE tripal_jobs;`).

loaded live.

bradfordcondon commented 6 years ago
Downloading URL http://purl.obolibrary.org/obo/to.obo, saving to /tmp/obo_apRpvb
Step 1: Preloading File /tmp/obo_apRpvb...
Step 2: Loading type defs... Memory: 48,377,344 bytes.
Step 3: Loading terms...%. Memory: 48,372,168 bytes.
A term that belongs to another ontology is used within this vocabulary.  Therefore a lookup was performed with the EBI Ontology Lookup Service to retrieve the information for this term. Please note, that vocabularies with many non-local terms require remote lookups and these lookups can dramatically decrease loading time.

I'm rerunning the importer pointed at the true ontology now that core has added EBI-OLS support for terms not in the ontology. It appears to be running OK so far on dev. Need to re-run on live if it works.

bradfordcondon commented 6 years ago

https://hardwoods.ag.utk.edu/cv/lookup/TO

[Screenshot: 2018-07-31, 3:45:18 PM]

Looks good!

Let's reload it live as well: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/397674

bradfordcondon commented 6 years ago

Looks great! Two terms were loaded incorrectly because of a colon in the ID; see https://github.com/tripal/tripal/issues/525

I think I can just manually assign/rename and be happy though.

[Screenshot: 2018-07-31, 6:14:52 PM]
bradfordcondon commented 6 years ago
select * from chado.db where name = 'fatty acid anion 18';
db_id   name    description urlprefix   url
(0 rows)
bradfordcondon commented 6 years ago

Hmmm, I'm actually not sure how to do that because of weird NOT NULL constraints:

select * from chado.dbxref dbx INNER JOIN chado.db db ON db.db_id = dbx.db_id WHERE db.name = 'fatty acid 18';
dbxref_id   db_id   accession   version description db_id   name    description urlprefix   url
245078  190 3           190 fatty acid 18
(1 row)
hardwoods_06112018=> delete from chado.dbxref dbx INNER JOIN chado.db db ON db.db_id = dbx.db_id WHERE db.name = 'fatty acid 18';
ERROR:  syntax error at or near "INNER"
LINE 1: delete from chado.dbxref dbx INNER JOIN chado.db db ON db.db...
                                     ^
hardwoods_06112018=> delete from chado.dbxref WHERE dbxref_id = 245078;
ERROR:  null value in column "dbxref_id" violates not-null constraint
DETAIL:  Failing row contains (134184, 90, CHEBI:132502, , null, 0, 0).
CONTEXT:  SQL statement "UPDATE ONLY "chado"."cvterm" SET "dbxref_id" = NULL WHERE $1 OPERATOR(pg_catalog.=) "dbxref_id""
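Two separate things seem to be going on in that session. PostgreSQL's `DELETE` does not accept a `JOIN`, so the filter has to move into a subquery (or a `USING` clause). And the NOT NULL error appears to come from the foreign key on `cvterm.dbxref_id`: deleting the dbxref makes the database try to set that column to NULL, which the column does not allow. The sketch below reproduces the pattern in an in-memory SQLite database with heavily simplified stand-ins for the Chado tables. The replacement record (`db_id` 1, `dbxref_id` 300000) and the helper names are invented for illustration, and reassigning the cvterm first is only one possible way out.

```python
import sqlite3

# Simplified stand-ins for chado.db, chado.dbxref and chado.cvterm.
SCHEMA = """
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    db_id INTEGER NOT NULL REFERENCES db(db_id),
    accession TEXT NOT NULL
);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dbxref_id INTEGER NOT NULL
        REFERENCES dbxref(dbxref_id) ON DELETE SET NULL
);
-- db_id 1 and dbxref_id 300000 are made-up values for a "correct" CHEBI record.
INSERT INTO db VALUES (1, 'CHEBI'), (190, 'fatty acid 18');
INSERT INTO dbxref VALUES (300000, 1, '132502'), (245078, 190, '3');
INSERT INTO cvterm VALUES (134184, 'CHEBI:132502', 245078);
"""


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
    conn.executescript(SCHEMA)
    return conn


def remove_db_dbxrefs(conn, db_name, replacement_dbxref_id):
    """Repoint cvterms at a replacement dbxref, then delete the bad dbxrefs."""
    # Subquery instead of DELETE ... JOIN; the same form is valid in PostgreSQL.
    bad = ("SELECT dbxref_id FROM dbxref "
           "WHERE db_id IN (SELECT db_id FROM db WHERE name = ?)")
    with conn:  # commit both statements together, or roll both back
        conn.execute(
            f"UPDATE cvterm SET dbxref_id = ? WHERE dbxref_id IN ({bad})",
            (replacement_dbxref_id, db_name),
        )
        conn.execute(f"DELETE FROM dbxref WHERE dbxref_id IN ({bad})", (db_name,))


if __name__ == "__main__":
    conn = make_demo_db()
    try:
        # Same shape as the failing statement above: the cvterm row still points here.
        conn.execute("DELETE FROM dbxref WHERE dbxref_id = 245078")
    except sqlite3.IntegrityError as err:
        print("direct delete rejected:", err)
    conn.rollback()
    remove_db_dbxrefs(conn, "fatty acid 18", 300000)
    print(conn.execute("SELECT * FROM cvterm").fetchall())
```

On the real database the equivalent would be to point the affected `chado.cvterm` rows at the right dbxref before deleting, ideally inside a transaction and on dev first.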
bradfordcondon commented 6 years ago

Reopening this because PATO is NOT THE PLANT TRAIT ONTOLOGY! Oops!

I deleted the already loaded plant trait ontology (no relationships) and am reloading now.

load this instead http://www.obofoundry.org/ontology/to.html job: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/413191

bradfordcondon commented 6 years ago

After loading, it looks pretty horrible. https://www.hardwoodgenomics.org/cv/lookup/TO

Let's delete and reload just to be sure.

[Screenshot: 2018-08-16, 9:58:57 AM]
bradfordcondon commented 6 years ago

https://www.hardwoodgenomics.org/cv/lookup/TO

Deleted and reloaded, and it's still hideous.

let's examine these root terms vs the OBO file and see if we can figure out the problem.

for example: PATO:0000085 [TO:sensitivity toward] (31)

here's its record:

[Term]
id: PATO:0000085 ! sensitivity toward
is_a: PATO:0001018 ! physical quality

So it should be under PATO:0001018, which should be under PATO:0001241 ! physical object quality, which is under PATO:0000001 ! quality. That term itself is not defined in the OBO; it's presumably loaded separately.
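The expected chain can be checked against the OBO file itself with a few lines of Python. This is only a sketch: it reads `id:` and `is_a:` lines, ignores everything else in the format (escapes, trailing modifiers, multiple inheritance beyond the first parent), and the function names are invented here.

```python
def parse_is_a(obo_text):
    """Map each term id to the list of ids named in its is_a: lines."""
    parents = {}
    current = None
    for raw in obo_text.splitlines():
        line = raw.split("!")[0].strip()  # drop trailing "! label" comments
        if line.startswith("["):
            current = None  # new stanza; wait for its id: line
        elif line.startswith("id:"):
            current = line[len("id:"):].strip()
            parents.setdefault(current, [])
        elif line.startswith("is_a:") and current is not None:
            parents[current].append(line[len("is_a:"):].strip())
    return parents


def ancestor_chain(parents, term_id):
    """Follow the first is_a parent upward until a term has none recorded."""
    chain = [term_id]
    seen = {term_id}
    while parents.get(chain[-1]):
        nxt = parents[chain[-1]][0]
        if nxt in seen:  # guard against cycles in a malformed file
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain
```

Calling `ancestor_chain(parse_is_a(text), "PATO:0000085")` on the file contents gives the list of ids to compare with what the CV browser shows; a chain that stops early points at a parent that is referenced but never defined in the file.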

select * from chado.cvterm cvt INNER JOIN chado.cv cv ON cv.cv_id = cvt.cv_id  where cvt.name ='quality';
cvterm_id   cv_id   name    definition  dbxref_id   is_obsolete is_relationshiptype cv_id   name    definition
133777  89  quality     244680  0   0   89  bfo The upper level ontology upon which OBO Foundry ontologies are built.
(1 row)
hardwoods_06112018=> select * from chado.cvterm cvt INNER JOIN chado.cv cv ON cv.cv_id = cvt.cv_id  where cvt.name ='physical quality';
cvterm_id   cv_id   name    definition  dbxref_id   is_obsolete is_relationshiptype cv_id   name    definition
138725  97  physical quality        246426  0   0   97  pato    An ontology of phenotypic qualities (properties, attributes or characteristics)
(1 row)

So we do have these terms in the db. quality is in bfo instead of pato...

bradfordcondon commented 6 years ago

OK. looks like the PATO terms are loaded in weird: name = PATO:0001241 instead of physical object quality.

select cvt.name from chado.cvterm_relationship cr INNER JOIN chado.cvterm cvt ON cvt.cvterm_id = cr.object_id INNER JOIN chado.cvterm cvtsubj ON cr.subject_id = cvtsubj.cvterm_id  where cvtsubj.name = 'physical quality';
name
PATO:0001241
(1 row)

this is pretty clear from just looking at the term pages. https://www.hardwoodgenomics.org/cv/lookup/TO/quality

[Screenshot: 2018-08-16, 10:13:01 AM]

I'm going to load the ontologies on a fresh site and see if they load in messed up: if so, we know it's still the OBO loader.

Confirmed on a fresh site. The OBO must be formatted in a different way?

mestato commented 6 years ago

Multiple plant relevant ontologies listed here: http://browser.planteome.org/amigo

Does the PO from obo foundry match the official one from github? (https://github.com/Planteome/plant-ontology) The Planteome expandable/collapsible trees look great for all these ontologies, but I'm not sure how their code may parse/store the obo file differently (or if they even use chado table structure).

bradfordcondon commented 6 years ago

no, looks different. looks like it makes less extensive use of cross-ref'd terms. Let's try this instead, thanks.

bradfordcondon commented 6 years ago

@mestato this is the plant ontology, not the plant trait ontology. Sorry, I see you note they list several. Which is the one you'd like the biomaterials to use?

i.e. http://www.obofoundry.org/ontology/to.html

TO vs PO. Which is the correct one?

You are also correct that we are limited in our ontology usage by Chado. Cross-ref'd terms are troublesome, especially if they are included "as is" instead of as synonyms for terms.

bradfordcondon commented 6 years ago

from http://planteome.org/node/1

[Screenshot: 2018-08-16, 4:40:28 PM]
mestato commented 6 years ago

For the cvterms defining the fields (Tripal/Chado structure), I don't much care where the terms come from - the real importance there is alignment with other Tripal databases, as computer-level (web services) interoperability is the idea. But for the values of the fields (actual data!), the target audience is plant users (both computationally savvy and not). Since the biomaterials have a lot of fields with values that could be ontologized (plant structure, development stage, experimental treatment), we will have to draw from different ontologies: plant structure => Plant Anatomical Entity, plant development stage => plant structure development stage. For experimental treatments, we need to dig into TO vs PATO; I don't know how they overlap or interrelate. I would prefer TO to align with Planteome, but we likely will find more suitable generic terms in PATO. We don't use the environmental ontology right now.

bradfordcondon commented 6 years ago

> For the cvterms defining the fields (Tripal/Chado structure), I don't much care where the terms come from - the real importance there is alignment with other Tripal databases, as computer-level (web services) interoperability is the idea

Well, the property terms will determine the CV browser mappings as well.

But I'm hearing you say that we can't expect all properties to map to a single ontology, which does make sense. TO makes heavy use of PATO: I find the PATO terms by searching within TO. I'm thinking that this is OK, and we should use these terms. If we have to tweak the browser to display them correctly, we can do that.

Mapping the values to cvterms will require quite a bit of thought and work because many of the property values are in fact dense blocks of text that need to be split into multiple property-value pairs. We'll deal with that after we map the properties themselves, as a start.

bradfordcondon commented 6 years ago

OK, I suspect that the problem isn't the ontology but the loader. All those PATO terms showing at the root have their db/accessions swapped. See https://github.com/tripal/tripal/issues/558

bradfordcondon commented 6 years ago

giving this another try on dev.

Because we've got so many messed up terms (i.e. with the db and accession reversed), I'm going to delete first. All terms are in the plaint_trait_ontology cv.

bradfordcondon commented 6 years ago

Performing EBI OLS Lookup for: PO:0001108
Cannot find the term via an EBI OLS lookup: PO:0001108. EBI Reported: Resource not found.Consider finding the OBO file for this ontology and manually loading it first.
[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot find the term via an EBI OLS lookup: PO:0001108. EBI Reported: Resource not found.Consider finding the OBO file for this ontology and manually loading it first.

this error is as-reported in the issue

bradfordcondon commented 6 years ago

now we're waiting on : https://github.com/tripal/tripal/issues/665

bradfordcondon commented 6 years ago

I'm checking out the plan of loading the ontologies separately:

bradfordcondon commented 6 years ago

oops: CHEBI runs out of memory, and loading the PO does not prevent the missing PO term error:

Cannot find the term via an EBI OLS lookup: PO:0001108. EBI Reported: Resource not found.Consider finding the OBO file for this ontology and manually loading it first.

I dumped $full_url and $results:

string(110) "http://www.ebi.ac.uk/ols/api/ontologies/po/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPO_0001108"
array(6) {
  ["timestamp"]=>
  int(1537831783800)
  ["status"]=>
  int(404)
  ["error"]=>
  string(9) "Not Found"
  ["exception"]=>
  string(62) "org.springframework.data.rest.webmvc.ResourceNotFoundException"
  ["message"]=>
  string(18) "Resource not found"
  ["path"]=>
  string(90) "/ols/api/ontologies/po/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPO_0001108"

term is clearly in EBI: https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPO_0001108

Yep! If we were to look it up under the TO instead:

http://www.ebi.ac.uk/ols/api/ontologies/to/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPO_0001108 (note the change from `/po/terms/` to `/to/terms/`)

So, before we throw an error, we can try looking it up under the namespace of the currently loaded ontology instead...
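A small Python sketch of that fallback order, building only the URLs and making no network calls. The base URL and the double URL-encoding follow the `$full_url` dump above; the function names are hypothetical and this is not the Tripal importer's code.

```python
from urllib.parse import quote

OLS_API = "http://www.ebi.ac.uk/ols/api/ontologies"


def ols_term_url(ontology: str, term_id: str) -> str:
    """Build the OLS term URL for term_id as seen from the given ontology."""
    iri = "http://purl.obolibrary.org/obo/" + term_id.replace(":", "_")
    # Encode twice: this reproduces the %253A / %252F sequences in the dumped URL.
    encoded = quote(quote(iri, safe=""), safe="")
    return f"{OLS_API}/{ontology.lower()}/terms/{encoded}"


def candidate_lookup_urls(term_id: str, loading_ontology: str) -> list:
    """URLs to try in order: the term's own ontology, then the one being loaded."""
    own_ontology = term_id.split(":", 1)[0]
    urls = [ols_term_url(own_ontology, term_id)]
    fallback = ols_term_url(loading_ontology, term_id)
    if fallback not in urls:
        urls.append(fallback)
    return urls
```

For `candidate_lookup_urls("PO:0001108", "TO")` the first entry is the `/po/terms/` URL that returned 404 and the second is the `/to/terms/` URL shown above. An error would only be raised once every candidate has failed.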

bradfordcondon commented 6 years ago

Update on this: this term was fixed in the Trait Ontology after I reported it. The next error is a relationship term which appears to be OK in the OBO. I am working on modifying the importer to just warn instead of error if it can't look up the term correctly; once that feature is in, we'll be ready to try loading again.

bradfordcondon commented 6 years ago

Yayyyy, here's the plant trait ontology tree (TO, not PTO as in the issue title...)

[Screenshot: 2018-10-04, 12:06:29 PM]

beautiful!

we can load once my PR is merged or we can use the 680_warn_instead_of_error_no_term branch.

The code change was to warn instead of error if it can't find a term. In one case, the plant trait ontology had an error, which was fixed. The other errors are problems with the loader's API call, so we might choose to wait until that gets fixed? Very minor: 2 relationship ontology terms are missing: RO:0002310 and RO:0002577
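The warn-instead-of-error idea, sketched in Python rather than the importer's PHP. `TermNotFound`, `resolve_terms` and the `lookup` callable are all placeholders invented for this illustration; the point is only that a failed lookup is logged and recorded while the load keeps going.

```python
import logging

logger = logging.getLogger("obo_loader_sketch")


class TermNotFound(Exception):
    """Raised by a lookup function when a term cannot be resolved."""


def resolve_terms(term_ids, lookup):
    """Resolve each term; warn and continue when a lookup fails."""
    resolved = {}
    missing = []
    for term_id in term_ids:
        try:
            resolved[term_id] = lookup(term_id)
        except TermNotFound as err:
            # Previously this would abort the whole load; now it is only reported.
            logger.warning("Skipping %s: %s", term_id, err)
            missing.append(term_id)
    return resolved, missing
```

Returning the list of missing ids makes it easy to report them at the end of the job, as with the two RO terms mentioned above.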

bradfordcondon commented 5 years ago

hi all, this is finally unblocked.

almasaeed2010 commented 5 years ago

I think this is done? I don't see any pending PRs on core.

bradfordcondon commented 5 years ago

it's happily loaded here on live: https://hardwoodgenomics.org/cv/lookup/TO

bradfordcondon commented 5 years ago

https://www.ebi.ac.uk/ols/ontologies/to/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000951

The term for purple is a root term on our site. On EBI, it's deep in the term tree.

This ontology is not loaded right.

almasaeed2010 commented 5 years ago

Reloading seems to fix it. This is it on dev after reloading. I'll do the same for live.

[Screenshot: 2019-04-15, 12:52:40 PM]
almasaeed2010 commented 5 years ago

done