Refactor or better document loaders

cmungall commented 7 years ago

The ways in which ontologies and annotations are loaded is confusing and not well documented.

everything can be combined into one OWL file, which works for us in monarch but it not intuitive, and requires something like owltools to make the files
the yaml allows stratification of ontologies and instances TSVs. However, there are 2 kinds of TSVs: associations and labels. It seems only the former is supported
we need to make this extensible, e.g. to load frequency info for #56
we have phenopacket loading in a branch by @balhoff but not fully integrated yet?

ultimately we want to be able to fetch latest version from API, to avoid stale files in github

cc @jnguyenx @damiansm @kshefchek

julesjacobsen commented 7 years ago

migrating relevant comments over to here...

Whilst setting up the new code, I was able to create a new BMKnowledgeBase using the OWLLoader like so:

public BMKnowledgeBase bmKnowledgeBase() throws Exception {
        OWLLoader owlLoader = new OWLLoader();
        owlLoader.load("src/main/resources/ontologies/hp.obo");
        owlLoader.loadDataFromTsvGzip("src/main/resources/data/human-pheno.assocs.gz");
        return owlLoader.createKnowledgeBaseInterface();
    }

This loaded and started up OK, but when a set of HP terms (HP:0001156,HP:0001363,HP:0011304,HP:0010055) were passed in to the controller

@GetMapping(path = "phenodigm")
public MatchSet matchPhenodigm(@RequestParam(value = "id") Set<String> ids) throws IncoherentStateException {
        ProfileMatcher phenodigmProfileMatcher = PhenodigmICProfileMatcher.create(bmKnowledgeBase);
        logger.info("Created {} profile matcher", phenodigmProfileMatcher.getShortName());
        ProfileQuery query = ProfileQueryFactory.createQuery(ids);
        logger.info("Querying {} matcher with Class ids: {}", phenodigmProfileMatcher.getShortName(), query.getQueryClassIds());
        return phenodigmProfileMatcher.findMatchProfile(query);
    }

the following NPE was thrown

java.lang.NullPointerException: null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:210) ~[guava-18.0.jar:na]
    at org.monarchinitiative.owlsim.kb.impl.BMKnowledgeBaseOWLAPIImpl.getIndexForClassNode(BMKnowledgeBaseOWLAPIImpl.java:740) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.owlsim.kb.impl.BMKnowledgeBaseOWLAPIImpl.getIndex(BMKnowledgeBaseOWLAPIImpl.java:625) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.owlsim.kb.impl.BMKnowledgeBaseOWLAPIImpl.getClassIndex(BMKnowledgeBaseOWLAPIImpl.java:634) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.owlsim.compute.matcher.impl.AbstractProfileMatcher.getProfileSetBM(AbstractProfileMatcher.java:94) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.owlsim.compute.matcher.impl.PhenodigmICProfileMatcher.findMatchProfileImpl(PhenodigmICProfileMatcher.java:66) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.owlsim.compute.matcher.impl.AbstractProfileMatcher.findMatchProfileAll(AbstractProfileMatcher.java:223) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.owlsim.compute.matcher.impl.AbstractProfileMatcher.findMatchProfile(AbstractProfileMatcher.java:194) ~[owlsim-core-3.0-SNAPSHOT.jar:na]
    at org.monarchinitiative.controllers.MatchController.matchPhenodigm(MatchController.java:43) ~[classes/:na]

debugging the application it turned out this was because there was no curie mappings provided to the BMKnowledgebase, so this is a bit of an issue with the OWLLoader which provides no means of adding a set of mappings to the CurieUtil.

julesjacobsen commented 7 years ago

To fix the above, and to try and lead people through the creating of the BMKnowledgeBase, I've implemented a new OwlKnowlegeBase class. This is a bit of a mash-up of the OWLLoader and KnowledgeBaseModule which allows for fluent loading of a BMKnowledgeBase. For example

    OwlKnowledgeBase.loader()
            .loadCuries(curies())
            .loadOntology("src/main/resources/ontologies/hp.obo")
            .loadData("src/main/resources/data/human-pheno.assocs.gz")
            .createKnowledgeBase();

Where curies() defines a map {'HP' : 'http://purl.obolibrary.org/obo/HP_'} This partly addresses this issue. It's not fully ready yet, but it works - the main issue being that things need to be called in the correct order, which is clearly a bad thing.

drseb commented 7 years ago

Is the code of OwlKnowledgeBase available somewhere?

Also, what is the rational behind the format of the G2P annotations (human-pheno.assocs.gz) ? This is yet another format... can we create a public addX2Pannotation(...) method so that I can put content into the KB from my code?

julesjacobsen commented 7 years ago

Not yet available - I'm working on it, but will let you know when I'm done.

We need to think more on the actual import format used - I'm just taking what's currently in place and trying to consolidate code to reduce duplication. The loadData() method currently parses the data and loads it into an OWLOntology. It needn't have to take a file - it could accept a list of objects with an id (Individual) and a phenotype term (Class).

Alternatively, it might be worthwhile allowing people to define their own parsers and domain objects so long as these implement standard interfaces. That way you can load up a knowledgebase with any given set of data, providing they have a unique identifier and a set of associated phenotypes. Ultimately these all end up as asserted axioms of class to individual.

julesjacobsen commented 7 years ago

@drseb do you want to load these annotations as a Map or from file or both? I think it only really makes sense to load them in in bulk as they are all added to the ontology which is then passed to the BMKnowledgeBase. The data could either be a Map<String, List> where the key is the individual's identifier and the List are the curies to be associated with the individual. Alternatively a map or String key value pairs such that the individual's identifier is repeated with every class assertion. What would you prefer? @cmungall @jnguyenx - ideas?

drseb commented 7 years ago

Loading in bulk is totally ok. I would suggest to make the key of the Map generic - sometimes one may use int (entrez-id) and sometimes Strings (disease-id).

cmungall commented 7 years ago

We use URIs internally as identifiers. Anything coming in or our has to be a CURIE or a URI. We have all the prefixes you should need in the Monarch prefixmap, e.g. NCBIGene

drseb commented 7 years ago

And if my annotation set is a bunch of patients? Or papers?

julesjacobsen commented 7 years ago

I think you can define these in the CurieUtil initialisation map, making-up a namespace. So you could map PTNT:1, PTNT:2 etc. using the mapping "PTNT": "http://seb.org/PTNT_".

In the tests there are examples of gene identifiers just being used as is without any real URI, these are mapped to HP terms or whatever, so these will probably be OK, at least for matching purposes.

Papers presumably have a pubmed id or some such. These are mapped in Monarch using these identifiers:

# publication/reference sources
  'DOI' : 'http://dx.doi.org/'
  'GeneReviews' : 'http://www.ncbi.nlm.nih.gov/books/'  # diseases too
  'ISBN': 'https://monarchinitiative.org/ISBN_'
  'ISBN-10': 'https://monarchinitiative.org/ISBN10_'
  'ISBN-13': 'https://monarchinitiative.org/ISBN13_'
  'ISBN-15': 'https://monarchinitiative.org/ISBN15_'
  'J' : 'http://www.informatics.jax.org/reference/J:'  # MGI-internal identifiers for pubs
  'MPD':  'http://phenome.jax.org/'
  'MPD-assay': 'http://phenome.jax.org/db/qp?rtn=views/catlines&keymeas='
  'PMID': 'http://www.ncbi.nlm.nih.gov/pubmed/'
  'PMCID' : 'http://www.ncbi.nlm.nih.gov/pmc/'
  'AQTLPub' : 'http://www.animalgenome.org/cgi-bin/QTLdb/BT/qabstract?PUBMED_ID='
  'GO_REF' : 'http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:'

julesjacobsen commented 7 years ago

@drseb I've added a loadDataFromMap method for inserting individuals. Here is a complete code snippet for loading the HPO with a single ORPHA:710 individual:

Map<String, String> curies = new LinkedHashMap<>();
curies.put("ORPHA", "'http://www.orpha.net/ORDO/Orphanet_");
curies.put("HP", "http://purl.obolibrary.org/obo/HP_");

Map<String, Collection<String>> individuals = new LinkedHashMap<>();
individuals.put("ORPHA:710", Arrays.asList("HP:0000194",            
        "HP:0000218",                                               
        "HP:0000262",                                               
        "HP:0000303",                                               
        "HP:0000316",                                               
        "HP:0000322",                                               
        "HP:0000324",                                               
        "HP:0000348",                                               
        "HP:0000431",                                               
        "HP:0000470",                                               
        "HP:0000508",                                               
        "HP:0001156",                                               
        "HP:0001385",                                               
        "HP:0003307",                                               
        "HP:0004209",                                               
        "HP:0004322",                                               
        "HP:0005048",                                               
        "HP:0006101",                                               
        "HP:0009773",                                               
        "HP:0010669",                                               
        "HP:0011304",                                               
        "HP:0012368"));                                         

BMKnowledgeBase knowledgeBase = BMKnowledgeBase.owlLoader()
                .loadOntology("http://purl.obolibrary.org/obo/hp.owl")
                .loadCuries(curies)
                .loadDataFromMap(data)
                .createKnowledgeBase();

Will this work for you, or do you need some other way of loading in the individuals data?

monarch-initiative / owlsim-v3

Refactor or better document loaders #57