opencb / cellbase

High-performance NoSQL database and RESTful web services to access the most relevant biological data
Apache License 2.0

Parse OBO file #533

Open julie-sullivan opened 4 years ago

julie-sullivan commented 4 years ago

BioJava has an OBO parser:

https://github.com/biojava/biojava/blob/biojava-4.1.0/biojava-ontology/src/main/java/org/biojava/nbio/ontology/obo/OboFileParser.java

Is it fit for purpose?

julie-sullivan commented 4 years ago

I can't get the OBO parser from BioJava to work at all.

Here's my unit test that passed:

        BufferedReader bufferedReader = FileUtils.newBufferedReader(Paths.get(getClass()
                .getResource("/hp.obo").getPath()));

        OboParser parser = new OboParser();
        Ontology ontology = parser.parseOBO(bufferedReader, "Human phenotype ontology",
                "The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities " +
                        "and clinical features encountered in human disease.");
        assertEquals(5, ontology.getTerms().size());

        Term term = ontology.getTerm("HP:0000001");
        assertEquals("HP:0000001", term.getName());
        assertEquals("All", term.getDescription());

        term = ontology.getTerm("HP:0000002");
        assertEquals("HP:0000002", term.getName());
        assertEquals("Abnormality of body height", term.getDescription());

Why is HP:0000002 the name instead of the ID? Why is the description the name?

Here's the snippet:

[Term]
id: HP:0000001
name: All
comment: Root of all terms in the Human Phenotype Ontology.
xref: UMLS:C0444868

[Term]
id: HP:0000002
name: Abnormality of body height
def: "Deviation from the norm of height with respect to that which is expected according to age and gender norms." [HPO:probinson]
synonym: "Abnormality of body height" EXACT layperson []
xref: UMLS:C4025901
is_a: HP:0001507 ! Growth abnormality
created_by: peter
creation_date: 2008-02-27T02:20:00Z

I followed the cookbook exactly, so I must be doing something wrong, because that's not what I would expect at all.

I am going to try another library.
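
For comparison, here is a minimal hand-rolled sketch of reading `[Term]` stanzas like the two above into plain fields (`id`, `name`, `definition`, and so on). The class and method names (`OboSketch`, `OboTerm`, `parse`) are hypothetical and are not part of BioJava or CellBase. It assumes one tag per line, takes quoted text as everything between the first and last double quote on the line (so escaped quotes, or quotes inside the trailing `[...]` list, are not handled), and ignores every stanza type other than `[Term]`.

```java
import java.util.ArrayList;
import java.util.List;

final class OboSketch {

    /** Hypothetical holder for the tags used in the snippet above. */
    static final class OboTerm {
        String id;
        String name;
        String definition;
        String comment;
        final List<String> synonyms = new ArrayList<>();
        final List<String> xrefs = new ArrayList<>();
        final List<String> parents = new ArrayList<>();
    }

    /** Collects one OboTerm per [Term] stanza found in the given lines. */
    static List<OboTerm> parse(List<String> lines) {
        List<OboTerm> terms = new ArrayList<>();
        OboTerm current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("[")) {
                // A new stanza starts here; only [Term] stanzas are kept.
                current = null;
                if (line.equals("[Term]")) {
                    current = new OboTerm();
                    terms.add(current);
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (current == null || colon < 0) {
                continue; // file header, blank line, or a skipped stanza
            }
            String tag = line.substring(0, colon);
            String value = line.substring(colon + 1).trim();
            switch (tag) {
                case "id": current.id = value; break;
                case "name": current.name = value; break;
                case "comment": current.comment = value; break;
                case "def": current.definition = quoted(value); break;
                case "synonym": current.synonyms.add(quoted(value)); break;
                case "xref": current.xrefs.add(value); break;
                case "is_a": current.parents.add(beforeBang(value)); break;
                default: break; // created_by, creation_date, ... are ignored
            }
        }
        return terms;
    }

    /** Returns the text between the first and last double quote, or the value unchanged. */
    private static String quoted(String value) {
        int start = value.indexOf('"');
        int end = value.lastIndexOf('"');
        return (start >= 0 && end > start) ? value.substring(start + 1, end) : value;
    }

    /** Drops the trailing "! label" part of an is_a value. */
    private static String beforeBang(String value) {
        int bang = value.indexOf('!');
        return (bang >= 0 ? value.substring(0, bang) : value).trim();
    }
}
```

Something like `OboSketch.parse(Files.readAllLines(Paths.get("hp.obo")))` would feed it a whole file. With this mapping, the `id:` tag stays the identifier and the `name:` tag stays the name, which is what I expected from the BioJava parser.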

julie-sullivan commented 4 years ago

I also tried to use the unit test to help me figure out how to parse:

https://github.com/biojava/biojava/blob/biojava-4.1.0/biojava-ontology/src/test/java/org/biojava/nbio/ontology/TestOboFileParsing.java

:(

julie-sullivan commented 4 years ago
[
    {
        "id": "HP:0001187",
        "name": "Hyperextensibility of the finger joints",
        "definition": "The ability of the finger joints to move beyond their normal range of motion.",
        "namespace": "human_phenotype",
        "synonyms": [
            "Finger joint hyperextensibility"
        ],
        "xrefs": [
            "UMLS:C1844577"
        ],
        "parents": [
            "HP:0006094"
        ]
    },
    {
        "id": "HP:0025154",
        "name": "Portosystemic collateral veins",
        "definition": "Presence of biliary veins that serve as a collateral channel to the systemic circulation",
        "namespace": "human_phenotype",
        "comment": "Venous blood returning from the small intestine, stomach, pancreas and spleen converges into the portal vein. The terminal branches of the hepatic portal vein and hepatic artery empty together and mix as they enter sinusoids in the liver. Conditions such as liver cirrhosis, in which scar tissue partially blocks the normal flow of blood, may increases the pressure in the portal vein (portal hypertension).When blood flow through a vessel or a vascular bed is obstructed due to occlusion, collateral pathways open up as blood bypasses the occlusion or obstruction, and this can lead to portosystemic collateral veins in the case of cirrhosis and some other hepatobiliary diseases.",
        "synonyms": [
            "Collateral biliary circulation"
        ],
        "parents": [
            "HP:0012440"
        ]
    },
    {
        "id": "HP:0025153",
        "name": "Transient",
        "definition": "Short-lived and not permanent. This term applies to a phenotypic abnormality that is temporary and of short duration.",
        "namespace": "human_phenotype",
        "parents": [
            "HP:0011008"
        ]
    },
    {
        "id": "HP:0025152",
        "name": "Poor visual behavior for age",
        "definition": "Lack of visual responsiveness or decrease in visual capabilities suggesting a lack of visual responsiveness or decrease in visual capabilities in an infant or young child in which visual behavior fails to meet normal developmental milestones.",
        "namespace": "human_phenotype",
        "comment": "A failure to meet age-related milestones in areas such as (i) focusing ability, (ii) eye coordinationg and tracking of objects in the visual field, (iii) depth perception, (iv) color perception, and (v) object and face recognition. These milestones are generally met in the first three months of life, and failure to meet them may indicate abnormal visual development or function.",
        "synonyms": [
            "Abnormal visual behavior for age"
        ],
        "parents": [
            "HP:0000504"
        ]
    },
    {
        "id": "HP:0001188",
        "name": "Hand clenching",
        "definition": "An abnormal hand posture in which the hands are clenched to fists. All digits held completely flexed at the metacarpophalangeal and interphalangeal joints.",
        "namespace": "human_phenotype",
        "comment": "Hand clenching is commonly characterized by malpositioning of the fingers characterized by radial deviation of the 4th and 5th digits and ulnar deviation of the 2nd digit over the 3rd finger. Hand clenching is distinguished from Camptodactyly, as that term may describe fewer than five digits of a eudactylous hand and does not involve the MCPJ. The digits may overlap when they lie flexed in the palm. It is not necessary to specify the overlapping fingers finding separately.",
        "synonyms": [
            "Clenched hands"
        ],
        "xrefs": [
            "UMLS:C0239815"
        ],
        "parents": [
            "HP:0005922"
        ]
    }
]
julie-sullivan commented 4 years ago
switched to db cellbase_homo_sapiens_grch38_v4
> db.obo.count()
62395
> db.obo.findOne()
{
    "_id" : ObjectId("5e70dd7e299c136b01f0a74b"),
    "id" : "HP:0001187",
    "name" : "Hyperextensibility of the finger joints",
    "definition" : "The ability of the finger joints to move beyond their normal range of motion.",
    "namespace" : "human_phenotype",
    "synonyms" : [
        "Finger joint hyperextensibility"
    ],
    "xrefs" : [
        "UMLS:C1844577"
    ],
    "parents" : [
        "HP:0006094"
    ]
}

HPO currently contains over 13,000 terms