ncbo / umls2rdf

These python scripts connect to the Unified Medical Language System (UMLS) database and translate the ontologies into RDF/OWL files. This is part of the BioPortal project.
http://bioportal.bioontology.org/

Still "writing" MTHSPL triples after 24 hrs, even with 244 GB RAM #29

Open turbomam opened 5 years ago

turbomam commented 5 years ago

I'm running the umls2rdf script on an Ubuntu 16 AWS EC2 server. I bump the RAM up to 128 GB when I'm doing this. I have extracted several other, larger sources with zero or minimal difficulty. I'm using UMLS 2018AA. I'm extracting on CUIs.

I haven't done any MySQL tuning, but the SQL portion of the extraction goes quickly... less than 5 minutes, I think. I have tried this with the MTHSPL content combined with other sources in a single mmsys extract/MySQL database, and I have also tried MTHSPL in a database all by itself, an approach that has helped with some of the other sources.

The triples writing has been going for over 1 day, but I don't think the Turtle file's size has grown beyond roughly 400 MB in the last 10 hours. top shows the python process at 100% CPU but a pretty small RAM usage... ~ 10 GB, I think.

The query select count(distinct CUI) from MRCONSO; run in an MTHSPL-only database, says there are 58,041 CUIs used by MTHSPL. I loaded the Turtle content that I have after one day into a triplestore, and the following query shows only 3,633 CUIs from MTHSPL:

PREFIX umls: <http://bioportal.bioontology.org/ontologies/umls/>
select (count(distinct ?o) as ?count)
where
{
    graph <https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MTHSPL/> {
        ?s umls:cui ?o
    }
}
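The same count can be cross-checked directly against the partial Turtle file, without loading it into a triplestore. This is a minimal Python sketch, assuming the umls:cui lines in the output look like the ones in the excerpts elsewhere in this thread (a triple-quoted literal starting with C). The function names are hypothetical and are not part of umls2rdf.

```python
import io
import re

# Matches lines such as:  umls:cui """C3486878"""^^xsd:string ;
# The line format is an assumption taken from the Turtle excerpts in this thread.
CUI_LINE = re.compile(r'umls:cui\s+"""(C\d+)"""')


def count_distinct_cuis(lines):
    """Return the number of distinct CUIs found in an iterable of Turtle lines."""
    seen = set()
    for line in lines:
        match = CUI_LINE.search(line)
        if match:
            seen.add(match.group(1))
    return len(seen)


def count_distinct_cuis_in_file(path):
    """Stream a Turtle file line by line so the whole file never sits in memory."""
    with io.open(path, encoding="utf-8") as handle:
        return count_distinct_cuis(handle)
```

Calling count_distinct_cuis_in_file("MTHSPL_only.ttl") while the conversion is still running gives a quick progress figure to compare against the MRCONSO count. A regex scan is not a Turtle parser, so it only works while the file keeps this one-predicate-per-line layout.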

turbomam commented 5 years ago

I'm trying again now with UMLS 2019AA and a fresh pull of umls2rdf.

Python 2.7 and Ubuntu 18 on an AWS EC2 x1e.2xlarge instance with 8 virtual CPUs, 244 GB RAM, and solid state storage provisioned at 4500 IOPS.

I set it up with MTHSPL as the only source:

MTHSPL,MTHSPL_only.ttl,load_on_cuis

It's been running for about 45 minutes now, most of that time completely idle: 0% CPU activity and 0 bytes/second of disk activity.

$ grep -c 'owl:Class' MTHSPL_only.ttl
189
$ ls -lh MTHSPL_only.ttl
-rw-rw-r-- 1 ubuntu ubuntu 7.1M Jun  2 00:38 MTHSPL_only.ttl

head:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix umls: <http://bioportal.bioontology.org/ontologies/umls/> .

<http://purl.bioontology.org/ontology/MTHSPL/>
    a owl:Ontology ;
    rdfs:comment "RDF Version of the UMLS ontology MTHSPL; converted with the UMLS2RDF tool (https://github.com/ncbo/umls2rdf), developed by the NCBO project." ;
    rdfs:label "MTHSPL" ;
    owl:imports <http://www.w3.org/2004/02/skos/core> ;
    owl:versionInfo "2019aa" .

<http://purl.bioontology.org/ontology/MTHSPL/C3486878> a owl:Class ;
        skos:prefLabel """CALCIUM FLUORIDE 30 [hp_C] in 1 mL ORAL PELLET [Sore Throat]"""@en ;
        skos:notation """C3486878"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0596235> ;
        <http://purl.bioontology.org/ontology/MTHSPL/has_inactive_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0038636> ;
        <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0006695> ;
        <http://purl.bioontology.org/ontology/MTHSPL/has_inactive_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0022949> ;
        <http://purl.bioontology.org/ontology/MTHSPL/DM_SPL_ID> """36500"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/LABELER> """Natural Health Supply"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/LABEL_TYPE> """HUMAN OTC DRUG LABEL"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/MARKETING_CATEGORY> """Unapproved homeopathic"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/MARKETING_EFFECTIVE_TIME_LOW> """19980604"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/MARKETING_STATUS> """active"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/NDC> """64117-748-02"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/NDC> """64117-748-01"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/SPL_SET_ID> """e8ec7791-c5b4-4f16-8a64-2cadb203800e"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/MTHSPL/UNAPPROVED_HOMEOPATHIC> """N/A"""^^xsd:string ;
        umls:cui """C3486878"""^^xsd:string ;
        umls:tui """T200"""^^xsd:string ;
        umls:hasSTY <http://purl.bioontology.org/ontology/STY/T200> ;
 .

tail:

<http://purl.bioontology.org/ontology/MTHSPL/C3818362> a owl:Class ;
    skos:prefLabel """ACETALDEHYDE 12 [hp_X] in 59 mL / ARSENIC TRIOXIDE 12 [hp_X] in 59 mL / BALSAM PERU 12 [hp_X] in 59 mL / OYSTER SHELL CALCIUM CARBONATE, CRUDE 12 [hp_X] in 59 mL / PHENOL 12 [hp_X] in 59 mL / CONIUM MACULATUM FLOWERING TOP 12 [hp_X] in 59 mL / COUMARIN 12 [hp_X] in 59 mL / SAFFRON 12 [hp_X] in 59 mL / HISTAMINE DIHYDROCHLORIDE 12 [hp_X] in 59 mL / LACHESIS MUTA VENOM 12 [hp_X] in 59 mL / LYCOPODIUM CLAVATUM SPORE 12 [hp_X] in 59 mL / PHOSPHORUS 12 [hp_X] in 59 mL / SEPIA OFFICINALIS JUICE 12 [hp_X] in 59 mL ORAL LIQUID [Allergies Fragrances and Phenolics]"""@en ;
    skos:notation """C3818362"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0019588> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C3696061> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0031705> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0031705> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C3489013> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C2346854> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0010206> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_inactive_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0043047> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0070570> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C3487991> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C3486868> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0010206> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0543456> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C3487991> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0052416> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_inactive_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0032841> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0000966> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0070477> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0070477> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C3486868> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C0000966> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C3484409> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C3484411> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0070570> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C3484409> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C3696061> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_inactive_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0724556> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C3484411> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_active_moiety> <http://purl.bioontology.org/ontology/MTHSPL/C3489013> ;
    <http://purl.bioontology.org/ontology/MTHSPL/has_inactive_ingredient> <http://purl.bioontology.org/ontology/MTHSPL/C0725616> ;
    <http://purl.bioontology.org/ontology/MTHSPL/MARKETING_EFFECTIVE_TIME_LOW> """20140324"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/LABELER> """King Bio Inc."""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/SPL_SET_ID> """9f673dc6-70e3-48f2-90a3-f38fc3a142d8"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/NDC> """57955-2205-2"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/LABEL_TYPE> """HUMAN OTC DRUG LABEL"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/MARKETING_CATEGORY> """Unapproved homeopathic"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/MARKETING_STATUS> """active"""^^xsd:string ;
    <http://purl.bioontology.org/ontology/MTHSPL/DM_SPL_ID> """237619"""^^xsd:string ;
    umls:cui """C3818362"""^^xsd:string ;
    umls:tui """T200"""^^xsd:string ;
    umls:hasSTY <http://purl.bioontology.org/ontology/STY/T200> ;
 .

After 9 hours

ubuntu@ip-172-31-94-83:/terabytes/umls2rdf/output$ grep -c 'owl:Class' MTHSPL_only.ttl
3412

That's ~ 350 classes/hour

https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MTHSPL/ says

MTHSPL contains approximately 148,045 drug products and 20,739 substances.

Is this really going to take 170,000/350 = 485 hours!?
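The back-of-the-envelope estimate can be redone from the raw figures. This is a minimal sketch, assuming the 9 hours is exact, the write rate stays constant, and the final class count equals the products plus substances quoted from the NLM page. None of these is confirmed, and with load_on_cuis the real class count may be lower. The function names are hypothetical.

```python
def classes_per_hour(classes_written, hours_elapsed):
    """Average write rate observed so far."""
    return classes_written / float(hours_elapsed)


def hours_remaining(total_classes, classes_written, rate_per_hour):
    """Time left at a constant rate; ignores any slowdown as the run progresses."""
    return (total_classes - classes_written) / float(rate_per_hour)


# Figures quoted in this thread.
written = 3412            # grep -c 'owl:Class' after about 9 hours
elapsed_hours = 9
total = 148045 + 20739    # drug products + substances (assumed upper bound)

rate = classes_per_hour(written, elapsed_hours)
eta = hours_remaining(total, written, rate)
print("rate: %.0f classes/hour" % rate)
print("remaining: %.0f hours (about %.0f days)" % (eta, eta / 24.0))
```

With these inputs the rate comes out near 379 classes/hour and the remaining time near 436 hours, the same order of magnitude as the 485-hour figure above.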

turbomam commented 5 years ago

If I set debug mode to True, the script fails with this traceback:

length atoms: 169417
Traceback (most recent call last):
  File "./umls2rdf.py", line 744, in <module>
    ont.load_tables()
  File "./umls2rdf.py", line 497, in load_tables
    sys.stderr.write("length atoms_by_aui: %d\n" % len(self.atoms_by_aui))
AttributeError: 'UmlsOntology' object has no attribute 'atoms_by_aui'
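The traceback suggests the debug branch reads an attribute that was never assigned on this code path. That is an inference from the error message alone, not from reading umls2rdf.py. One defensive pattern is to look the attribute up with getattr and a default before calling len on it. The sketch below uses a hypothetical stand-in class (OntologySketch, debug_sizes), not the real UmlsOntology, purely to show the pattern.

```python
import sys


class OntologySketch(object):
    """Hypothetical stand-in for the loader; not the real UmlsOntology."""

    def __init__(self):
        self.atoms = ["A0000001", "A0000002"]
        # atoms_by_aui is deliberately never set, to mimic the traceback above.

    def debug_sizes(self, names, stream=sys.stderr):
        """Report the length of each named attribute, tolerating missing ones."""
        reported = {}
        for name in names:
            value = getattr(self, name, None)
            if value is None:
                stream.write("%s: not populated\n" % name)
                reported[name] = None
            else:
                stream.write("length %s: %d\n" % (name, len(value)))
                reported[name] = len(value)
        return reported
```

Calling OntologySketch().debug_sizes(["atoms", "atoms_by_aui"]) writes a length for atoms and a "not populated" note for atoms_by_aui instead of raising AttributeError. This only keeps debug mode from crashing; it does not explain the slow triple writing.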