monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Building OWL fails for multiple built-in templates - not recognising prefixes #325

Closed (dosumis closed 9 months ago)

dosumis commented 9 months ago

Example:

ontogpt extract --model gpt-4 -t gocam -i AT2_pulmonary_surfactant_response.txt -o AT2_pulmonary_surf_gocam.owl -O owl

Fails with:

slot_usage for undefined slot: id
File "<file>", line 623, col 19: Unrecognized prefix: rdfs
Unrecognized prefix: HGNC
Unrecognized prefix: PR
Unrecognized prefix: UniProtKB
Unrecognized prefix: PW
Unrecognized prefix: UBERON
Unrecognized prefix: NCBITaxon
Unrecognized prefix: EFO
Unrecognized prefix: CHEBI
Unrecognized prefix: biolink
: Unknown CURIE prefix: HGNC

Default (YAML) output works fine.

Files: AT2_pulmonary_surfactant_response.txt

dosumis commented 9 months ago

CC @hkir-dev

caufieldjh commented 9 months ago

Encountered an uncaught TypeError in the process of reproducing this, so I'm going to consider it part of the same issue:

Traceback (most recent call last):
  File "/home/harry/ontogpt/.venv/bin/ontogpt", line 6, in <module>
    sys.exit(main())
  File "/home/harry/ontogpt/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/ontogpt/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/harry/ontogpt/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/ontogpt/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/ontogpt/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/ontogpt/src/ontogpt/cli.py", line 355, in extract
    write_extraction(results, output, output_format, ke)
  File "/home/harry/ontogpt/src/ontogpt/cli.py", line 104, in write_extraction
    exporter.export(results, output, knowledge_engine.schemaview)
  File "/home/harry/ontogpt/src/ontogpt/io/owl_exporter.py", line 52, in export
    output.write(str(doc).encode("utf-8"))  # type: ignore
  File "/usr/lib/python3.10/codecs.py", line 377, in write
    data, consumed = self.encode(object, self.errors)
TypeError: utf_8_encode() argument 1 must be str, not bytes
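The last two frames suggest that `output` is a text-style writer that does its own UTF-8 encoding, so handing it already-encoded bytes fails. The following is a minimal sketch of that failure mode and one possible remedy, not the actual ontogpt code or the fix that was applied. The helper `write_document` and the use of `codecs.getwriter` over an in-memory buffer are assumptions made for illustration.

```python
import codecs
import io


def write_document(output, doc):
    """Hypothetical helper: pass text to the writer and let it do the encoding."""
    output.write(str(doc))


# Assumption: a codecs StreamWriter stands in for the CLI's output handle.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)

# Reproduce the failure: the writer expects str, but receives bytes.
try:
    writer.write("Ontology()".encode("utf-8"))
    error_message = None
except TypeError as err:
    error_message = str(err)

# Writing the text directly succeeds; the writer encodes it once.
write_document(writer, "Ontology( <http://w3id.org/ontogpt/gocam> )")
written = raw.getvalue()
```

The point of the sketch is that encoding should happen in exactly one place: either the caller encodes and writes to a binary handle, or the caller passes text to a writer that encodes.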

caufieldjh commented 9 months ago

I think that TypeError was the main problem. Here's the output OWL now:

Prefix( owl: = <http://www.w3.org/2002/07/owl#> )
Prefix( rdf: = <http://www.w3.org/1999/02/22-rdf-syntax-ns#> )
Prefix( rdfs: = <http://www.w3.org/2000/01/rdf-schema#> )
Prefix( xsd: = <http://www.w3.org/2001/XMLSchema#> )
Prefix( xml: = <http://www.w3.org/XML/1998/namespace> )
Prefix( linkml: = <https://w3id.org/linkml/> )
Prefix( gocam: = <http://w3id.org/ontogpt/gocam/> )
Prefix( GO: = <http://purl.obolibrary.org/obo/GO_> )
Prefix( CL: = <http://purl.obolibrary.org/obo/CL_> )
Prefix( core: = <http://w3id.org/ontogpt/core/> )
Prefix( NCIT: = <http://purl.obolibrary.org/obo/NCIT_> )
Prefix( RO: = <http://purl.obolibrary.org/obo/RO_> )
Prefix( shex: = <http://www.w3.org/ns/shex#> )
Prefix( schema: = <http://schema.org/> )

Ontology( <http://w3id.org/ontogpt/gocam>
    AnnotationAssertion( rdfs:label HGNC:10798 "SFTPA1" )
    AnnotationAssertion( rdfs:label HGNC:10799 "SFTPA2" )
    AnnotationAssertion( rdfs:label HGNC:10801 "SFTPB" )
    AnnotationAssertion( rdfs:label HGNC:10802 "SFTPC" )
    AnnotationAssertion( rdfs:label HGNC:10803 "SFTPD" )
    AnnotationAssertion( rdfs:label HGNC:33 "ABCA3" )
    AnnotationAssertion( rdfs:label HGNC:14582 "LAMP3" )
    AnnotationAssertion( rdfs:label <http://purl.obolibrary.org/obo/GO_0009058> "synthesis" )
    AnnotationAssertion( rdfs:label <http://purl.obolibrary.org/obo/GO_0046903> "secretion" )
    AnnotationAssertion( rdfs:label <http://purl.obolibrary.org/obo/GO_0015914> "phospholipid transport" )
    AnnotationAssertion( rdfs:label <http://purl.obolibrary.org/obo/GO_0051235> "storage" )
    AnnotationAssertion( rdfs:label <AUTO:regulation%20of%20surfactant%20metabolism%20and%20innate%20immunity> "regulation of surfactant metabolism and innate immunity" )
    AnnotationAssertion( rdfs:label <AUTO:lowering%20surface%20tension%20in%20the%20alveoli%20and%20essential%20for%20normal%20respiratory%20function> "lowering surface tension in the alveoli and essential for normal respiratory function" )
    AnnotationAssertion( rdfs:label <AUTO:spreading%20and%20stability%20of%20the%20surfactant%20film%20at%20the%20air-liquid%20interface%20of%20the%20alveolar%20surface> "spreading and stability of the surfactant film at the air-liquid interface of the alveolar surface" )
    AnnotationAssertion( rdfs:label <AUTO:immune%20defense%20of%20the%20lungs%20and%20also%20plays%20a%20role%20in%20surfactant%20homeostasis> "immune defense of the lungs and also plays a role in surfactant homeostasis" )
    AnnotationAssertion( rdfs:label <AUTO:transports%20phospholipids%20into%20lamellar%20bodies%20in%20AT2%20cells> "transports phospholipids into lamellar bodies in AT2 cells" )
    AnnotationAssertion( rdfs:label <AUTO:involved%20in%20the%20storage%20and%20secretion%20of%20surfactant%20lipids%20and%20proteins%20from%20lamellar%20bodies> "involved in the storage and secretion of surfactant lipids and proteins from lamellar bodies" )
    AnnotationAssertion( rdfs:label <AUTO:synthesis%20of%20pulmonary%20surfactant> "synthesis of pulmonary surfactant" )
    AnnotationAssertion( rdfs:label <AUTO:secretion%20of%20pulmonary%20surfactant> "secretion of pulmonary surfactant" )
    AnnotationAssertion( rdfs:label <AUTO:storage%20of%20surfactant%20lipids%20and%20proteins> "storage of surfactant lipids and proteins" )
    AnnotationAssertion( rdfs:label <AUTO:lamellar%20bodies%20in%20AT2%20cells> "lamellar bodies in AT2 cells" )
    AnnotationAssertion( rdfs:label <http://purl.obolibrary.org/obo/GO_0042599> "lamellar bodies" )
)

caufieldjh commented 9 months ago

The remaining warnings about unrecognized prefixes are likely because those prefixes aren't declared in the gocam schema (or in some of the other templates). I'll fix that in its own PR. Otherwise the OWL should write as expected.
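As a rough illustration of why those warnings appear, the sketch below compares the prefixes declared in the OWL output above with the prefixes used by a few of its identifiers. It is plain Python, not part of ontogpt or LinkML; the function name `undeclared_prefixes` and the sample lists are assumptions chosen for the example.

```python
# Prefixes declared in the Prefix(...) lines of the OWL output above.
declared = {
    "owl", "rdf", "rdfs", "xsd", "xml", "linkml", "gocam",
    "GO", "CL", "core", "NCIT", "RO", "shex", "schema",
}

# A few CURIEs taken from the AnnotationAssertion lines above.
curies = ["rdfs:label", "HGNC:10798", "HGNC:10799", "HGNC:33"]


def undeclared_prefixes(curies, declared):
    """Return the sorted prefixes used by CURIEs but missing from `declared`."""
    missing = set()
    for curie in curies:
        prefix, sep, _local = curie.partition(":")
        # Skip strings without a colon; they are not CURIEs.
        if sep and prefix not in declared:
            missing.add(prefix)
    return sorted(missing)


missing = undeclared_prefixes(curies, declared)
print(missing)
```

With these inputs only `HGNC` is reported, which is consistent with the `HGNC:` identifiers staying unexpanded in the output while the `GO` terms appear as full IRIs.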

dosumis commented 9 months ago

Awesome. Thanks for being so quick to fix!