monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Multilanguage gpt #369

Closed by cmungall 1 month ago

caufieldjh commented 2 months ago

Some of the grounding is still fuzzier than it needs to be - or rather, it shouldn't include partial matches. Example:

input file name	correct diagnosis id	correct diagnosis name	predicted diagnosis ids	predicted diagnosis names	top1_match	any_match
PMID_4045952_IIIPMID_4045952_III.11-prompt.txt	OMIM:301310	Anemia, sideroblastic, and spinocerebellar ataxia	OMIM:MTHU001678|OMIM:MTHU005171|OMIM:557000|OMIM:234200	X-linked sideroblastic anemia with ataxia|Mitochondrial myopathy and sideroblastic anemia 1|Pearson marrow-pancreas syndrome|Neurodegeneration with brain iron accumulation 1	0	0
PMID_4045952_IVPMID_4045952_IV.55-prompt.txt	OMIM:301310	Anemia, sideroblastic, and spinocerebellar ataxia	OMIM:MTHU001678|OMIM:MTHU005171|OMIM:557000|OMIM:234200|OMIM:MTHU004242|OMIM:606829|OMIM:MTHU037698	X-linked sideroblastic anemia with ataxia|Mitochondrial myopathy and sideroblastic anemia 1|Pearson marrow-pancreas syndrome|Neurodegeneration with brain iron accumulation 1|Spinocerebellar Ataxia, Autosomal Recessive 1|Friedreich Ataxia with Retained Reflexes|Congenital Sideroblastic Anemia B	0	0
PMID_4045952_IVPMID_4045952_IV.1212-prompt.txt	OMIM:301310	Anemia, sideroblastic, and spinocerebellar ataxia	OMIM:MTHU001678|OMIM:MTHU005171|OMIM:557000|OMIM:234200|OMIM:MTHU004242|OMIM:606829|OMIM:MTHU037698|OMIM:300322|OMIM:MTHU036349|OMIM:615159|OMIM:MTHU011117	X-linked sideroblastic anemia with ataxia|Mitochondrial myopathy and sideroblastic anemia 1|Pearson marrow-pancreas syndrome|Neurodegeneration with brain iron accumulation 1|Spinocerebellar Ataxia, Autosomal Recessive 1|Friedreich Ataxia with Retained Reflexes|Congenital Sideroblastic Anemia B|Lesch-Nyhan syndrome|Ataxia-Telangiectasia|Mitochondrial complex III deficiency, nuclear type 4|Biotinidase deficiency	0	0
PMID_4045952_IVPMID_4045952_IV.1313-prompt.txt	OMIM:301310	Anemia, sideroblastic, and spinocerebellar ataxia	OMIM:MTHU001678|OMIM:MTHU005171|OMIM:557000|OMIM:234200|OMIM:MTHU004242|OMIM:606829|OMIM:MTHU037698|OMIM:300322|OMIM:MTHU036349|OMIM:615159|OMIM:MTHU011117|OMIM:208900|OMIM:277460|OMIM:MTHU068459	X-linked sideroblastic anemia with ataxia|Mitochondrial myopathy and sideroblastic anemia 1|Pearson marrow-pancreas syndrome|Neurodegeneration with brain iron accumulation 1|Spinocerebellar Ataxia, Autosomal Recessive 1|Friedreich Ataxia with Retained Reflexes|Congenital Sideroblastic Anemia B|Lesch-Nyhan syndrome|Ataxia-Telangiectasia|Mitochondrial complex III deficiency, nuclear type 4|Biotinidase deficiency|Ataxia-Telangiectasia-like Disorder|Ataxia with Vitamin E Deficiency|Ataxia, Early-Onset, with Oculomotor Apraxia and Hypoalbuminemia	0	0

OMIM:MTHU001678 shouldn't be in these results because it is just "X-linked", i.e. only a partial match.
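
To illustrate the "no partial matches" requirement, here is a minimal sketch of exact-label grounding. It is not ontogpt's actual grounding code: the function names and the small label table are hypothetical, and the "X-linked" label for OMIM:MTHU001678 is taken from the comment above.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so the comparison ignores formatting."""
    return " ".join(label.lower().split())


def ground_exact(mention: str, labels: dict[str, str]) -> list[str]:
    """Return IDs whose label equals the whole mention, never just part of it."""
    target = normalize(mention)
    return [term_id for term_id, label in labels.items() if normalize(label) == target]


# Hypothetical label table for illustration only.
labels = {
    "OMIM:MTHU001678": "X-linked",
    "OMIM:301310": "Anemia, sideroblastic, and spinocerebellar ataxia",
}

# "X-linked" is only a fragment of this mention, so nothing is returned.
partial_hits = ground_exact("X-linked sideroblastic anemia with ataxia", labels)

# A full-label match still grounds as expected.
exact_hits = ground_exact("anemia, sideroblastic, and spinocerebellar ataxia", labels)
```

With this rule, a mention grounds to an ID only when the entire normalized string matches a label, so a fragment like "X-linked" cannot pull in its own ID.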

caufieldjh commented 2 months ago

Additional steps/options:

caufieldjh commented 2 months ago

This will be run from https://github.com/monarch-initiative/malco

caufieldjh commented 2 months ago

Also don't want to match to grouping classes like OMIM:MTHU068459 (Ataxia)
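
One possible way to skip grouping classes, sketched under a strong assumption: that a grouping class can be recognized as any term that is the parent of another term. The `parents` mapping and its child-to-parent links below are hypothetical and are not taken from OMIM or from ontogpt.

```python
def drop_grouping_classes(candidates: list[str], parents: dict[str, str]) -> list[str]:
    """Keep only candidates that are not the parent of any other term."""
    grouping = set(parents.values())
    return [term_id for term_id in candidates if term_id not in grouping]


# Hypothetical child -> parent links, for illustration only.
parents = {
    "OMIM:208900": "OMIM:MTHU068459",
    "OMIM:277460": "OMIM:MTHU068459",
}

candidates = ["OMIM:208900", "OMIM:277460", "OMIM:MTHU068459"]
specific_only = drop_grouping_classes(candidates, parents)
```

A real implementation would need an authoritative source for the hierarchy; the point here is only that the filter runs after grounding and removes the broad term while keeping the specific ones.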

caufieldjh commented 1 month ago

To improve grounding:

caufieldjh commented 1 month ago

Easiest way to get phenotypic series from OMIM is to relax the ID filter.
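
As a rough sketch of what relaxing an ID filter could look like, assuming the filter is a regular expression over CURIEs and that phenotypic series use a `PS` prefix before the six-digit number. Both patterns and the example IDs are assumptions for illustration, not ontogpt's actual filter.

```python
import re

# Assumed strict filter: only six-digit OMIM entries.
STRICT_ID = re.compile(r"OMIM:\d{6}")
# Assumed relaxed filter: also accept a "PS" (phenotypic series) prefix.
RELAXED_ID = re.compile(r"OMIM:(PS)?\d{6}")


def keep_ids(ids: list[str], pattern: re.Pattern) -> list[str]:
    """Return the IDs that match the pattern in full."""
    return [term_id for term_id in ids if pattern.fullmatch(term_id)]


# "OMIM:PS123456" is a made-up placeholder, not a real series ID.
ids = ["OMIM:301310", "OMIM:PS123456", "OMIM:MTHU001678"]
strict = keep_ids(ids, STRICT_ID)
relaxed = keep_ids(ids, RELAXED_ID)
```

Under these assumptions the relaxed pattern admits the series ID while still rejecting the `MTHU` identifiers discussed earlier in the thread.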

caufieldjh commented 1 month ago

I believe this should do everything malco needs right now - if so, I will merge. If we want to attempt gene inference then that can go in its own PR.

justaddcoffee commented 1 month ago

I'm seeing this error when running this on ontogpt 21813e4 on a small test set:

(.venv) ~/PythonProject/malco_new/prompts add_code_to_dl_phenopacket_store $ poetry update
Updating dependencies
Resolving dependencies... (29.9s)

Package operations: 0 installs, 1 update, 0 removals

  • Updating ontogpt (0.3.11 5b4159e -> 0.3.11 21813e4)

Writing lock file
(.venv) ~/PythonProject/malco_new/prompts add_code_to_dl_phenopacket_store $ ontogpt run-multilingual-analysis --output=test.yaml /Users/jtr4v/PythonProject/malco_new/prompts/et/ /Users/jtr4v/PythonProject/malco_newoutputdir/
WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.
WARNING:ontogpt.engines.knowledge_engine:GPT4All client not available. GPT4All support will be disabled.
Traceback (most recent call last):
  File "/Users/jtr4v/PythonProject/malco_new/.venv/bin/ontogpt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/ontogpt/cli.py", line 1208, in run_multilingual_analysis
    multilingual_analysis(input_data_dir=input_data_dir,
  File "/Users/jtr4v/PythonProject/malco_new/.venv/lib/python3.11/site-packages/ontogpt/utils/multilingual.py", line 30, in multilingual_analysis
    output = codecs.open(output, "wb", encoding="utf-8")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 918, in open
TypeError: expected str, bytes or os.PathLike object, not LazyFile

caufieldjh commented 1 month ago

aha, thought I had fixed that error, but evidently not. Fix incoming.
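
For context, the traceback shows `codecs.open` being handed an already-wrapped file object (a `LazyFile`) instead of a path. The following is a minimal sketch of one way to tolerate both kinds of argument; it is not necessarily the fix that was merged. The helper names `open_output` and `write_report` are hypothetical, and the built-in `open` is used here in place of `codecs.open`.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_output(target):
    """Yield a writable text handle for either a path or a file-like object."""
    if hasattr(target, "write"):
        # Already a file-like object: use it directly and leave closing to the caller.
        yield target
    else:
        # A str, bytes, or os.PathLike path: open it here and close it afterwards.
        handle = open(target, "w", encoding="utf-8")
        try:
            yield handle
        finally:
            handle.close()


def write_report(target, text: str) -> None:
    """Write text to a path or to an open file-like object."""
    with open_output(target) as handle:
        handle.write(text)


# Usage with an in-memory file-like object.
buffer = io.StringIO()
write_report(buffer, "ok\n")
```

The other common approach is to declare the command-line option so the function receives a plain path string in the first place; which one fits depends on how the option is defined in `cli.py`.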

caufieldjh commented 1 month ago

@justaddcoffee please give it another try

justaddcoffee commented 1 month ago

thanks @caufieldjh! works as advertised now

caufieldjh commented 1 month ago

Excellent. Merging.