Closed: matentzn closed this issue 1 year ago.
The argument for `-t` has to be the name of a template in the templates folder. We should make it such that it can be any of:
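For reference, a minimal sketch of the current usage (the template and input file names here are placeholders; the `extract -t ... -i ...` form is the one quoted later in this thread):

```bash
# `-t` must currently name a template in the templates folder
ontogpt extract -t my_template -i input.txt
```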
For now I would suggest making PRs, and we will err on the side of over-including different people's schemas in the main repo (people can always maintain long-running forks if they don't want to share). As we may need to refactor a bit, having them all in mono-repo style is probably best for now. But there will come a point where this won't scale... not there yet though!
Should I create a LinkML enhancement ticket (in the linkml repo) for your "we should make it" list?
This would be a piece of ontogpt functionality.
@matentzn I think I was able to get a custom one up and running in my fork. The specific YAML file can be found here. I did encounter a number of (undocumented?) challenges getting it up and running, though, which I can recount here.
To the maintainers: If this is the incorrect place to document these, let me know and I can create a new issue, but it seems topically related to the implementation of custom LinkML models.
1. `make` (and the Makefile's use of `datamodel-codegen`) has a tendency to mangle the capitalization conventions of the field names in your YAML file. This results in errors when the code attempts to populate the corresponding fields and/or validate the post-GPT query structure. I haven't dug too deeply into this, but the `datamodel-codegen` documentation makes reference to snake-casing options, and potentially other settings, so there may be a parameter to avoid this behavior (see sketch 1 at the end of this comment).
2. The `Makefile` documentation is (was?) inaccurate.
3. The `--cache-db` option for `extract` is extremely useful for debugging and diagnosing issues arising during the implementation of a custom workflow. I think it's definitely worth mentioning in a more prominent fashion in the documentation / README (sketch 2 below).
4. `enum` fields don't quite seem to work as viably as one might hope. It appears that this is because the descriptions one provides in the enum component of the YAML file don't actually make it into the compiled `.py` file, and so aren't seen by the GPT model in the resultant prompt, though the main field descriptions are (sketch 3 below). This results in responses that are in accordance with the main field description but violate the `enum` requirement, so the resultant structure fails `pydantic` validation. Realistically, though, this may be a boon, considering the token limit for the GPT prompt and response (4097).
5. The `max_tokens` setting can cause issues with larger prompt models. Perhaps there's a way to enable a pass-through parameter or to dynamically calculate the permissible remainder for the response (sketch 4 below)?
6. Using the `Makefile` requires the cloned repo (as I don't think it is included in the `pip install ontogpt` version?). However, making the `ontogpt extract -t [model] -i [targetText]` call (probably?) uses the `ontogpt` instance that is associated with your Python environment (which may not be the cloned repo?). It's quite likely that this approach to using ontogpt is not the recommended method (e.g. there's a better way to associate the custom fork of ontogpt with the Python environment), but my solution was to copy the files over to the environment's `site-packages` directory (sketch 5 below).

That's all that I can recall at the moment. Thank you to the maintainers for this powerful new workflow! Also, let me know if I should create a separate thread for this post.
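Sketch 1, for the casing issue in point 1: `datamodel-codegen` exposes a `--snake-case-field` flag that controls field-name casing. Whether the Makefile's invocation can simply add it is an assumption on my part, and the file names below are placeholders:

```bash
# hypothetical invocation; the real one lives in the repo's Makefile
datamodel-codegen \
  --input my_template.schema.json \
  --input-file-type jsonschema \
  --snake-case-field \
  --output my_template.py
```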
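Sketch 2, the `--cache-db` usage I found so helpful (the database path is arbitrary):

```bash
# cached GPT responses can then be inspected between runs
ontogpt extract -t my_template -i input.txt --cache-db cache.db
```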
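Sketch 3, an illustrative (made-up) LinkML snippet showing the two kinds of descriptions from point 4: the slot-level description reaches the prompt, while the per-value descriptions under the enum did not appear to:

```yaml
enums:
  SentimentEnum:
    permissible_values:
      POSITIVE:
        description: the passage expresses a favorable opinion    # not seen by the model
      NEGATIVE:
        description: the passage expresses an unfavorable opinion # not seen by the model
slots:
  sentiment:
    description: the overall sentiment of the passage  # this one does reach the prompt
    range: SentimentEnum
```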
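Sketch 4, what I mean by "dynamically calculate the permissible remainder" in point 5. This is not something ontogpt does today as far as I know; it's just the idea, assuming the `tiktoken` library:

```python
import tiktoken

MODEL_LIMIT = 4097  # combined prompt + completion limit mentioned above

def remaining_tokens(prompt: str, model: str = "gpt-3.5-turbo") -> int:
    """Tokens left for the completion after the prompt is counted."""
    encoding = tiktoken.encoding_for_model(model)
    return max(0, MODEL_LIMIT - len(encoding.encode(prompt)))
```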
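Sketch 5, a possibly cleaner alternative to copying files into `site-packages` for point 6: an editable install of the fork (standard pip behavior, though I haven't confirmed it's the maintainers' recommended route):

```bash
cd ontogpt        # the cloned fork
pip install -e .  # the environment's ontogpt now resolves to the clone
```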
Not sure if this is the right place for this question -- is there a way to integrate a custom LinkML schema with the pip-installed version of the package, as opposed to needing to fork the repo?
Hi @serenalotreck, this question doesn't seem that closely related to this issue. Can you open a new issue, or try asking on the LinkML community Slack?
This question is still related, but I'll transfer it to its own issue so I ensure it gets into the documentation.
I tried
but got: