related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Add more GPT-4 experiments #24

Closed · yonromai closed this 11 months ago

yonromai commented 11 months ago

( All "sections" references below belong to the "GPT-4 Experiments notebook", I wasn't able to hyperlink it :( )

This PR addresses the following comments:


@ravwojdyla

Somewhere in here, is the performance broken down by class? I'm essentially curious to check for Simpson's paradox.

See notebook section 1.2
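
(For context, a per-class breakdown along these lines can be computed with a simple groupby; the DataFrame and column names below are hypothetical and not the notebook's actual code:)

```python
import pandas as pd

# Hypothetical results frame: one row per node, with the true precision class
# and the label assigned by GPT-4.
results = pd.DataFrame({
    "true_label": ["01-disease-subtype", "02-disease-root", "01-disease-subtype"],
    "gpt4_label": ["01-disease-subtype", "03-disease-area", "02-disease-root"],
})

# Accuracy broken down by true class, to check whether the aggregate number
# hides per-class differences (Simpson's paradox).
per_class = (
    results.assign(correct=results["true_label"] == results["gpt4_label"])
    .groupby("true_label")["correct"]
    .agg(["mean", "count"])
)
print(per_class)
```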

Apart from maybe debugging some specific cases to make sure there are no big issues, if we want the GPT tagger to perform better, prompt engineering would certainly be an interesting improvement, and specifically CoT (https://github.com/related-sciences/nxontology-ml/pull/21#user-content-fn-1-5fad6721aad68a63d474ea48c7193973). Re https://github.com/related-sciences/nxontology-ml/pull/20 and other potential issues with GPT skipping or duplicating records:

See notebook section 4

(Gist: adding CoT didn't change the performance much. Note that I changed your prompt a bit; see prompts/rav_cot_precision_v1.txt.) Please tell me what you think, but it might be worth doing another pass at CoT prompts at a later time (i.e. after the workshop next week).
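
(For illustration only, and not the actual content of prompts/rav_cot_precision_v1.txt, the CoT variant essentially amounts to appending an instruction along these lines to the precision prompt:)

```python
# Hypothetical CoT instruction; the real prompt text lives in prompts/rav_cot_precision_v1.txt.
COT_SUFFIX = (
    "Before giving a precision label for each record, reason step by step about how "
    "specific the disease term is (narrow subtype vs. specific disease vs. broad area), "
    "then output the final low/medium/high label on its own line."
)
```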

maybe we could include a CSV (in the prompt) with specific record IDs and a missing precision column (there are only 2 columns), and ask GPT to fill in the precision in the CSV. I suspect it would be much less likely for GPT to produce "corrupted" output.

I ended up cutting that corner and postponing this idea since it would require significant changes to the way prompts are currently injected/constructed.
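
(For the record, a rough sketch of what that CSV-in-prompt idea could look like if revisited later; the helper name and prompt wording are hypothetical and do not match how prompts are currently constructed in this repo:)

```python
import io
import pandas as pd

def build_csv_prompt(efo_ids: list[str]) -> str:
    """Ask GPT to fill in the blank `precision` column of a two-column CSV."""
    csv_buf = io.StringIO()
    pd.DataFrame({"efo_id": efo_ids, "precision": ""}).to_csv(csv_buf, index=False)
    return (
        "Fill in the `precision` column (low/medium/high) for every row of the CSV "
        "below, keeping the `efo_id` column and the row order unchanged. "
        "Return only the completed CSV.\n\n" + csv_buf.getvalue()
    )

print(build_csv_prompt(["EFO:0007348", "EFO:1000646"]))
```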


@eric-czech

Do you have some examples you can share before scaling up? Specifically, I mean logs of prompts being sent and responses received for them. I think this is all within the realm of expectation, but a second look could help.

See notebook section 2.1

I'm not opposed to adding more few-shot examples. It would help to have some examples like I mentioned above ... we might be able to add in "difficult" cases as a part of the prompt to some net positive effect.

See notebook section 1.3 (and similarly 2.3)


Additionally, I wanted to see the impact of varying the number of completions ("choices") on the MAE.

(Gist: not surprisingly, more choices yield better performance, although the improvement is not drastic.)
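
(Concretely, the experiment just varies the `n` parameter of the chat completions request and aggregates the sampled labels before scoring. A simplified sketch, using the current openai Python client for illustration; the model name, prompt handling, and low/medium/high ranking are placeholders:)

```python
from statistics import mean, mode
from openai import OpenAI

client = OpenAI()
RANK = {"low": 0, "medium": 1, "high": 2}

def predict_precision(prompt: str, n_choices: int) -> str:
    """Sample `n_choices` completions and majority-vote the precision label."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        n=n_choices,
    )
    labels = [c.message.content.strip().lower() for c in resp.choices]
    return mode(labels)

def mae(pred_labels: list[str], true_labels: list[str]) -> float:
    """Mean absolute error on the ordinal low/medium/high scale."""
    return mean(abs(RANK[p] - RANK[t]) for p, t in zip(pred_labels, true_labels))
```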

eric-czech commented 11 months ago

Nicely organized @yonromai 👍. That notebook is a big help.

Do you have some examples you can share before scaling up?

See notebook section 2.1

Looking at a prompt and a corresponding completion doesn't turn up anything immediately concerning for me. Those classifications from GPT4 make sense for the most part.

Looking at some of the misclassifications didn't surface anything alarming either. I break them into 3 cases:

  1. Our "true" label is wrong; here are 3 examples:
    • "Malformation syndrome with hamartosis" (Orphanet:98196) should be 02-disease-root not 03-disease-area
    • "neurotoxicity" (EFO:0011057) should be 02-disease-root (or maybe 03-disease-area .. close call), not 01-disease-subtype; I agree w/ GPT4 in calling this a "low" precision term given only the label/description
    • "papillary carcinoma" (EFO:1000646) should be 02-disease-root not 03-disease-area
  2. The GPT4 label is inexplicable given the term details
    • e.g. "Osteosclerosis - ichthyosis - premature ovarian failure" does not suggest a low precision label
  3. The GPT4 label is right and the "true" label is actually determined by EFO implementation details (i.e. the structure)
    • Malformation syndrome with hamartosis is a good example of this where high precision labels from GPT4 are a great choice given only our definition of the task and the term details, yet the existence of descendants for this term in EFO would dictate it's probably better as a medium term.

Overall, I would say I agree with ~80% of the GPT4 classifications in these misclassified examples. To be clear, this does not suggest that 80% of those true labels are wrong; making this classification using only labels + descriptions is a somewhat orthogonal task to doing it within the confines of EFO's structure (i.e. lots of these fall into case #3 above).

I do think we can add these misclassified examples to the prompt though:

- id: Orphanet:75325
  label: Osteosclerosis - ichthyosis - premature ovarian failure
  definition: NA
  precision: high
- id: EFO:0007237
  label: dipetalonemiasis
  definition: A filariasis that is a zoonotic infection caused by the nematode of the genus Dipetalonema.
  precision: high
- id: MONDO:0013742
  label: familial mesial temporal lobe epilepsy with febrile seizures
  definition: NA
  precision: high
- id: MONDO:0002321
  label: sensory peripheral neuropathy
  definition: Inflammation or degeneration of the sensory nerves.
  precision: medium
- id: EFO:1000646
  label: papillary carcinoma
  definition: A malignant epithelial neoplasm characterized by a papillary growth pattern.
  precision: medium
- id: EFO:0011057
  label: neurotoxicity
  definition: Toxicity that causes injury to the central or peripheral nervous system or damages its function.
  precision: low

That would increase our few-shot example count from 9 to 15.
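
(If useful, here's a small sketch of how those extra records could be rendered into the few-shot block of the prompt; the YAML file path and formatting helper are assumptions, not the repo's current prompt-construction code:)

```python
import yaml  # pip install pyyaml

# The six additional examples above, stored e.g. as few_shot_extra.yaml (hypothetical path).
with open("few_shot_extra.yaml") as fh:
    extra_examples = yaml.safe_load(fh)

def render_example(ex: dict) -> str:
    """Render one few-shot record in the same id/label/definition/precision layout."""
    return (
        f"- id: {ex['id']}\n"
        f"  label: {ex['label']}\n"
        f"  definition: {ex.get('definition') or 'NA'}\n"
        f"  precision: {ex['precision']}"
    )

few_shot_block = "\n".join(render_example(ex) for ex in extra_examples)
```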

ravwojdyla commented 11 months ago

Ditto - awesome work @yonromai!

Most high-distance-from-true-label nodes are of class 01-disease-subtype

So that means GPT would frequently assign low precision to something that is actually high precision? AFAIU that would fall into the 2nd bucket in Eric's misclassification breakdown. Here's one example of that from the notebook:

{'dist': 6,
 'efo_definition': 'A viral infectious disease that results_in infection in sheep and rarely humans, has_material_basis_in Louping ill virus, which is transmitted_by sheep tick, Ixodes ricinus. The infection has_symptom lethargy, has_symptom muscle pains, has_symptom fever, and has_symptom focal neurological signs.',
 'efo_id': 'EFO:0007348',
 'efo_label': 'louping ill',
 'precisions': ['low', 'low', 'low'],
 'true_label': '01-disease-subtype'}

I'm surprised GPT would assign low precision here; just at a high level, based on the description, it does sound like a high-precision disease, given the details. It would be interesting to have the reasoning/CoT for this case, for debugging purposes. I guess we do see that in the per-class scores:

[image: per-class scores]

Thanks for those btw ^ 🙏

Please tell me what you think, but it might be worth doing another pass at CoT prompts at a later time (i.e. after the workshop next week).

Thanks for giving that a try. Do you happen to have the explanations available somewhere, e.g. for EFO:0007348 above? WRT further prompt tuning (after the workshop):

yonromai commented 11 months ago

(I'd like to rebase this notebook on an upcoming change, so I'll merge this PR and open a new one soon with updates.)