related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Experiment: Add GPT-4 labels as features in CatBoost model #28

Closed by yonromai 1 year ago

yonromai commented 1 year ago

Added a step to the model pipeline that adds the three GPT-4 labels (three because the config requests 3 completions per prompt) as categorical columns in the CatBoost model.
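
For illustration, a minimal, self-contained sketch of what feeding the GPT-4 labels to CatBoost as categorical features looks like (toy data and column names, not the actual pipeline code):

```python
from catboost import CatBoostClassifier, Pool
import pandas as pd

# Toy frame: a couple of numeric node features plus the three GPT-4 label columns
# (one per completion). Column names and values are made up for illustration.
X_train = pd.DataFrame({
    "depth": [2, 3, 5, 1],
    "n_children": [10, 0, 4, 25],
    "gpt4_label_0": ["high", "low", "medium", "high"],
    "gpt4_label_1": ["high", "low", "low", "high"],
    "gpt4_label_2": ["medium", "low", "medium", "high"],
})
y_train = ["high", "low", "medium", "high"]

# cat_features tells CatBoost to treat these columns as categorical rather than numeric.
cat_features = ["gpt4_label_0", "gpt4_label_1", "gpt4_label_2"]
train_pool = Pool(X_train, y_train, cat_features=cat_features)

model = CatBoostClassifier(iterations=100, learning_rate=0.1, verbose=False)
model.fit(train_pool)
```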

Ran 2 experiments with GPT-4 labels:

[image: results of the two experiments]

Notebook available here.

Next step: (quickly) play with the model parameters (learning rate, iterations & regularization), since these haven't been tuned for a while and the feature set has changed significantly since then.
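
Roughly, the sweep could look something like the following sketch, using CatBoost's built-in grid_search on synthetic data (illustrative parameter ranges, not the actual experiment code):

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

# Synthetic multiclass data stands in for the real feature set.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

param_grid = {
    "learning_rate": [0.03, 0.1, 0.3],
    "iterations": [200, 500],
    "l2_leaf_reg": [1, 3, 9],
}

# grid_search fits one model per parameter combination and reports the best one
# according to the validation metric.
model = CatBoostClassifier(verbose=False)
result = model.grid_search(param_grid, X=X, y=y, cv=3, verbose=False)
print(result["params"])
```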

cc @dhimmel @eric-czech

yonromai commented 1 year ago

Update: I ran a batch of experiments overnight to play with the learning rate, iterations, and regularization, but the effect was quite minor.

I'm not going to check in the code for the model tuning experiments, so I'm just leaving the graph here for posterity: [image: tuning experiment results]

eric-czech commented 1 year ago

> Ran 2 experiments with GPT-4 labels

Fascinating! I think we have our answer, then: move forward with the pca64 model, or some other model without the GPT4 features.

I suspect one explanation for the difference is that I know a lot of "keywords" make for good features. For example, the occurrence of "neoplasm" is a feature I had hard-coded in our internal feature set, because it often demarcates cancer terms in an uninformative way; I had chosen to make "neoplasm" terms low precision when they have a child term with essentially the same name, as in this example: kidney cancer

[screenshot: kidney cancer example, 2023-09-19]
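
Schematically, that kind of hard-coded keyword feature boils down to a binary indicator on the term name; this is just an illustrative sketch (only "neoplasm" comes from our feature set, the other keywords are made-up examples):

```python
# Illustrative sketch of a hard-coded keyword indicator feature; only "neoplasm"
# is mentioned above, the other keywords are hypothetical examples.
def keyword_features(term_name: str) -> dict[str, int]:
    keywords = ["neoplasm", "carcinoma", "syndrome"]
    name = term_name.lower()
    return {f"kw_{kw}": int(kw in name) for kw in keywords}

print(keyword_features("kidney neoplasm"))
# {'kw_neoplasm': 1, 'kw_carcinoma': 0, 'kw_syndrome': 0}
```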

Embeddings no doubt preserve information for those situations much more effectively. It's helpful to see that using both the GPT4 features and the embeddings simultaneously is not beneficial.

Anyways, this experimental design looks great to me and I love where it's ending up!


I'm intrigued by this comment in the notebook:

> GPUs do not support custom eval

Did you run any of the experiments on GPUs? Would love to hear more about that if so.

yonromai commented 1 year ago

> I suspect one explanation for the difference is that I know a lot of "keywords" make for good features. For example, the occurrence of "neoplasm" is a feature I had hard-coded in our internal feature set, because it often demarcates cancer terms in an uninformative way; I had chosen to make "neoplasm" terms low precision when they have a child term with essentially the same name, as in this example: kidney cancer

This makes a lot of sense @eric-czech! Also, the embedding space is very large (768 dimensions per vector), so it's possible for specialized words to drag the whole sentence vector into a very distinctive region of the space, making it feasible for the model to discriminate between the different classes more successfully.
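
For context, the pca64 features amount to this kind of reduction; a minimal sketch with random vectors standing in for the 768-dim sentence embeddings (the actual pipeline code may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random vectors stand in for the 768-dim sentence embeddings of the term texts.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))

# Reduce to 64 components, as in the "pca64" feature set mentioned above.
pca = PCA(n_components=64, random_state=0)
embeddings_64 = pca.fit_transform(embeddings)
print(embeddings_64.shape)  # (1000, 64)
```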

> Did you run any of the experiments on GPUs? Would love to hear more about that if so.

I did try to train with GPU mode enabled (apparently CatBoost can also leverage Apple's tensor hardware), but I wasn't able to use the custom MAE metric, which, as you can see in the graph above (mae vs. baseline), is very useful during training. It might be possible to compile the metric locally to get it to work, but I didn't spend much time on it.
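
For reference, the limitation is that CatBoost's Python-side custom eval metrics (a class implementing is_max_optimal / evaluate / get_final_error) only work with CPU training. A generic plain-MAE sketch of that interface, not our actual custom metric:

```python
from catboost import CatBoostRegressor

class MaeMetric:
    """Plain MAE via CatBoost's custom eval-metric interface."""

    def is_max_optimal(self):
        return False  # lower MAE is better

    def evaluate(self, approxes, target, weight):
        # approxes holds one array of raw predictions per approx dimension.
        preds = approxes[0]
        weights = weight if weight is not None else [1.0] * len(target)
        error_sum = sum(w * abs(p - t) for p, t, w in zip(preds, target, weights))
        return error_sum, sum(weights)

    def get_final_error(self, error, weight):
        return error / max(weight, 1e-38)

# CPU training accepts Python metric objects like this; GPU training does not
# (hence the "GPUs do not support custom eval" note in the notebook).
model = CatBoostRegressor(iterations=50, eval_metric=MaeMetric(), verbose=False)
model.fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 2.0, 3.0])
```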


I've been reflecting on these experimental results: it seems to me like there have been diminishing returns from investing more in feature engineering and we're hitting a performance ceiling of some sort.

If it turns out that we'd like a little more performance before using the model in the product, I think that working on labeling a few more records might be the best bang for the buck. If that's the way we want to go, we could easily re-purpose the current model to find which records are difficult to get right and prioritize labeling those.
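
Concretely, one simple way to re-purpose the model for that would be to rank records by how close the top two predicted class probabilities are; a sketch, assuming a fitted CatBoostClassifier (the helper name is made up):

```python
import numpy as np

def hardest_records(model, X, k: int = 100) -> np.ndarray:
    """Return the indices of the k records the model is least sure about."""
    proba = model.predict_proba(X)           # shape: (n_records, n_classes)
    top_two = np.sort(proba, axis=1)[:, -2:]
    margin = top_two[:, 1] - top_two[:, 0]   # small margin = ambiguous prediction
    return np.argsort(margin)[:k]
```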

eric-czech commented 1 year ago

> It seems to me like there have been diminishing returns from investing more in feature engineering and we're hitting a performance ceiling of some sort.

Sounds right to me.

> I think that working on labeling a few more records might be the best bang for the buck.

My recommendation for a next step, outside of anything @dhimmel had in mind for the MONDO presentation, would be to run an inference/prediction pipeline on a current version of EFO and see how the classifications look for terms that weren't in the original labeled set (i.e. from the old EFO v3.43.0). I would happily spend some time analyzing those predictions to see what we can learn, if anything, from misclassifications.
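
Schematically, something like the sketch below, where the toy frames and column names are placeholders for the real prediction output and the v3.43.0 labeled set:

```python
import pandas as pd

# Toy stand-ins for (a) model predictions on a current EFO release and
# (b) the original labeled set from EFO v3.43.0.
predictions = pd.DataFrame({
    "efo_id": ["EFO:0000001", "EFO:0000002", "EFO:0009999"],
    "predicted_precision": ["high", "low", "medium"],
})
labeled = pd.DataFrame({
    "efo_id": ["EFO:0000001", "EFO:0000002"],
    "precision": ["high", "medium"],
})

# Terms present in the new release but absent from the original labeled set.
new_terms = predictions[~predictions["efo_id"].isin(labeled["efo_id"])]
print(new_terms)
```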

yonromai commented 1 year ago

> > It seems to me like there have been diminishing returns from investing more in feature engineering and we're hitting a performance ceiling of some sort.
>
> Sounds right to me.
>
> > I think that working on labeling a few more records might be the best bang for the buck.
>
> My recommendation for a next step, outside of anything @dhimmel had in mind for the MONDO presentation, would be to run an inference/prediction pipeline on a current version of EFO and see how the classifications look for terms that weren't in the original labeled set (i.e. from the old EFO v3.43.0). I would happily spend some time analyzing those predictions to see what we can learn, if anything, from misclassifications.

Sounds good! I created a separate issue to discuss this ^ since I have a few questions but it's a bit unrelated to the current PR: https://github.com/related-sciences/nxontology-ml/issues/30