related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Add TA features to CatBoost model #34

Closed by yonromai 11 months ago

yonromai commented 11 months ago

This PR introduces new logic to include Therapeutic Areas as features in the CatBoost model (see issue #33).
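As a rough illustration of what "Therapeutic Areas as features" could look like, here is a minimal sketch that derives one binary indicator per therapeutic area from a node's ancestors. This is not the code from this PR: the toy ontology, the therapeutic-area list, and the helper names `ancestors` and `ta_features` are all hypothetical, and the PR may encode the features differently.

```python
from collections import deque

# Hypothetical toy ontology as a DAG: child -> list of parents.
PARENTS = {
    "lung carcinoma": ["carcinoma", "lung disease"],
    "carcinoma": ["cancer"],
    "lung disease": ["respiratory disease"],
    "asthma": ["lung disease"],
    "cancer": [],
    "respiratory disease": [],
}

# Hypothetical therapeutic-area (TA) root nodes.
THERAPEUTIC_AREAS = ["cancer", "respiratory disease"]


def ancestors(node, parents):
    """Return every ancestor of `node`, excluding the node itself."""
    seen = set()
    queue = deque(parents.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def ta_features(node, parents, areas):
    """Build one 0/1 indicator per TA: 1 if the node is that TA or sits under it."""
    lineage = ancestors(node, parents) | {node}
    return {f"ta_{area}": int(area in lineage) for area in areas}


# One feature row per ontology node.
features = {node: ta_features(node, PARENTS, THERAPEUTIC_AREAS) for node in PARENTS}
```

Rows like these could then be joined with the existing features before fitting the model; a node under several therapeutic areas simply gets several indicators set.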

Gist: These new features capture a lot of useful signal to classify the labels (albeit significantly overlapping with the content of the text embedding features): [image]

Notes:

cc @eric-czech @dhimmel

eric-czech commented 11 months ago

> Gist: These new features capture a lot of useful signal to classify the labels (albeit significantly overlapping with the content of the text embedding features)

This is really great to know. I suspect cancer, and the associated TA feature, is the biggest contributor to this effect.

I'm somewhat indifferent now on whether to keep the embedding features in a final model, cf. https://github.com/related-sciences/nxontology-ml/issues/8#issuecomment-1736098847. I lean towards keeping them, though, since I know many parent vs. child relationships contain text patterns that should be important for this classification, and propagating the text terms across the ontology structure may be the best way to exploit that in the future (e.g. via GNNs).