related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Add training data consistent shuffling #19

Closed yonromai closed 1 year ago

yonromai commented 1 year ago

TL;DR

The actual code change is small and is here; the rest only updates the tests.

This PR:

@dhimmel: This is not a very significant change, so I'll go ahead and merge. Feel free to comment here if you see something.

Moar Context

Because of (1) the cost of GPT-4 API calls and (2) the caching logic, I have been sorting the training set to get consistent caching when experimenting with GPT-4.

However, I noticed that lexicographically sorting the nodes by id yields a sample with a slight class imbalance relative to the whole dataset.
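For readers who want the idea in code, here is a minimal sketch of one way to get a consistent shuffle. It is not necessarily what this PR does: the function name `consistent_shuffle` and the `seed` value are hypothetical. The ids are sorted first so the result does not depend on input order, then shuffled with a fixed seed so any prefix of the result (for example the first 1000 nodes) stays the same across runs and keeps the cache stable.

```python
import random


def consistent_shuffle(node_ids, seed=42):
    """Return node ids in a pseudo-random order that is identical across runs.

    Hypothetical helper for illustration; the seed value is an assumption.
    """
    # Sort first so the output does not depend on the order of the input.
    ordered = sorted(node_ids)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(ordered)
    return ordered
```

Taking `consistent_shuffle(ids)[:1000]` would then give the same sample on every run, without the ordering bias of a purely sorted list.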

This PR reduces the class imbalance in a sample of 1000 nodes:

Before:

Sample class proportions: ({
   '01-disease-subtype': 0.433,
   '02-disease-root': 0.386, 
   '03-disease-area': 0.181
}, 1000)

Whole dataset class proportions: ({
   '01-disease-subtype': 0.5238459931769129,
   '02-disease-root': 0.3645477964213604, 
   '03-disease-area': 0.11160621040172666
}, 14363)

After:

Sample class proportions: ({
   '01-disease-subtype': 0.517, 
   '02-disease-root': 0.373, 
   '03-disease-area': 0.11
}, 1000)

Whole dataset class proportions: ({
   '01-disease-subtype': 0.5238459931769129, 
   '02-disease-root': 0.3645477964213604, 
   '03-disease-area': 0.11160621040172666
}, 14363)
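
Proportions like the ones above can be computed with a few lines of Python. This is only a sketch: `class_proportions` is a hypothetical name, and the `(proportions, count)` return shape is an assumption chosen to mirror the printed output.

```python
from collections import Counter


def class_proportions(labels):
    """Return ({class: fraction of labels}, total number of labels).

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    total = len(labels)
    # Sort by class name so the printed order is stable.
    return {cls: n / total for cls, n in sorted(counts.items())}, total
```

Calling it once on the 1000-node sample and once on the full label list gives the two numbers to compare.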