The actual code change is small and is here; the rest only updates the tests.
This PR:
"Sorts" the nodes by the hash of their id, ensuring both a consistent order and an effective shuffle.
Makes this behavior the default and removes the shuffle and sort arguments: virtually all consuming code either shuffles the nodes or sorts them (for consistency). The new behavior satisfies both needs, so the parameters are no longer needed.
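For illustration, the idea of ordering by a hash of the id can be sketched as below. This is a minimal sketch, not the code from this PR: the function name `sort_by_id_hash`, the choice of SHA-256, and the sample ids are all assumptions made for the example. It uses `hashlib` rather than Python's built-in `hash()`, because the built-in string hash is randomized per process by default and would not give a stable order across runs.

```python
import hashlib


def sort_by_id_hash(node_ids):
    """Return the ids ordered by the SHA-256 digest of each id (hypothetical helper).

    The order depends only on the ids themselves, so it is identical across
    runs and input orderings, while bearing no relation to lexical order.
    """
    def digest(node_id):
        # Assumption: ids can be converted to str; the PR's actual hash may differ.
        return hashlib.sha256(str(node_id).encode("utf-8")).hexdigest()

    return sorted(node_ids, key=digest)


# Hypothetical ids, for illustration only.
ids = ["node_001", "node_002", "node_003", "node_010"]
ordered = sort_by_id_hash(ids)

# The result is independent of the order in which the ids arrive.
assert ordered == sort_by_id_hash(list(reversed(ids)))
```

Because the order is a pure function of the ids, a cache keyed on the first N nodes stays valid between experiments, which is the consistency that plain sorting was being used for.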
@dhimmel: This is not a very significant change, so I'll go ahead and merge. Feel free to comment here if you see something.
Moar Context
Because of (1) the cost of GPT-4 API calls and (2) the caching logic, I have been sorting the training set to get consistent caching when experimenting with GPT-4.
However, I noticed that there is a slight class imbalance when lexically sorting the nodes by id.
This PR reduces the class imbalance on 1000 samples:
Before:
After: