Hi, I'm very interested in your trick. How did you modify the code to add `remove_duplicate_samples`? Thank you in advance!
@domitix I believe #259 implements the changes to reproduce these results.
@danstan5 Great analysis! I've personally always had some issues with the pair generation's naive approach. It's very interesting to see these results in practice.
@tomaarsen following on from your comments in #259, I'm inclined to explore your idea of generating the "proper" list further, but perhaps go even more fundamental:
With `num_iterations=20`, `num_samples=16`, `classes=2` → 640 samples will be generated.

With the `unique_pairs` approach (suggested in #259), up to 528 samples (~16 iterations) training is just under-sampled (against the total data we have available). Beyond that, as we start to duplicate samples, is this not the same as over-fitting to these randomly selected extra pairs?
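For reference, the arithmetic behind those numbers (a rough sketch only; it assumes `num_samples` counts labelled sentences per class and that each iteration draws one pair per sentence, which reproduces the figures above but is not taken from the SetFit source):

```python
from math import comb

num_iterations = 20
num_samples = 16        # assumed: labelled sentences per class
classes = 2

total_sentences = num_samples * classes              # 32 labelled sentences

# Pairs produced by the iteration-based sampling (duplicates allowed)
generated_pairs = num_iterations * total_sentences
print(generated_pairs)                               # 640

# Unique pairs available, identical pairs included
unique_pairs = comb(total_sentences, 2) + total_sentences
print(unique_pairs)                                  # 528

# Roughly how many iterations the unique set covers
print(unique_pairs / total_sentences)                # 16.5 -> ~16 iterations
```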
Would it be better to just do away with this sampling-iterations concept, take the true total no. of combinations as the "no. of samples", and then increase `num_epochs` or `learning_rate` for fitting?
Keen to know your thoughts, although I think testing will help to validate this!
In my example, `num_samples=16` was the total number of samples across the two classes. It was inspired by a test case from the first code block in the README. Having 16 total labeled sentences results in 136 unique pairs (or 120 without identical pairs), despite 640 samples being generated. I believe this is quite backwards, as I don't think a single epoch should train on the same samples multiple times.
If we implement `unique_pairs`, then I believe that we should take the total number of unique pairs as a strict maximum on the number of samples/steps per epoch. In our example, that would be 120 (or 136). If extra training is required, then the number of epochs should be incremented. That way, no pairs get extra weight and we lose the odd behaviour that training is frequently done in 1 epoch containing duplicate training samples.
Optionally, we may give a warning if `num_iterations * num_samples * 2 > n_unique_pairs`, and limit the number of samples to `n_unique_pairs`. I'm wary of this however, as this warning is pretty much impossible to avoid if you want to train for all unique pairs.
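To make that concrete, a minimal sketch of what the cap and warning could look like (illustrative only; `n_unique_pairs` and the surrounding names are hypothetical, not the actual SetFit API):

```python
import math
import warnings

# Example figures from above: 16 labelled sentences, identical pairs excluded.
n_sentences = 16
n_unique_pairs = math.comb(n_sentences, 2)            # 120

num_iterations, num_samples = 20, 16
requested_pairs = num_iterations * num_samples * 2    # 640

if requested_pairs > n_unique_pairs:
    warnings.warn(
        f"Requested {requested_pairs} pairs but only {n_unique_pairs} unique pairs exist; "
        "capping samples per epoch and training for more epochs instead."
    )

# Cap each epoch at the unique pairs and spill the remainder into extra epochs.
samples_per_epoch = min(requested_pairs, n_unique_pairs)          # 120
num_epochs = max(1, math.ceil(requested_pairs / n_unique_pairs))  # 6
```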
cc: @lewtun, thoughts on an optional `unique_pairs` parameter? See also #259 for additional discussion.
Closing this as the advantages that come through better sampling and natural limits have been addressed in #268.
Note: in hindsight, a lot of the speed-up that came from removing duplicate samples was because `add_data_augmentation` adds as many new samples as there are existing samples (this doubles the dataset size with lots of duplicates!).
Therefore what the analysis really shows is that at higher sample sizes, adding lots of duplicate augmented data has no impact on accuracy, but takes significantly longer to train.
Background
I've enjoyed using the SetFit library to get great results on text-classification tasks, thank you core developers for your work on this! However, with the larger datasets I'm working on (>100 classes, >20k samples) I found the training times to be slow, especially when wanting to hyperparameter-tune as well.
Proposal
The vast majority of the training time is in the contrastive learning stage, which scales with the number of sentence pairs included. There are probably a lot of ways to engineer the contrastive sentence pairs, but in looking to improve training times, a simple solution I fell upon (that fits nicely into the existing API) was just to remove duplicate pairs by adding a `remove_duplicate_samples` parameter to the SetFit `train` method.
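A minimal sketch of the deduplication idea (illustrative only; the actual implementation behind `remove_duplicate_samples`, and the exact pair representation SetFit uses, may differ):

```python
from sentence_transformers import InputExample

def remove_duplicate_samples(pairs):
    """Drop duplicate sentence pairs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for example in pairs:
        # Treat (a, b) and (b, a) as the same pair, and key on the label as well.
        key = (frozenset(example.texts), example.label)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

pairs = [
    InputExample(texts=["great phone", "love it"], label=1.0),
    InputExample(texts=["love it", "great phone"], label=1.0),  # duplicate, order reversed
]
print(len(remove_duplicate_samples(pairs)))  # 1
```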
Testing
Using the `run_fewshot.py` script (with some alterations to set up the no-duplicate runs) I have run this comparison on the test_set datasets with the following parameters:

- `num_iterations=40` (instead of 20) on the no-duplicate run - these tests run so much quicker anyway that it seemed generous to boost this slightly
- GPU: A100-SXM4-40GB
Commands for reproducibility
`Original`
```
python scripts/setfit/run_fewshot.py --num_iterations=20 --batch_size=32 --train_time=true --is_test_set=true --add_data_augmentation=true
```
`no_duplicate_samples`
```
python scripts/setfit/run_fewshot.py --num_iterations=40 --batch_size=32 --train_time=true --is_test_set=true --add_data_augmentation=true --remove_duplicate_samples=true --exp_name=remove-dups
```

Analysis
[Table: accuracy of Original vs no_duplicate_samples on ag_news, emotion, enron_spam, SentEval-CR, sst5 (acc) and amazon_counterfactual_en (matthews correlation)]

[Table: training time (s) of Original vs no_duplicate_samples on ag_news, emotion, enron_spam, SentEval-CR, sst5 and amazon_counterfactual_en]
SetFit % change accuracy: Original → no_duplicate_samples

SetFit training time difference: Original / no_duplicate_samples
Highlights
- Removing duplicates typically removes ~60-98% of the samples. Even after doubling `num_iterations`, the observed training times are 2-8x faster.
- Despite this, the results stay on average within 12% of the previous accuracy.
- At larger sample sizes (e.g. 32) the average accuracy slightly improved (probably a benefit of the higher no. of iterations) while training on average 4x faster!
Other comments
- At lower sample sizes (<16) the higher no. of iterations could be overfitting on a few samples. It's worth noting the main benefit of `remove_duplicate_samples` is reducing training times, which are not an issue on few-sample datasets.
- The considerably faster training times + hyperparameter tuning could probably be used to find better `lr` & `num_iterations` parameters to improve accuracy.
- Tests ran on A100 GPUs. On CPUs / lower-spec GPUs you would expect the runtime differences to get larger...