huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Check if max pairs limit reached in `generate_pairs` and `generate_multilabel_pairs` #549

Closed. OscarRunsCode closed this 2 months ago

HuggingFaceDocBuilderDev commented 2 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tomaarsen commented 2 months ago

Hello!

Apologies for the delay. I think this direction makes a lot of sense. I'm actually a bit surprised that I didn't think of this initially. Previously, if you set `max_pairs` to e.g. 20, you would often get something like 22 pairs, which is confusing and surprising. This only got worse the larger the discrepancy between classes.

I merged your changes with some other recent changes from this week (e.g. moving away from the `InputExample` class), and then I rewrote it all somewhat. I think the direction is a bit clearer now, while still being equivalent to what you wrote. If `max_pairs` is not set, the pairs are still sampled as before. All good there.

Thanks for putting this together and bringing it to my attention!
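
For readers following along, the behaviour discussed above (never returning more than `max_pairs` pairs, and sampling as usual when no limit is set) can be sketched roughly as follows. This is a minimal illustration, not SetFit's actual `generate_pairs` code: the function name `sample_pairs_capped`, its signature, and the use of `None` to mean "no limit" are all assumptions made for this example.

```python
import itertools
import random
from typing import List, Optional, Tuple


def sample_pairs_capped(
    sentences: List[str],
    labels: List[int],
    max_pairs: Optional[int] = None,  # assumption: None means "no limit"
    seed: int = 0,
) -> List[Tuple[str, str, float]]:
    """Hypothetical helper, not part of SetFit.

    Builds (sentence_a, sentence_b, similarity) pairs and checks the
    limit before every append, so the result never exceeds max_pairs.
    """
    rng = random.Random(seed)
    # Enumerate every unordered pair of distinct examples, then shuffle
    # so the cap does not always keep the same leading pairs.
    index_pairs = list(itertools.combinations(range(len(sentences)), 2))
    rng.shuffle(index_pairs)

    pairs: List[Tuple[str, str, float]] = []
    for i, j in index_pairs:
        # The key check: stop as soon as the limit is reached.
        if max_pairs is not None and len(pairs) >= max_pairs:
            break
        similarity = 1.0 if labels[i] == labels[j] else 0.0
        pairs.append((sentences[i], sentences[j], similarity))
    return pairs


# Toy data with imbalanced classes (4 examples of class 0, 2 of class 1).
sentences = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 1]

capped = sample_pairs_capped(sentences, labels, max_pairs=4)
uncapped = sample_pairs_capped(sentences, labels)
print(len(capped), len(uncapped))
```

Checking the limit inside the loop, rather than once per batch of positive or negative pairs, is what keeps the count from overshooting the requested maximum in this sketch.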