huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Dealing with large documents #275

Open angangwa opened 1 year ago

angangwa commented 1 year ago

Hi, thanks for making the library so usable! Asking for suggestions.

Context:

Problem:

Analyzing the results, e.g. with SHAP, it's clear that we could squeeze out better performance by using the entire document.
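
For reference, a minimal sketch of the kind of SHAP analysis meant here, assuming a trained `SetFitModel`; the model path and example text are placeholders, and `shap.maskers.Text` with a regex tokenizer is one common way to wrap a text classifier:

```python
import numpy as np
import shap
from setfit import SetFitModel

model = SetFitModel.from_pretrained("path/to/trained-setfit-model")  # placeholder path

def predict_proba(texts):
    # SHAP expects an array of shape (n_samples, n_classes)
    return np.asarray(model.predict_proba(list(texts)))

explainer = shap.Explainer(predict_proba, shap.maskers.Text(r"\W+"))
shap_values = explainer(["first paragraph of a long document ..."])
shap.plots.text(shap_values[0])  # token-level attributions per class
```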

Potential Approach:

The simplest idea is to break the document into chunks and train on those chunks. At inference, run predictions for all chunks and take a majority vote to decide the final class. An improvement would be to train a different "head" that can take multiple chunks together (a fixed number, as we want to avoid an LSTM).
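
A rough sketch of the chunk-and-vote idea, assuming a trained `SetFitModel`; the word-level chunking, chunk size, overlap, and model path are all illustrative choices:

```python
from collections import Counter
from setfit import SetFitModel

def chunk_text(text: str, chunk_size: int = 128, overlap: int = 32) -> list[str]:
    """Split a document into overlapping word-level chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

model = SetFitModel.from_pretrained("path/to/trained-setfit-model")  # placeholder path

def predict_document(text: str):
    chunks = chunk_text(text)
    preds = model.predict(chunks)  # one predicted label per chunk
    # predict may return a list, numpy array, or tensor depending on the
    # setfit version; normalize to plain Python values before voting
    labels = [p.item() if hasattr(p, "item") else p for p in preds]
    return Counter(labels).most_common(1)[0][0]  # majority vote
```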

Any experience / suggestions on this will greatly help :)

jaalu commented 1 year ago

This is akin to the approach suggested in "BERT for Long Documents: A Case Study of Automated ICD Coding", which computes embeddings per chunk and then applies per-class attention over the resulting chunk embeddings to implement the "head" you describe.
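
For illustration, a minimal PyTorch sketch of such a per-class (label-wise) attention head over chunk embeddings; the module name, shapes, and scoring scheme are assumptions for the sketch, not the paper's code:

```python
import torch
import torch.nn as nn

class PerClassAttentionHead(nn.Module):
    """One learned attention query per class, pooled over chunk embeddings."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, chunk_embeddings: torch.Tensor) -> torch.Tensor:
        # chunk_embeddings: (num_chunks, embed_dim), e.g. per-chunk
        # sentence-transformer outputs for one document
        attn = torch.softmax(self.class_queries @ chunk_embeddings.T, dim=-1)  # (num_classes, num_chunks)
        doc_repr = attn @ chunk_embeddings  # per-class document representation
        return self.classifier(doc_repr).squeeze(-1)  # (num_classes,) logits
```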

angangwa commented 1 year ago

@jaalu thanks. Indeed, we came across this, but as it happens, we didn't have the budget or time to build that ourselves!

I believe much simpler techniques can lead to an improvement. Building on #268, improving the creation of contrastive examples gives better results in our case.

EDIT: important = similar, i.e. what's common across all docs in the category.
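
For what it's worth, a hedged sketch of one way to bias contrastive pair creation toward category-common content: treat chunks drawn from same-class documents as positives and cross-class chunks as negatives. The `chunks_by_label` input and the negative-sampling ratio are assumptions, not the exact recipe from #268:

```python
import itertools
import random
from sentence_transformers import InputExample

def build_contrastive_pairs(chunks_by_label: dict[str, list[str]],
                            num_neg_per_pos: int = 1) -> list[InputExample]:
    """Build (anchor, other) pairs labeled 1.0 for same-class chunks
    and 0.0 for cross-class chunks, for contrastive fine-tuning."""
    examples = []
    labels = list(chunks_by_label)
    for label, chunks in chunks_by_label.items():
        for a, b in itertools.combinations(chunks, 2):
            examples.append(InputExample(texts=[a, b], label=1.0))  # same class -> similar
            for _ in range(num_neg_per_pos):
                other = random.choice([l for l in labels if l != label])
                neg = random.choice(chunks_by_label[other])
                examples.append(InputExample(texts=[a, neg], label=0.0))  # cross class -> dissimilar
    return examples
```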