Dealing with large documents

angangwa commented 1 year ago

Hi, thanks for making the library so usable! Asking for suggestions.

Context:

Running SetFit on a small set of documents (financial; ~40 docs for 4 classes; ~10 docs per class for testing)
The documents in question are quite big (~10k+ characters), so only the beginning of the doc is used
Regardless, the performance is acceptable

Problem:

Analyzing the results, for e.g. with SHAP, its clear that we can squeeze better performance using the entire document.

Potential Approach:

Simplest idea is to break the document into chunks and train with those chunks. Run inference for all chunks and take a vote to decide the final class. An improvement would be to train a different "head" - that can take multiple chunks together (fixed number, as want to avoid LSTM).

Any experience / suggestions on this will greatly help :)

jaalu commented 1 year ago

This is akin to the approach suggested in BERT for Long Documents: A Case Study of Automated ICD Coding, which computes embeddings per-chunk and then applies per-class attention over the subsequent chunk embeddings to implement the "head" you describe.

angangwa commented 1 year ago

@jaalu thanks. Indeed we came across this, but as it happens, we didn't have the budget and time to build that ourselves!

I believe much simpler techniques can lead to an improvement. Building on #268 , improving creation of contrastive examples gives better results in our case:

Our docs are structured, so, we labelled parts of documents relevant to classification and only used those chunks during training. During inference, we filter out chunks with low confidence, and only vote on remaining chunks of text. This leads to an almost perfect accuracy in our case.
This will not scale as we move from less than 10 classes to more. And are on the lookout for techniques that can do the selection of important chunks automatically. Considering TextRank, clustering, etc.

EDIT: important = similar; whats common across all docs in the category.

huggingface / setfit

Dealing with large documents #275