huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

SetFit ABSA creates a very large matrix #547

Open Erland366 opened 2 months ago

Erland366 commented 2 months ago

Hello, I just learned about SetFit and want to use it for my ABSA use case. My dataset has 50,000 rows, with a maximum of 511 tokens per row. When I use ABSATrainer on this dataset, I encounter this error:

  File "/home/azhar/miniforge3/envs/preskripsi/lib/python3.10/site-packages/setfit/trainer.py", line 502, in get_dataloader
    data_sampler = ContrastiveDataset(
  File "/home/azhar/miniforge3/envs/preskripsi/lib/python3.10/site-packages/setfit/sampler.py", line 68, in __init__
    self.generate_pairs()
  File "/home/azhar/miniforge3/envs/preskripsi/lib/python3.10/site-packages/setfit/sampler.py", line 90, in generate_pairs
    for (_text, _label), (text, label) in shuffle_combinations(self.sentence_labels):
  File "/home/azhar/miniforge3/envs/preskripsi/lib/python3.10/site-packages/setfit/sampler.py", line 29, in shuffle_combinations
    idxs = np.stack(np.triu_indices(n, k), axis=-1)
  File "/home/azhar/miniforge3/envs/preskripsi/lib/python3.10/site-packages/numpy/lib/twodim_base.py", line 1113, in triu_indices
    tri_ = ~tri(n, m, k=k - 1, dtype=bool)
  File "/home/azhar/miniforge3/envs/preskripsi/lib/python3.10/site-packages/numpy/lib/twodim_base.py", line 414, in tri
    m = greater_equal.outer(arange(N, dtype=_min_int(0, N)),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 46.6 GiB for an array with shape (223709, 223709) and data type bool
  1. How can I solve this error? Is it because I have too many rows? I saw another example in a GitHub issue that uses 200 rows. I tried 200 rows too but got exactly the same error.
  2. I don't really understand how SetFit works, so I don't know what to change to solve the error. Could you also explain a bit about how it works? I saw Contrastive in the training code, and the ~tri looks like a triangular matrix used for masking. Why does masking require such a huge matrix?
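As a sanity check on the numbers in the traceback, here is a minimal sketch (not SetFit code) of where the 46.6 GiB figure comes from. It assumes only what the traceback shows: `np.triu_indices(n, k)` first builds a full n x n boolean mask, and n is 223709. The helper names `full_mask_bytes` and `pair_count`, and the choice of k = 1 for the small demo, are my own assumptions for illustration.

```python
import numpy as np


def full_mask_bytes(n: int) -> int:
    """Bytes needed for an n x n boolean array (NumPy stores 1 byte per bool)."""
    return n * n * np.dtype(bool).itemsize


def pair_count(n: int) -> int:
    """Number of index pairs (i, j) with i < j, i.e. the strict upper triangle."""
    return n * (n - 1) // 2


# n taken from the shape reported in the error message: (223709, 223709)
n_error = 223709
gib = full_mask_bytes(n_error) / 2**30
print(f"{gib:.1f} GiB")  # 46.6 GiB, matching the error message

# The same call on a tiny input, to show what the indices represent:
# every unordered pair of items, listed as (i, j) with i < j.
small_pairs = np.stack(np.triu_indices(4, 1), axis=-1)
print(small_pairs.tolist())
# [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
```

So the triangular mask appears to be used to enumerate all pairs of training items rather than to mask attention, and its memory cost grows with the square of the number of items passed to the sampler, regardless of how long each text is.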