paperswithcode / galai

Model API for GALACTICA

Issues with BERTopic #70

Open ericchagnon15 opened 1 year ago

ericchagnon15 commented 1 year ago

I am trying to use Galactica as the embedding model for BERTopic and have tried a variety of methods to load and use the model, but I have encountered errors with each one.

Using the transformers library

I first tried the pipeline function from the transformers library:

from transformers import pipeline
from bertopic import BERTopic

galactica = pipeline("feature-extraction", model="facebook/galactica-1.3b")
topics, probs = BERTopic(embedding_model=galactica, nr_topics='auto', verbose=True).fit_transform(docs)

This results in the following TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 # NOT WORKING
----> 2 topics, probs = BERTopic(embedding_model=galactica, nr_topics='auto', verbose=True).fit_transform(docs)

9 frames
/usr/local/lib/python3.8/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    337             self.embedding_model = select_backend(self.embedding_model,
    338                                                   language=self.language)
--> 339             embeddings = self._extract_embeddings(documents.Document,
    340                                                   method="document",
    341                                                   verbose=self.verbose)

/usr/local/lib/python3.8/dist-packages/bertopic/_bertopic.py in _extract_embeddings(self, documents, method, verbose)
   2785             embeddings = self.embedding_model.embed_words(documents, verbose)
   2786         elif method == "document":
-> 2787             embeddings = self.embedding_model.embed_documents(documents, verbose)
   2788         else:
   2789             raise ValueError("Wrong method for extracting document/word embeddings. "

/usr/local/lib/python3.8/dist-packages/bertopic/backend/_base.py in embed_documents(self, document, verbose)
     67             that each have an embeddings size of m
     68         """
---> 69         return self.embed(document, verbose)

/usr/local/lib/python3.8/dist-packages/bertopic/backend/_hftransformers.py in embed(self, documents, verbose)
     58
     59         embeddings = []
---> 60         for document, features in tqdm(zip(documents, self.embedding_model(dataset, truncation=True, padding=True)),
     61                                        total=len(dataset), disable=not verbose):
     62             embeddings.append(self._embed(document, features))

/usr/local/lib/python3.8/dist-packages/tqdm/std.py in __iter__(self)
   1193
   1194         try:
-> 1195             for obj in iterable:
   1196                 yield obj
   1197                 # Update and possibly print the progressbar.

/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py in __next__(self)
    122
    123         # We're out of items within a batch
--> 124         item = next(self.iterator)
    125         processed = self.infer(item, **self.params)
    126         # We now have a batch of "inferred things".

/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py in __next__(self)
    123         # We're out of items within a batch
    124         item = next(self.iterator)
--> 125         processed = self.infer(item, **self.params)
    126         # We now have a batch of "inferred things".
    127         if self.loader_batch_size is not None:

/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py in forward(self, model_inputs, **forward_params)
    988             with inference_context():
    989                 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
--> 990                 model_outputs = self._forward(model_inputs, **forward_params)
    991                 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
    992         else:

/usr/local/lib/python3.8/dist-packages/transformers/pipelines/feature_extraction.py in _forward(self, model_inputs)
     81
     82     def _forward(self, model_inputs):
---> 83         model_outputs = self.model(**model_inputs)
     84         return model_outputs
     85

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() got an unexpected keyword argument 'token_type_ids'
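
A possible workaround for this TypeError (an untested sketch, not a verified fix) is to skip the pipeline entirely: tokenize each document by hand, drop the token_type_ids key that the tokenizer emits but the model's forward() rejects, and give BERTopic a precomputed embedding matrix via the embeddings argument of fit_transform. The mean pooling and the max_length value below are assumptions on my part, not values from either library's documentation:

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from bertopic import BERTopic

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = AutoModel.from_pretrained("facebook/galactica-1.3b")
model.eval()

embeddings = []
with torch.no_grad():
    for doc in docs:  # docs is the same document list used above
        inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=2048)
        inputs.pop("token_type_ids", None)  # the key the model's forward() rejects
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)
        embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())  # mean-pool over tokens
embeddings = np.vstack(embeddings)

# BERTopic skips its own embedding step when precomputed embeddings are passed
topics, probs = BERTopic(nr_topics="auto", verbose=True).fit_transform(docs, embeddings)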

Using galai load_model

When I tried the load_model function built into galai, the loading output indicated that the model downloaded properly. However, BERTopic's verbose output showed that its default embedding model was used instead of the loaded Galactica model. More examples of this can be seen in the issue I raised with BERTopic here. The author of that package indicated that the pipeline function from the transformers library is the proper way to use language models from Hugging Face.
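
For reference, a minimal sketch of what that attempt looked like, assuming galai's documented load_model API (where "base" corresponds to the 1.3b checkpoint):

import galai as gal
from bertopic import BERTopic

model = gal.load_model("base")  # "base" = galactica-1.3b; downloads weights on first call
topic_model = BERTopic(embedding_model=model, nr_topics="auto", verbose=True)
# BERTopic does not recognize galai's model wrapper as a supported backend,
# so it silently falls back to its default sentence-transformers model
topics, probs = topic_model.fit_transform(docs)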

Using flair

Finally, I tried to load the model via flair. As with the transformers library, fitting failed, but this time with a ValueError:

from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

flair_gal = TransformerDocumentEmbeddings('facebook/galactica-1.3b')
galactica_topics, galactica_probs = BERTopic(embedding_model=flair_gal, nr_topics='auto', verbose=True).fit_transform(clean_docs)

  0%|          | 0/499 [00:00<?, ?it/s]
Using pad_token, but it is not set yet.
  0%|          | 0/499 [00:00<?, ?it/s]

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
      1 # NOT WORKING
----> 2 galactica_topics, galactica_probs = BERTopic(embedding_model=flair_gal, nr_topics='auto', verbose=True).fit_transform(clean_docs)

10 frames
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py in _get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
   2423         # Test if we have a padding token
   2424         if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
-> 2425             raise ValueError(
   2426                 "Asking to pad but the tokenizer does not have a padding token. "
   2427                 "Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) "

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
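
A possible fix here (again an untested sketch) is to follow the error message's own suggestion and register a pad token on the underlying tokenizer before fitting. This assumes flair exposes the wrapped Hugging Face tokenizer and model as the tokenizer and model attributes of TransformerDocumentEmbeddings:

from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

flair_gal = TransformerDocumentEmbeddings('facebook/galactica-1.3b')

# Register a pad token, per the error message's suggestion
if flair_gal.tokenizer.pad_token is None:
    flair_gal.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    # If '[PAD]' was genuinely new to the vocabulary, the model's embedding
    # matrix must grow to match (assumed step; skip if the token already existed)
    flair_gal.model.resize_token_embeddings(len(flair_gal.tokenizer))

galactica_topics, galactica_probs = BERTopic(embedding_model=flair_gal, nr_topics='auto', verbose=True).fit_transform(clean_docs)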