neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

embeddings.index Truncation RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1 #74

Closed · shinthor closed this issue 3 years ago

shinthor commented 3 years ago

Hello, when I try to run the indexing step, I get this error.

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-33-6e863ca8aecc> in <module>
----> 1 embeddings.index(to_index)

~\Anaconda3\envs\bert2\lib\site-packages\txtai\embeddings.py in index(self, documents)
     80 
     81         # Transform documents to embeddings vectors
---> 82         ids, dimensions, stream = self.model.index(documents)
     83 
     84         # Load streamed embeddings back to memory

~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in index(self, documents)
    245                 if len(batch) == 500:
    246                     # Convert batch to embeddings
--> 247                     uids, dimensions = self.batch(batch, output)
    248                     ids.extend(uids)
    249 

~\Anaconda3\envs\bert2\lib\site-packages\txtai\vectors.py in batch(self, documents, output)
    279 
    280         # Build embeddings
--> 281         embeddings = self.model.encode(documents, show_progress_bar=False)
    282         for embedding in embeddings:
    283             if not dimensions:

~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
    192 
    193             with torch.no_grad():
--> 194                 out_features = self.forward(features)
    195 
    196                 if output_value == 'token_embeddings':

~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
    117     def forward(self, input):
    118         for module in self:
--> 119             input = module(input)
    120         return input
    121 

~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~\Anaconda3\envs\bert2\lib\site-packages\sentence_transformers\models\Transformer.py in forward(self, features)
     36             trans_features['token_type_ids'] = features['token_type_ids']
     37 
---> 38         output_states = self.auto_model(**trans_features, return_dict=False)
     39         output_tokens = output_states[0]
     40 

~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    962         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
    963 
--> 964         embedding_output = self.embeddings(
    965             input_ids=input_ids,
    966             position_ids=position_ids,

~\Anaconda3\envs\bert2\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~\Anaconda3\envs\bert2\lib\site-packages\transformers\models\bert\modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    205         if self.position_embedding_type == "absolute":
    206             position_embeddings = self.position_embeddings(position_ids)
--> 207             embeddings += position_embeddings
    208         embeddings = self.LayerNorm(embeddings)
    209         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1

Where to_index =

 [('0015023cc06b5362d332b3baf348d11567ca2fbb',
  'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3\nword count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without permission. Abstract 27 The positive stranded RNA genomes of picornaviruses comprise a single large open reading 28 frame flanked by 5′ and 3′ untranslated regions (UTRs). Foot-and-mouth disease virus (FMDV) 29 has an unusually large 5′ UTR (1.3 kb) containing five structural domains. These include the 30 internal ribosome entry site (IRES), which facilitates initiation of translation, and the cis-acting 31 replication element (cre). Less well characterised structures are a 5′ terminal 360 nucleotide 32 stem-loop, a variable length poly-C-tract of approximately 100-200 nucleotides and a series of 33 two to four tandemly repeated pseudoknots (PKs). We investigated the structures of the PKs 34 by selective 2′ hydroxyl acetylation analysed by primer extension (SHAPE) analysis and 35 determined their contribution to genome replication by mutation and deletion experiments. 36 SHAPE and mutation experiments confirmed the importance of the previously predicted PK 37 structures for their function. Deletion experiments showed that although PKs are not essential 38',
  None),
 ('00340eea543336d54adda18236424de6a5e91c9d',
  'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications\nDuring the past three months, a new coronavirus (SARS-CoV-2) epidemic has been growing exponentially, affecting over 100 thousand people worldwide, and causing enormous distress to economies and societies of affected countries. A plethora of analyses based on viral sequences has already been published, in scientific journals as well as through non-peer reviewed channels, to investigate SARS-CoV-2 genetic heterogeneity and spatiotemporal dissemination. We examined all full genome sequences currently available to assess the presence of sufficient information for reliable phylogenetic and phylogeographic studies. Our analysis clearly shows severe limitations in the present data, in light of which any finding should be considered, at the very best, preliminary and hypothesis-generating. Hence the need for avoiding stigmatization based on partial information, and for continuing concerted efforts to increase number and quality of the sequences required for robust tracing of the epidemic.',
  None),
 ('004f0f8bb66cf446678dc13cf2701feec4f36d76',
  'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China\n',
  None), ...]

How do I fix this? I don't see anything about this in the documentation. I assume the warning message:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. 
Default to no truncation.

is related, and that I need to set a max_length of 512 so that any documents longer than 512 tokens get truncated, but I can't find how to do that...
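
For illustration, this is roughly the cap I'm after when calling sentence-transformers directly (assuming its standard max_seq_length attribute); I just don't see how to do the equivalent through txtai:

from sentence_transformers import SentenceTransformer

# Rough sketch, not a txtai setting: cap the encoder's sequence length so inputs
# longer than 512 tokens are truncated before they reach the position embeddings.
model = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")
model.max_seq_length = 512

embedding = model.encode("a very long document ..." * 100)
print(embedding.shape)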

davidmezzetti commented 3 years ago

Thank you for reporting this issue!

Can you share the transformer model you are using for this test?

shinthor commented 3 years ago

That is using

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "scoring": "bm25"})
davidmezzetti commented 3 years ago

Thank you. It looks like sentence-transformers 1.x no longer defaults the tokenizer.max_seq_length, which is fine for models that have a model_max_length set. In the case of the model here, it does not, hence the error.

I will add an additional config option to optionally set the tokenizer.max_seq_length to address this issue.

In the meantime, you have a couple of different options:

  1. Use a different model that has model_max_length set, for example "sentence-transformers/stsb-distilbert-base" (see the quick check after this list)
  2. Downgrade to sentence-transformers==0.4.1
  3. Wait for the next version of txtai, which will let you use this model and set maxlength=512 manually
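
As a rough check for option 1 (plain transformers tokenizer API, nothing txtai-specific), a model can be used as-is when its tokenizer reports a real model_max_length instead of the huge "unset" sentinel:

from transformers import AutoTokenizer

# Tokenizers without a configured maximum report an enormous sentinel value
for path in ["sentence-transformers/stsb-distilbert-base",
             "sentence-transformers/bert-base-nli-mean-tokens"]:
    tokenizer = AutoTokenizer.from_pretrained(path)
    print(path, tokenizer.model_max_length)

# A value like 512 means the model works out of the box; a value around 1e30 means
# truncation has to be configured manually, which is what this issue is about.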

After the fix it will look something like this:

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})
shinthor commented 3 years ago

Thank you, you're absolutely right! Downgrading sentence-transformers to 0.4.1 did the trick for now!

davidmezzetti commented 3 years ago

Just committed a fix for this. You can now also try installing the latest master branch: pip install git+https://github.com/neuml/txtai

Otherwise, it will be in txtai 3.0

shinthor commented 3 years ago

> Just committed a fix for this. You can now also try installing the latest master branch: pip install git+https://github.com/neuml/txtai
>
> Otherwise, it will be in txtai 3.0

Actually, pip install git+https://github.com/neuml/txtai still gives me the error, whether I use sentence-transformers==0.4.1 or the most recent version of sentence-transformers. I think there may be an issue with the fix you committed? The only combination that has worked for me is the PyPI version of txtai with sentence-transformers 0.4.1.

davidmezzetti commented 3 years ago

Did you run it like this?

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})

The sentence-transformers change is actually an improvement, as defaulting to 128 tokens isn't the best strategy. But it does require an extra parameter if the model doesn't have model_max_length set.

shinthor commented 3 years ago

You're right, it does work if I include it like that. However, it doesn't work for the Similarity pipeline: if I use Similarity("gsarti/scibert-nli") or Similarity("bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16"), I get the same error.

davidmezzetti commented 3 years ago

Thank you for confirming this works for Embeddings.

This Similarity issue is similar but separate from the one here. Each pipeline needs to pass kwargs of {"truncation": True} when truncation isn't otherwise specified. I've added #79 to address this.
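
To illustrate the idea at the tokenizer level (a generic transformers example, not the actual pipeline code):

from transformers import AutoTokenizer

# Without truncation, a long input yields more than 512 token ids, overflowing BERT's
# 512 position embeddings; with truncation it is cut to the requested maximum.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "replication of the viral genome " * 200

encoded = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 512)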

shinthor commented 3 years ago

Any pointers on how to do this? I've noticed that, strangely, Similarity("gsarti/covidbert-nli") works fine in my code but Similarity("gsarti/scibert-nli") does not, despite both being BERT models with "max_position_embeddings": 512 in their config.json.

shinthor commented 3 years ago

I had tried copying what you did in the Summary pipeline into labels.py in my fork:

kwargs = {"truncation": True, "max_length": 512, "max_seq_length": 512}
# Run ZSL pipeline
results = self.pipeline(text, labels, multi_label=multilabel, **kwargs)

but that seemingly made no difference in the results.

davidmezzetti commented 3 years ago

It's really about the Tokenizer.

>>> tokenizer = AutoTokenizer.from_pretrained("gsarti/scibert-nli")
>>> tokenizer.model_max_length
1000000000000000019884624838656

Something that may work for your local testing:

kwargs = {"truncation": True,}
# Run ZSL pipeline
self.pipeline.tokenizer.model_max_length = 512
results = self.pipeline(text, labels, multi_label=multilabel, **kwargs)

I think transformers assumes this field is always set on the tokenizer. Currently, max_length can't be overridden in calls to the tokenizer for hf pipelines. It may be a worthwhile upstream change in transformers to allow that field to be set. https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/base.py#L638

davidmezzetti commented 3 years ago

Going to reopen this issue. I think I can make this less manual. The plan is to remove the maxlength parameter I just added and instead copy the max_position_embeddings config parameter when the tokenizer doesn't have model_max_length set.

davidmezzetti commented 3 years ago

Just committed a fix that should address both embeddings and pipelines. The maxlength parameter is no longer needed; the max_position_embeddings config parameter is used when a maximum length isn't detected in the tokenizer.
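
Roughly, the fallback behaves like the sketch below (a simplified illustration with a hypothetical resolve_maxlength helper, not the actual txtai code):

from transformers import AutoConfig, AutoTokenizer

def resolve_maxlength(path):
    # Prefer the tokenizer's model_max_length when it is set to something sensible,
    # otherwise fall back to the model config's max_position_embeddings
    tokenizer = AutoTokenizer.from_pretrained(path)
    if tokenizer.model_max_length < 1e30:
        return tokenizer.model_max_length
    return AutoConfig.from_pretrained(path).max_position_embeddings

print(resolve_maxlength("gsarti/scibert-nli"))  # 512 via max_position_embeddings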

shinthor commented 3 years ago

I can confirm that as of the latest commit, my code using a pretrained Bluebert model works without running into the issue! Thanks for the fix!

davidmezzetti commented 3 years ago

Great, glad to hear it, appreciate your help on this!