Thank you for reporting this issue!
Can you share the transformer model you are using for this test?
That is using:
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "scoring": "bm25"})
Thank you. It looks like sentence-transformers 1.x no longer defaults tokenizer.max_seq_length, which is fine for models that have model_max_length set. The model here does not, hence the error.
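A quick way to confirm whether a model sets this (standard Hugging Face tokenizers API, nothing txtai-specific):
from transformers import AutoTokenizer

# Models without model_max_length fall back to a huge sentinel value,
# so inputs are never truncated and can overflow the position embeddings
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
print(tokenizer.model_max_length)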
I will add an additional config option to optionally set the tokenizer.max_seq_length to address this issue.
In the meantime, you have a couple of options, such as downgrading with pip install sentence-transformers==0.4.1.
After the fix it will look something like this:
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})
Thank you, you're absolutely right! Downgrading sentence-transformers to 0.4.1 did the trick for now!
Just committed a fix for this. You can now also try installing the latest master branch:
pip install git+https://github.com/neuml/txtai
Otherwise, it will be in txtai 3.0.
Actually, pip install git+https://github.com/neuml/txtai still gives me the error, whether I used pip install sentence-transformers==0.4.1 or the most recent version of sentence-transformers. I think there may be an issue with the fix you committed? The only combination that has worked for me is the PyPI version of txtai with sentence-transformers 0.4.1.
Did you run it like this?
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})
The sentence-transformers change is actually an improvement, as defaulting to 128 tokens isn't the best strategy. But it does require an extra parameter if the model doesn't have model_max_length set.
You're right, it does work if I include it like that. However, it doesn't work for the Similarity pipeline: if I use Similarity("gsarti/scibert-nli") or Similarity("bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16"), I get the same error.
Thank you for confirming this works for Embeddings.
The Similarity problem has the same symptom but is separate from this issue. Each pipeline needs to pass a kwargs of {truncation: True} when not otherwise specified. I've added #79 to address this.
Any pointers on how to do this? I've noticed that, strangely, Similarity("gsarti/covidbert-nli") works fine in my code but Similarity("gsarti/scibert-nli") does not, despite both being BERT models with "max_position_embeddings": 512 in their config.json.
I had tried copying what you did in the Summary pipeline to labels.py in my fork:
kwargs = {"truncation": True, "max_length": 512, "max_seq_length": 512}
# Run ZSL pipeline
results = self.pipeline(text, labels, multi_label=multilabel, **kwargs)
with seemingly no change in the result.
It's really about the tokenizer. When a model doesn't set model_max_length, transformers falls back to a huge sentinel value, so nothing is ever truncated:
>>> tokenizer = AutoTokenizer.from_pretrained("gsarti/scibert-nli")
>>> tokenizer.model_max_length
1000000000000000019884624838656
Something that may work for your local testing:
kwargs = {"truncation": True}
# Run ZSL pipeline
self.pipeline.tokenizer.model_max_length = 512
results = self.pipeline(text, labels, multi_label=multilabel, **kwargs)
I think transformers is assuming this field is always set on the tokenizer. Currently, max_length can't be overridden in calls to the tokenizer for Hugging Face pipelines. It may be a worthwhile upstream change in transformers to allow that field to be set. https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/base.py#L638
Going to reopen this issue. I think I can make this less manual. I plan to remove the maxlength parameter I just added and instead copy the max_position_embeddings config parameter when the tokenizer doesn't have model_max_length set.
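A rough sketch of that fallback logic (resolve_maxlength and its sentinel check are illustrative, not the actual txtai code; model_max_length and max_position_embeddings are standard Hugging Face fields):
from transformers import AutoConfig, AutoTokenizer

def resolve_maxlength(path):
    # Prefer the tokenizer's model_max_length when it is a real value
    tokenizer = AutoTokenizer.from_pretrained(path)
    if tokenizer.model_max_length and tokenizer.model_max_length < int(1e30):
        return tokenizer.model_max_length

    # Otherwise fall back to the model config's positional limit
    config = AutoConfig.from_pretrained(path)
    return getattr(config, "max_position_embeddings", None)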
Just committed a fix that should address both embeddings and pipelines. The maxlength parameter is no longer needed; it will take the max_position_embeddings config parameter when it's not detected in the tokenizer.
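If you want to sanity-check the new behavior from master, something like this should exercise the code path (the import and the (id, text, tags) document tuples follow the usual txtai examples; treat it as a sketch):
from txtai.embeddings import Embeddings

# No maxlength needed: it is now derived from max_position_embeddings
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Index text longer than 512 tokens to confirm truncation kicks in
data = ["lorem ipsum " * 1000]
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])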
I can confirm that as of the latest commit, my code using a pretrained Bluebert model works without running into the issue! Thanks for the fix!
Great, glad to hear it, appreciate your help on this!
Hello, when I try to run the indexing step, I get this error.
Where to_index =
How do I fix this? I don't see anything in the documentation about this. I assume the error message:
is related, and I need to set a max length of 512 so that any documents longer than 512 tokens get truncated, but I don't see anywhere how to do that...
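For reference, earlier comments in this thread set the same 512-token cap through the Embeddings config; a sketch using the model path from above (the maintainer later removed maxlength in favor of reading max_position_embeddings automatically, so this only applies to versions in between):
from txtai.embeddings import Embeddings

# maxlength truncates documents to 512 tokens during tokenization
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})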