huggingface / swift-coreml-transformers

Swift Core ML 3 implementations of GPT-2, DistilGPT-2, BERT, and DistilBERT for question answering. Other Transformers coming soon!
Apache License 2.0

COREML BERT Crashing on long text #6

Open heysaik opened 5 years ago

heysaik commented 5 years ago

For documents with lots of words, BERT ends up crashing, outputting the error:

```
Fatal error: 'try!' expression unexpectedly raised an error: App.TokenizerError.tooLong("Token indices sequence length is longer than the specified maximum\nsequence length for this BERT model (784 > 512. Running this\nsequence through BERT will result in indexing errors\".format(len(ids), self.max_len)")
```

How do you solve this, or is BERT only usable for paragraphs with fewer words? Can we increase the maxLen to 1024 or even 2048, or would that not work?

julien-c commented 5 years ago

Increasing the maxLen wouldn't work, as the maximum sequence length is a property of the model itself (BERT's positional embeddings are only trained for sequences of up to 512 tokens).

One way to work around this would be to split your paragraph into slices of up to maxLen, potentially overlapping.
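A minimal sketch of that overlapping-slice idea, not part of the repo's API: it splits a token array into windows of at most `windowSize` tokens, each starting `stride` tokens after the previous one (choosing `stride < windowSize` produces the overlap). The function and parameter names here are hypothetical.

```swift
// Split `tokens` into slices of at most `windowSize`, advancing by
// `stride` tokens each time so consecutive slices overlap when
// stride < windowSize. Hypothetical helper, not the repo's API.
func slidingWindows(tokens: [Int], windowSize: Int, stride: Int) -> [[Int]] {
    guard tokens.count > windowSize else { return [tokens] }
    var windows: [[Int]] = []
    var start = 0
    while start < tokens.count {
        let end = min(start + windowSize, tokens.count)
        windows.append(Array(tokens[start..<end]))
        if end == tokens.count { break }  // last slice reached the end
        start += stride
    }
    return windows
}
```

Each slice is then short enough to run through the model separately; the question tokens would be prepended to every slice in a real QA setup.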

heysaik commented 5 years ago

If I do that, then won't I get a bunch of answers for a particular question, one per slice? How would I know which answer to choose?

julien-c commented 5 years ago

You can just compare the output logit values across slices and take the max.
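A sketch of that selection step, under the assumption that each slice yields a best answer span along with its combined start + end logit score; the `WindowAnswer` type and `bestAnswer` helper are hypothetical, not part of this repo.

```swift
// Hypothetical per-slice result: the best span's combined
// start + end logit, plus the decoded answer text.
struct WindowAnswer {
    let score: Double   // e.g. startLogit + endLogit of the best span
    let answer: String
}

// Pick the answer whose span score is highest across all slices.
func bestAnswer(from results: [WindowAnswer]) -> WindowAnswer? {
    results.max(by: { $0.score < $1.score })
}
```

Note that logits from different forward passes are not strictly comparable probabilities, but taking the max is the common heuristic for long-document QA.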

heysaik commented 5 years ago

How do you get these values? prediction only outputs start, end, tokens, and answer.

Sorry for all the questions, I'm not a huge expert in the neural nets of machine learning. 😅

julien-c commented 5 years ago

Hmm, yeah, you would need to dive into the code and implement it. It's not going to work out of the box unfortunately.

mbalfakeih commented 5 years ago

Has anyone made a method for doing this? I have looked online and have been unable to find anything.