Closed by VibhuJawa 3 months ago
Another data point for this is the max_sequence_length issue with the facebook/opt-125m model, where model_max_length is set to an incorrect sentinel value (1000000000000000019884624838656) according to the linked HuggingFace PR. This appears to be a configuration error on HuggingFace's side, not an issue with our code.
After some research, we couldn't find a standardized way to determine max_sequence_length across models. As a workaround, I suggest adding an optional max_sequence_length configuration to our semdedup code, defaulting to 512.
For more context:
- Discussion on determining max model length on Stack Overflow
- Related discussion on HuggingFace
We should make max_sequence_length non-optional for the HF-Model class, since it is required here: https://github.com/rapidsai/crossfit/blob/1ee3de4af3aa543ded1041c2231ad01476b33103/crossfit/backend/torch/hf/model.py#L82-L118
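As a sketch of what "non-optional" could look like, assuming a simplified stand-in for the HFModel constructor (the class body and validation below are hypothetical, not crossfit's actual implementation):

```python
class HFModel:
    """Simplified stand-in illustrating a required max_seq_length argument."""

    def __init__(self, path_or_name: str, max_seq_length: int):
        # Required, positional-friendly argument: callers must supply an
        # explicit positive limit instead of relying on model metadata.
        if not isinstance(max_seq_length, int) or max_seq_length <= 0:
            raise ValueError("max_seq_length must be a positive integer")
        self.path_or_name = path_or_name
        self.max_seq_length = max_seq_length
```

Making the argument required surfaces the bad-metadata problem at construction time rather than deep inside the tokenization path linked above.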