rapidsai / crossfit

Metric calculation library
Apache License 2.0

[IMP] Make `max_sequence_length` non optional for HF-Model Class #54

Closed VibhuJawa closed 3 months ago

VibhuJawa commented 5 months ago

We should make `max_sequence_length` non-optional for the HF-Model class, as it's needed here:

https://github.com/rapidsai/crossfit/blob/1ee3de4af3aa543ded1041c2231ad01476b33103/crossfit/backend/torch/hf/model.py#L82-L118

https://github.com/rapidsai/crossfit/blob/1ee3de4af3aa543ded1041c2231ad01476b33103/crossfit/backend/torch/hf/model.py#L82

VibhuJawa commented 4 months ago

Another data point for this is the `max_sequence_length` issue with the facebook-opt-123m model, where `model_max_length` is set to an incorrect value (1000000000000000019884624838656), according to the HuggingFace PR. This seems to be a configuration error on HuggingFace's end, not an issue with our code.

We couldn't find a standardized method to determine max_sequence_length after some research. As a workaround, I suggest implementing an optional configuration for max_sequence_length in our semdedup code, defaulting to 512.
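A minimal sketch of what that workaround could look like, assuming we get the tokenizer-reported length passed in as a plain integer. The function name `resolve_max_sequence_length` and the `PLAUSIBLE_LIMIT` cutoff are hypothetical and not part of crossfit or semdedup; the 512 default is the one suggested above.

```python
from typing import Optional

# Hypothetical cutoff: any reported length above this is treated as unset or
# misconfigured. The value quoted in this thread equals int(1e30), which is far above it.
PLAUSIBLE_LIMIT = 1_000_000
DEFAULT_MAX_SEQUENCE_LENGTH = 512


def resolve_max_sequence_length(
    reported: Optional[int],
    override: Optional[int] = None,
    default: int = DEFAULT_MAX_SEQUENCE_LENGTH,
) -> int:
    """Pick a usable max sequence length.

    Order of preference: explicit user override, then the model-reported
    value if it looks plausible, then the default.
    """
    if override is not None:
        if override <= 0:
            raise ValueError("max_sequence_length must be a positive integer")
        return override
    if reported is None or reported <= 0 or reported > PLAUSIBLE_LIMIT:
        return default
    return reported
```

With this, a caller that sets nothing gets 512 for a model reporting the bogus value, while an explicit setting always wins. The cutoff is a judgment call and may need adjusting for long-context models.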

For more context:

- Discussion on determining max model length on Stack Overflow
- Related discussion on HuggingFace

VibhuJawa commented 3 months ago

Fixed by: https://github.com/rapidsai/crossfit/commit/0cc29937115f9858272c51e8faffe31155915901