keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

Preprocessor does not respect sequence_length #1627

Closed: 52631 closed this 1 month ago

52631 commented 1 month ago

Describe the bug
If I initialize a preprocessor from a preset, it does not respect the specified sequence length.

To Reproduce
In keras-nlp==0.11.1, the preprocessor defaults to a sequence length of 512 regardless of the specified value:

import keras_nlp

keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)("The quick brown fox jumped.")
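A quick way to see the bug is to inspect the output shape (a minimal sketch; the (512,) shape is inferred from the report's description of the default, not copied from an actual run):

import keras_nlp

preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    'bert_tiny_en_uncased', sequence_length=16
)
outputs = preprocessor("The quick brown fox jumped.")
# Under the bug the preprocessor falls back to its default length,
# so this prints (512,); once fixed it should print (16,).
print(outputs['token_ids'].shape)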

Expected behavior
In keras-nlp==0.8.2, the preprocessor respects the specified length:

{'token_ids': <tf.Tensor: shape=(16,), dtype=int32, numpy=
 array([ 101, 1996, 4248, 2829, 4419, 5598, 1012,  102,    0,    0,    0,
           0,    0,    0,    0,    0], dtype=int32)>,
 'segment_ids': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(16,), dtype=bool, numpy=
 array([ True,  True,  True,  True,  True,  True,  True,  True, False,
        False, False, False, False, False, False, False])>}

Additional context
In my case, this surfaced as a large performance hit when migrating code to the latest version. The penalty may be more subtle depending on how the desired sequence length compares to the default value.
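The performance angle can be checked with a rough timing comparison (a sketch; the list-of-strings batch input and the direct attribute override are assumptions based on the workaround below, and absolute timings will vary):

import time
import keras_nlp

preprocessor = keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased')
sentences = ["The quick brown fox jumped."] * 1000

for length in (512, 16):
    # Override the attribute directly, as in the workaround below.
    preprocessor.sequence_length = length
    start = time.perf_counter()
    preprocessor(sentences)
    print(length, time.perf_counter() - start)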

It seems the workaround is to override the sequence length after initializing:

import keras_nlp

# Passing sequence_length to from_preset is ignored under the bug,
# so set the attribute directly after construction.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)
preprocessor.sequence_length = 16
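For convenience, the override can be wrapped in a small helper (a hypothetical make_preprocessor sketch, not part of the keras-nlp API; the (16,) shape matches the expected output above):

def make_preprocessor(preset, sequence_length):
    preprocessor = keras_nlp.models.BertPreprocessor.from_preset(preset)
    # Set the attribute directly, since the constructor kwarg is ignored.
    preprocessor.sequence_length = sequence_length
    return preprocessor

preprocessor = make_preprocessor('bert_tiny_en_uncased', 16)
print(preprocessor("The quick brown fox jumped.")['token_ids'].shape)  # (16,)
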
SamanehSaadat commented 1 month ago

Thanks for reporting this issue! I'll look into this!

SamanehSaadat commented 1 month ago

This issue is fixed in #1632.