koaning / tokenwiser

Bag of, not words, but tricks!
https://koaning.github.io/tokenwiser/
Apache License 2.0

`BytePairLanguage.transform` returns an array of `nan` when transforming an empty string #52

Closed: MBrouns closed this issue 3 years ago

MBrouns commented 3 years ago

Not sure if you like the current behaviour, but it can be slightly annoying because downstream models can break on the `nan` values. I ran into this because lime perturbs the input text quite aggressively, and the perturbed text is sometimes an empty string.

>>> from whatlies.language import BytePairLanguage
>>> BytePairLanguage('nl').fit(['foo', 'bar']).transform([""])

array([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan]], dtype=float32)
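
Until the behaviour changes, one possible workaround is to clean up the embedding matrix before it reaches the next model. The snippet below is only a sketch using plain NumPy: the helper name `fill_nan_rows` is hypothetical and not part of whatlies or tokenwiser, the small array is a stand-in for the real 100-dimensional output, and replacing the row with zeros is an assumption about what a downstream model would tolerate.

```python
import numpy as np


def fill_nan_rows(X, fill_value=0.0):
    """Return a copy of X in which every row containing NaN is set to fill_value.

    Hypothetical helper for illustration; not part of whatlies or tokenwiser.
    """
    X = np.array(X, dtype=np.float32, copy=True)  # work on a copy
    bad_rows = np.isnan(X).any(axis=1)            # rows with at least one NaN
    X[bad_rows] = fill_value
    return X


# Stand-in for the result of transform(["foo", ""]), using 4 dimensions
# instead of 100 to keep the example short.
embedded = np.array(
    [[0.1, 0.2, 0.3, 0.4],
     [np.nan, np.nan, np.nan, np.nan]],
    dtype=np.float32,
)

cleaned = fill_nan_rows(embedded)
```

In a scikit-learn pipeline, a helper like this could be wrapped in `FunctionTransformer` and placed directly after the language step, so that models further down never see `nan`.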