koaning / embetter

just a bunch of useful embeddings
https://koaning.github.io/embetter/
MIT License
465 stars 15 forks source link

[BUG] `device` should be attribute on `SentenceEncoder` #33

Closed CarloLepelaars closed 1 year ago

CarloLepelaars commented 1 year ago

The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction: Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    .
    .
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
koaning commented 1 year ago

Feel free to make a PR. It might make sense to have the actual reference to the torch object to be self._device if that makes the __repr__ look better.

CarloLepelaars commented 1 year ago

Cool! I made PR #34 which solves this and adds some tests for SentenceEncoder. Defining self._device additionally does not seem necessary as __repr__ output is already clean.

As for testing text embedders it looks like Sense2VecEncoder is a lot harder to test since it depends on loading a file from disk. Any ideas to test that or do you think its not necessary?

koaning commented 1 year ago

There's no harm in adding 'self.path' I think. Feel free to make that PR as well!

CarloLepelaars commented 1 year ago

There's no harm in adding 'self.path' I think. Feel free to make that PR as well!

Cool! Created #35 for the Sense2Vec attribute fix.