vintasoftware / entity-embed

PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
https://entity-embed.readthedocs.io/en/latest/
MIT License

How to use custom word embeddings? #36

Open havardox opened 1 year ago

havardox commented 1 year ago

Is it possible to use custom pre-trained word embeddings? The current ones all seem to be in English and I want to load embeddings in other languages.

Relevant section of docs: https://entity-embed.readthedocs.io/en/latest/guide/field_types.html#semantic-fields

fjsj commented 1 year ago

Yes, it is, if you get embeddings compatible with torchtext's Vocab. The version of torchtext the project uses is not the latest one; I think that's the main caveat.
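
A minimal sketch of what loading custom vectors with torchtext's `Vectors` class could look like, assuming a plain-text embedding file in word2vec/GloVe format (one `token v1 v2 ... vN` line per word). The file name below is hypothetical, and the exact import path may differ depending on the torchtext version pinned by the project:

```python
from torchtext.vocab import Vectors

# Hypothetical non-English embedding file, e.g. fastText vectors exported as text.
custom_vectors = Vectors(
    name="cc.et.300.vec",    # path to the local embedding file (hypothetical name)
    cache=".vector_cache",   # directory where torchtext caches the parsed tensors
)

# custom_vectors.stoi maps tokens to row indices,
# custom_vectors.vectors is a (vocab_size, dim) float tensor.
print(custom_vectors.vectors.shape)
```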

Also note the semantic embedding part of the model is mostly straightforward: https://github.com/vintasoftware/entity-embed/blob/1bd9223c89aa451a48726258a95fa0ac1c089bb5/entity_embed/models.py#L55-L66

You can swap that out for your own semantic embedding and train as you wish.
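
To illustrate the kind of module one might swap in, here is a minimal sketch of a semantic embedding layer built from pretrained vectors. This is not entity-embed's implementation (see the linked `models.py` lines for that); it only shows a generic replacement that averages the word vectors of a field's tokens, assuming you already have a `(vocab_size, dim)` tensor such as the `.vectors` attribute from the torchtext example above:

```python
import torch
import torch.nn as nn


class CustomSemanticEmbed(nn.Module):
    """Hypothetical drop-in semantic embedding backed by pretrained word vectors."""

    def __init__(self, pretrained_vectors: torch.Tensor, freeze: bool = True):
        super().__init__()
        # EmbeddingBag with mode="mean" averages the vectors of all tokens in a field.
        self.embedding = nn.EmbeddingBag.from_pretrained(
            pretrained_vectors, freeze=freeze, mode="mean"
        )

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        # token_ids: 1D tensor of token indices for a batch of fields,
        # offsets: 1D tensor marking where each field's tokens start.
        return self.embedding(token_ids, offsets)
```

You would still need to wire such a module into entity-embed's model and tokenization pipeline, which is where the torchtext version compatibility mentioned above comes into play.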