Closed riow1983 closed 1 year ago
I think your assumption sounds right, but keep in mind that `get_features` was originally designed for image data; the BERT model was added later and has not been updated since 2019/2020, so there is no guarantee it all works as expected.
I understand. Thank you for your reply.
I'd like to file a feature request for a supported method to do this.
For now I just wrote `x = np.mean(x, axis=1)` to get a mean-pooled embedding. However, to be more precise I would need the attention masks to exclude [PAD] tokens from the mean calculation.
In PyTorch, for example, you would do:
`x = (x * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(1, keepdim=True)`
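For anyone doing the pooling in NumPy (as in the `np.mean` snippet above), a masked-mean sketch looks like this; the array contents here are made up purely for illustration:

```python
import numpy as np

# x: token embeddings, shape (batch, seq_len, hidden) -- illustrative values
x = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])  # one doc, 3 tokens
# attention_mask: 1 for real tokens, 0 for [PAD]
attention_mask = np.array([[1, 1, 0]])                 # last token is padding

mask = attention_mask[..., np.newaxis]                 # (batch, seq_len, 1)
pooled = (x * mask).sum(axis=1) / mask.sum(axis=1)     # (batch, hidden)
# pooled excludes the [PAD] row from the average;
# a plain np.mean(x, axis=1) would include it
print(pooled)  # [[2. 3.]]
```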
I hope this masked-pooling equivalent, along with a "transformer-friendly" method for getting intermediate embeddings, will be implemented in the near future.
Thank you for the suggestion. We're currently in the process of switching the underlying engine. After we complete that, we'll consider what new features to support.
I want to confirm the way to get "BERT embedding" of each document using DLPy.
If I understand correctly, a BERT model compiled by DLPy inherits from DLPy's model class, which has a `get_features` method. One can get an intermediate layer's outputs via `get_features` (https://sassoftware.github.io/python-dlpy/generated/dlpy.model.Model.get_features.html), which means one can also get BERT embeddings via the same method. Then, I assume `x` can be interpreted as the BERT embedding of each document. Correct me if I'm missing something.
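As a shape sanity check (pure NumPy, with a random array standing in for the `get_features` output, since the exact DLPy return format may differ): if `x` holds per-token embeddings of shape `(n_docs, seq_len, hidden)`, mean pooling over axis 1 yields one vector per document.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, seq_len, hidden = 4, 128, 768  # 768 is BERT-base's hidden size
# stand-in for intermediate-layer features returned for each document
x = rng.standard_normal((n_docs, seq_len, hidden))

doc_embeddings = np.mean(x, axis=1)    # average over the token axis
print(doc_embeddings.shape)            # (4, 768): one embedding per document
```

Note this unmasked mean still averages over any [PAD] positions, which is exactly the caveat raised above.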