Hello. May I ask if there is a way to extract word embeddings using multiple cores?
Right now I'm computing the word-embedding representation for the 20 newsgroups dataset, and it still takes quite a while to get through the whole dataset.
For reference, this is my current function:
from typing import List, Union

import numpy as np
import pymagnitude


def extract_sentence_embeddings(
    texts: Union[str, List[str]], batch_size: int = 2048
) -> np.ndarray:
    """
    Returns the sentence embeddings for the input texts.

    Parameters
    ----------
    texts: str or List[str]
        The input text(s) to vectorize.
    batch_size: int
        The mini-batch size to use for computation.

    Returns
    -------
    vectors: np.ndarray
        The sentence embeddings representation for the input texts.
    """
    vectorizer = pymagnitude.Magnitude("data/glove.840B.300d.magnitude")
    if isinstance(texts, str):
        # Single text: average its word vectors into one sentence vector.
        vectors = vectorizer.query(texts.split())
        return np.mean(vectors, axis=0)
    elif isinstance(texts, list):
        vectors = []
        # Process the texts in mini-batches so the 300D vectors for the
        # whole dataset never have to sit in memory at once.
        for offset in range(0, len(texts), batch_size):
            batch = [
                # Use a pair of empty tokens for blank documents so
                # pymagnitude still returns a vector row for them.
                ["", ""] if len(text.split()) == 0 else text.split()
                for text in texts[offset : offset + batch_size]
            ]
            vector = vectorizer.query(batch)
            # Average over the token axis to get one vector per document.
            vector = np.mean(vector, axis=1)
            vectors.append(vector)
        # Stack the per-batch results into a single (n_texts, 300) array.
        return np.concatenate(vectors, axis=0)
Since I'm using 300D vectors, memory can easily be exhausted, which is why I opted for batching the text data.
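For completeness, this is roughly how I'm calling the function on the 20 newsgroups data (just a minimal sketch; I happen to load the dataset with scikit-learn's fetch_20newsgroups, but the loading step itself isn't the slow part):

from sklearn.datasets import fetch_20newsgroups

# Load the raw 20 newsgroups documents as a list of strings.
newsgroups = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)

# This is the single-core step that takes a long time.
embeddings = extract_sentence_embeddings(newsgroups.data, batch_size=2048)
print(embeddings.shape)  # expected: (n_documents, 300)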
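In case it helps to see what I mean by using multiple cores, something along the lines of the sketch below is what I have in mind (untested; extract_parallel, _embed_chunk, n_workers, and chunk_size are just placeholder names, and each worker would end up opening its own Magnitude reader, which may or may not be sensible):

import multiprocessing as mp

import numpy as np


def _embed_chunk(chunk):
    # Each worker runs the existing function on its own chunk of texts,
    # opening a separate Magnitude reader inside the worker process.
    return extract_sentence_embeddings(chunk, batch_size=2048)


def extract_parallel(texts, n_workers=4, chunk_size=4096):
    # Split the documents into chunks and embed them in parallel.
    chunks = [texts[i : i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with mp.Pool(processes=n_workers) as pool:
        results = pool.map(_embed_chunk, chunks)
    return np.concatenate(results, axis=0)

Is something like this the recommended way, or does pymagnitude provide a better built-in option?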
Looking forward to your response! Thank you!