plasticityai / magnitude

A fast, efficient universal vector embedding utility package.

Multithreading for embeddings extraction #81

Open AFAgarap opened 3 years ago

AFAgarap commented 3 years ago

Hello. May I ask if there is a way to extract word embeddings using multiple cores? Right now, I'm computing the word-embedding representations for the 20 newsgroups dataset, and it still takes a while to get through the whole corpus. Thank you.

For reference, this is my current function:

from typing import List, Union

import numpy as np
import pymagnitude


def extract_sentence_embeddings(
    texts: Union[str, List[str]], batch_size: int = 2048
) -> np.ndarray:
    """
    Returns the sentence embeddings for the input texts.

    Parameters
    ----------
    texts: Union[str, List[str]]
        The input text(s) to vectorize.
    batch_size: int
        The mini-batch size to use for computation.

    Returns
    -------
    vectors: np.ndarray
        The sentence embeddings representation for the input texts.
    """
    vectorizer = pymagnitude.Magnitude("data/glove.840B.300d.magnitude")
    if isinstance(texts, str):
        # Single text: average its word vectors into one 300-d vector.
        vectors = vectorizer.query(texts.split())
        return np.mean(vectors, axis=0)
    elif isinstance(texts, list):
        vectors = []
        # Walk the corpus in mini-batches so the full (batch, words, 300)
        # tensor never has to sit in memory at once; stepping by batch_size
        # also covers the final partial batch.
        for offset in range(0, len(texts), batch_size):
            batch = [
                # Fall back to a placeholder so query() never receives an
                # empty token list for blank texts.
                text.split() or ["", ""]
                for text in texts[offset : offset + batch_size]
            ]
            vector = vectorizer.query(batch)
            # Mean over the word axis: one vector per sentence.
            vectors.append(np.mean(vector, axis=1))
        # Stack the per-batch results into one (num_texts, 300) array.
        return np.concatenate(vectors, axis=0)
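
For context, a minimal driver, assuming the corpus is loaded with scikit-learn's fetch_20newsgroups (the subset and remove arguments here are just illustrative):

from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups training split, stripping metadata so only
# the message bodies get embedded.
newsgroups = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes")
)
embeddings = extract_sentence_embeddings(newsgroups.data, batch_size=2048)
print(embeddings.shape)  # (len(newsgroups.data), 300)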

Since I'm using 300D vectors, memory can easily be exhausted, which is why I opted for batching the text data.
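
One direction I have been considering is to fan the batches out to worker processes, each opening its own reader over the memory-mapped .magnitude file. This is only a rough sketch under my own assumptions (the helpers _init_worker / _embed_chunk and the pool size are mine, and I haven't verified that concurrent readers on the same .magnitude file are officially supported):

from multiprocessing import Pool

import numpy as np
import pymagnitude

_vectorizer = None


def _init_worker(path):
    # Runs once per worker process; each process opens its own handle
    # to the memory-mapped .magnitude file (assumed safe for read-only
    # concurrent access).
    global _vectorizer
    _vectorizer = pymagnitude.Magnitude(path)


def _embed_chunk(texts):
    # Same per-batch logic as above: tokenize, query, then average
    # over the word axis to get one 300-d vector per text.
    tokens = [text.split() or ["", ""] for text in texts]
    return np.mean(_vectorizer.query(tokens), axis=1)


def extract_parallel(texts, batch_size=2048, processes=4):
    chunks = [
        texts[offset : offset + batch_size]
        for offset in range(0, len(texts), batch_size)
    ]
    with Pool(
        processes=processes,
        initializer=_init_worker,
        initargs=("data/glove.840B.300d.magnitude",),
    ) as pool:
        # Each chunk is embedded in a separate process; concatenate the
        # results back into one (num_texts, 300) array.
        return np.concatenate(pool.map(_embed_chunk, chunks), axis=0)

Would something along these lines be safe with pymagnitude, or is there a built-in way to do this?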

Looking forward to your response! Thank you!