jxmorris12 / cde

code for training & evaluating Contextual Document Embedding models
MIT License

Same embeddings irrespective of context? #2

Open surabhisnath opened 2 weeks ago

surabhisnath commented 2 weeks ago

Hi,

I am trying to get contextual text embeddings: for example, the embedding of the string "cat" under several different contexts, such as (1) "pets" and (2) "nuclear physics". I want to investigate how the distances between strings change with the context. For instance, we would expect the distance between "cat" and "dog" in the context of pets to differ from their distance in the context of nuclear physics.
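To make the comparison concrete, here is a minimal sketch of the distance measurement described above. It assumes cosine distance is the metric of interest (the post does not name one), and the `toy` vectors are hypothetical placeholders standing in for real embeddings, not model output.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy embeddings (NOT produced by the model), keyed by context
# name and then by string, chosen so that the two contexts disagree.
toy = {
    "pet": {"cat": [1.0, 0.0], "dog": [1.0, 0.0]},
    "nuclearphysics": {"cat": [1.0, 0.0], "dog": [0.0, 1.0]},
}

# One "cat"-"dog" distance per context; with a context-sensitive model
# these values would be expected to differ.
distances = {
    context: cosine_distance(emb["cat"], emb["dog"])
    for context, emb in toy.items()
}
```

With real embeddings, the same dictionary comprehension would give one "cat"/"dog" distance per context, which can then be compared directly.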

I tried using your model with various context texts to get dataset_embeddings, and then embedded my strings to obtain doc_embeddings under each set of dataset_embeddings, as follows:

wikipedia_contexts = {
    "pet": "A pet, or companion animal, is an animal kept primarily for a person's company or entertainment rather than as a working animal, livestock, or a laboratory animal.",
    "nuclearphysics": "Nuclear physics is the field of physics that studies atomic nuclei and their constituents and interactions, in addition to the study of other forms of nuclear matter.",
}

for contextname, contexttext in wikipedia_contexts.items():
    # 3. First stage: embed the context docs
    dataset_embeddings = model.encode(
        [contexttext],
        prompt_name="document",
        convert_to_tensor=True,
    )

    # 4. Second stage: embed the docs
    doc_embeddings = model.encode(
        textset,
        prompt_name="document",
        dataset_embeddings=dataset_embeddings,
        convert_to_tensor=True,
    )

However, I find that the doc_embeddings are exactly the same in every case, i.e., they do not change with dataset_embeddings.
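One way to confirm this observation numerically is sketched below. It assumes the embeddings for each context have been converted to nested Python lists (with PyTorch tensors, `.cpu().tolist()` would do that); the helper names `max_abs_difference` and `contexts_collapse` and the tolerance `atol` are hypothetical, not part of the library.

```python
def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("embedding matrices must have the same shape")
    return max(
        (abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
        default=0.0,
    )


def contexts_collapse(embeddings_by_context, atol=1e-6):
    """Return True if every context yields the same embeddings within atol.

    embeddings_by_context maps a context name to a 2-D list with one row
    per embedded string.
    """
    names = list(embeddings_by_context)
    reference = embeddings_by_context[names[0]]
    return all(
        max_abs_difference(reference, embeddings_by_context[name]) <= atol
        for name in names[1:]
    )


# Hypothetical placeholder values (NOT model output) illustrating both cases.
identical = {"pet": [[0.1, 0.2]], "nuclearphysics": [[0.1, 0.2]]}
different = {"pet": [[0.1, 0.2]], "nuclearphysics": [[0.3, 0.2]]}

collapsed_identical = contexts_collapse(identical)
collapsed_different = contexts_collapse(different)
```

Collecting doc_embeddings into a dictionary keyed by contextname inside the loop and passing it to a check like this would report whether the context has any effect at all.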

Is that expected or am I doing something wrong here? How else could I achieve the behaviour I expect with your model?

Thanks!

jxmorris12 commented 2 weeks ago

@surabhisnath This is clearly a bug in the HuggingFace port of the model. I'm going to investigate and fix it for you. In the meantime, if you load the model through this library it will work correctly!

surabhisnath commented 2 weeks ago

Thank you. Sounds great. Please let me know when it is fixed :)