langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The value of embedding is stored in Redis, and when it is retrieved, it becomes float64. #7213

Open jingk-1992 opened 2 months ago

jingk-1992 commented 2 months ago

Dify version

v0.6.16

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Background: the vector store is set to Chroma. Steps:

  1. Run a retrieval test against the knowledge base with the query "The Little Match Girl". The first query returns the recalled content, and the code stores the query embedding in Redis under a cache key.
  2. Run the same query, "The Little Match Girl", a second time. This time the embedding is read back from Redis, but after retrieval its data type is float64.
  3. Querying Chroma with this float64 embedding raises an error during json.dumps.

✔️ Expected Behavior

The data stored in Redis is of the float type, and it should also be of the float type when retrieved, not the float64 type.

❌ Actual Behavior

The data stored in Redis is of the float type, but when retrieved, it turns out to be of the float64 type.

dosubot[bot] commented 2 months ago

Hey @jingk-1992! 👋 I'm here to help you with any bugs, questions, or contributions you might have for the project. Let me know how I can assist you!

The issue you're encountering is due to the data type conversion when retrieving embeddings from Redis. The embeddings are stored as base64-encoded byte strings and are converted back to numpy float64 arrays upon retrieval. This type conversion is causing the TypeError during JSON serialization.
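A minimal sketch of that round trip, assuming the cache stores float64 bytes as base64 text. The thread does not include the traceback, so which object actually reached json.dumps is an assumption here; the vector values and the `query_embeddings` payload key are illustrative only.

```python
import base64
import json

import numpy as np

# Small stand-in for a real embedding vector (illustrative values).
original = [0.25, -0.5, 0.75]

# Write path: float64 bytes -> base64 -> str.
encoded = base64.b64encode(np.array(original).tobytes()).decode("utf-8")

# Read path: base64 -> bytes -> numpy array (dtype "float" means float64).
decoded = np.frombuffer(base64.b64decode(encoded), dtype="float")
print(type(decoded).__name__, decoded.dtype)

# Passing the array itself to json.dumps fails, because the standard
# encoder has no rule for numpy.ndarray.
try:
    json.dumps({"query_embeddings": [decoded]})  # hypothetical payload shape
    serialized_ok = True
except TypeError as exc:
    serialized_ok = False
    print("json.dumps failed:", exc)
```

The values survive the round trip unchanged; only the container type differs from what was written.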

To address this, you can modify the retrieval process to ensure the embeddings are converted to native Python float types before being used. Here's a potential solution:

  1. Retrieve the embedding from Redis:

    embedding = redis_client.get(embedding_cache_key)
    if embedding:
        redis_client.expire(embedding_cache_key, 600)
        embedding_array = np.frombuffer(base64.b64decode(embedding), dtype="float")
        embedding_list = embedding_array.tolist()  # Convert to native Python list of floats
        return embedding_list
  2. Ensure the embedding is a list of native Python floats:

    embedding_results = (embedding_results / np.linalg.norm(embedding_results)).tolist()

By converting the numpy array to a native Python list of floats using .tolist(), you can avoid the TypeError during JSON serialization.
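As a rough illustration of why .tolist() is the relevant call, the sketch below normalizes a toy vector and compares .tolist() with the built-in list(). The vector is made up for the example; only NumPy and the standard library are used.

```python
import json

import numpy as np

raw = np.array([3.0, 4.0])         # illustrative, unnormalized vector
unit = raw / np.linalg.norm(raw)   # L2-normalize so the vector has length 1

as_list = unit.tolist()            # elements are converted to built-in float
as_iter = list(unit)               # elements remain numpy.float64 scalars

print(type(as_list[0]).__name__)
print(type(as_iter[0]).__name__)

# The .tolist() result contains only built-in types, so it serializes cleanly.
payload = json.dumps(as_list)
```

list() only iterates over the array and keeps NumPy scalar types, whereas .tolist() converts every element to the nearest built-in Python type.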

Here is the updated code snippet for the embed_query method:

import base64
import logging

import numpy as np

# helper, redis_client, db and IntegrityError come from the surrounding Dify module


def embed_query(self, text: str) -> list[float]:
    """Embed query text."""
    # use the cached query embedding if present, otherwise compute and store it
    text_hash = helper.generate_text_hash(text)
    embedding_cache_key = f'{self._model_instance.provider}_{self._model_instance.model}_{text_hash}'
    embedding = redis_client.get(embedding_cache_key)
    if embedding:
        # cache hit: refresh the TTL, then decode base64 -> float64 bytes -> list
        redis_client.expire(embedding_cache_key, 600)
        embedding_array = np.frombuffer(base64.b64decode(embedding), dtype="float")
        return embedding_array.tolist()  # Convert to native Python list of floats

    # cache miss: compute the embedding, L2-normalize it, convert to Python floats
    embedding_result = self._model_instance.invoke_text_embedding(
        texts=[text],
        user=self._user
    )
    embedding_results = embedding_result.embeddings[0]
    embedding_results = (embedding_results / np.linalg.norm(embedding_results)).tolist()

    try:
        # encode embedding to base64
        embedding_vector = np.array(embedding_results)
        vector_bytes = embedding_vector.tobytes()
        # Transform to Base64
        encoded_vector = base64.b64encode(vector_bytes)
        # Transform to string
        encoded_str = encoded_vector.decode("utf-8")
        redis_client.setex(embedding_cache_key, 600, encoded_str)
    except IntegrityError:
        db.session.rollback()
    except Exception:
        # a cache write failure should not fail the query itself
        logging.exception('Failed to add embedding to redis')

    return embedding_results

This ensures that both the cache-hit path and the cache-miss path return a list of native Python floats, preventing the TypeError during JSON serialization [1].
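To check the caching logic without a running Redis server, the encode and decode steps can be pulled into two small helpers and exercised against an in-memory stand-in. This is only a sketch: `FakeRedis`, `encode_embedding`, `decode_embedding`, and the `demo_key` name are hypothetical and not part of Dify, and the stand-in ignores the TTL.

```python
import base64

import numpy as np


class FakeRedis:
    """Hypothetical in-memory stand-in that mimics only get() and setex()."""

    def __init__(self):
        self._store = {}

    def setex(self, key, ttl_seconds, value):
        # The TTL is ignored here; values are kept as bytes, which is what
        # a Redis client returns by default.
        self._store[key] = value.encode("utf-8")

    def get(self, key):
        return self._store.get(key)


def encode_embedding(vector: list[float]) -> str:
    """Serialize a vector as base64 text over raw float64 bytes."""
    raw = np.asarray(vector, dtype=np.float64).tobytes()
    return base64.b64encode(raw).decode("utf-8")


def decode_embedding(payload: bytes) -> list[float]:
    """Inverse of encode_embedding; returns built-in Python floats."""
    return np.frombuffer(base64.b64decode(payload), dtype=np.float64).tolist()


cache = FakeRedis()
vector = [0.1, 0.2, 0.3]
cache.setex("demo_key", 600, encode_embedding(vector))
restored = decode_embedding(cache.get("demo_key"))
```

Spelling out `np.float64` on both sides keeps the write and read paths in agreement about the byte layout, rather than relying on the platform meaning of `"float"`.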
