jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.99k stars 2.22k forks

Inconsistent embeddings from Executor #5561

Closed bgonzalezfractal closed 1 year ago

bgonzalezfractal commented 1 year ago

Describe the bug A Jina Executor from the hub produces inconsistent results depending on how it is called. With a dataset of 8000 stable diffusion prompts, the embeddings generated by:

# create a DocumentArray from the DataFrame
da_custom = DocumentArray()
for idx, row in df.iterrows():
    doc = Document(text=row['prompt'].strip(),
                   tags={'topic': row['label_txt'],
                         'date': row['day'],
                         'width': row['width'],
                         'height': row['height'],
                         'image': row['image']})
    da_custom.append(doc)
# encode everything at once with the local Executor from the hub
executor_custom.encode(da_custom, {})

differ from:

# create a DocumentArray
da_custom = DocumentArray()
# loop over the DataFrame; each row is encoded individually and appended
for idx, row in df.iterrows():
    doc = Document(text=row['prompt'].strip(),
                   tags={'topic': row['label_txt'],
                         'date': row['day'],
                         'width': row['width'],
                         'height': row['height'],
                         'image': row['image']})
    docs = DocumentArray(doc)
    executor_custom.encode(docs, {})
    doc.embedding = docs[0].embedding
    da_custom.append(doc)

The embeddings should be exactly the same for every item in the DocumentArray, but with the first method some examples come back with embeddings filled with zeros:

(screenshot: embeddings from the first method, some rows all-zero)

While with the second method, which applies exactly the same model to exactly the same text, we get:

(screenshot: embeddings from the second method, all non-zero)

As you can see, the results are clearly different even though both use the same Executor. This could greatly change results during development; any ideas?
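A quick way to spot the affected documents is to check which embedding rows are entirely zero. Below is a minimal sketch with a toy matrix standing in for the stacked embeddings (in docarray you could get the real matrix from `da_custom.embeddings`; the toy values here are made up for illustration):

```python
import numpy as np

# toy stand-in for the stacked embeddings from the first method
embeddings = np.array([[0.1, 0.2],
                       [0.0, 0.0],   # an all-zero embedding, as seen in the screenshot
                       [0.3, 0.4]])

# indices of documents whose embedding is entirely zero
zero_rows = np.where(~embeddings.any(axis=1))[0]
print(zero_rows)  # → [1]
```

Comparing the size of `zero_rows` between the two methods makes the inconsistency easy to quantify.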

-- UPDATE: I've also tried encoding in batches of 1000, and that works, so it seems the problem is encoding the whole thing at once? Though the batching logic is not that intuitive, any ideas? So far two methods have worked; the only difference is the volume being encoded with the Executor at a time.

%%time

da_custom = DocumentArray()

factor = 1000
# split the DataFrame into groups of `factor` rows and encode each group separately
for k, g in df.groupby(np.arange(len(df)) // factor):
    print(f'INDEX {k}')
    print(f'DF LEN {len(g)}')
    # build a DocumentArray (the format used for neural search)
    da_tmp = DocumentArray(Document(text=x['prompt'].strip(),
                                    tags={'topic': x['label_txt'], 'date': x['day'],
                                          'width': x['width'], 'height': x['height'],
                                          'image': x['image']}) for idx, x in g.iterrows())
    executor_custom.encode(da_tmp, {})
    time.sleep(5)
    da_custom.extend(da_tmp)
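The grouping trick above is just fixed-size chunking; the same idea can be sketched generically in plain Python, without pandas (a sketch, not the Executor's own batching):

```python
def batches(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 8000 prompts in batches of 1000 gives 8 full chunks
chunks = list(batches(list(range(8000)), 1000))
print(len(chunks), len(chunks[0]))  # → 8 1000
```

Each chunk can then be wrapped in its own DocumentArray and passed to `encode`, as in the snippet above.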

Describe how you solve it I had to encode DocumentArrays of length 1 and append them one by one to get correct embeddings.


Environment

Screenshots Uploaded

JoanFM commented 1 year ago

Can you share the code of your custom Executor?

bgonzalezfractal commented 1 year ago

I'm sorry @JoanFM, I was not available for a while; this conversation continued on Slack. For anyone facing issues with custom PyTorch fine-tuned models uploaded to the hub: remember to apply this after the model downloads.

import torch

# disable gradient tracking while computing embeddings
with torch.no_grad():
    executor.encode(da, {})

This will avoid gradient calculation when generating embeddings and you are good to go.
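As a standalone illustration of what `torch.no_grad()` does (a minimal sketch, not the Executor itself): tensors produced inside the context do not track gradients, which avoids building the autograd graph and saves memory during inference.

```python
import torch

x = torch.ones(3, requires_grad=True)

y = x * 2            # outside no_grad: result tracks gradients
with torch.no_grad():
    z = x * 2        # inside no_grad: result does not

print(y.requires_grad, z.requires_grad)  # → True False
```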

We got great results using fine-tuned models: our match consistency went from 70% to 85-90%.