jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.99k stars 2.22k forks

Inconsistent embeddings from Executor #5561

Closed bgonzalezfractal closed 1 year ago

bgonzalezfractal commented 1 year ago

Describe the bug A Jina Executor from the hub produces inconsistent results depending on how it is called. With a dataset of 8000 stable diffusion prompts, the embeddings generated by:

# create a DocumentArray from the DataFrame
da_custom = DocumentArray()
for idx, row in df.iterrows():
    doc = Document(text=row['prompt'].strip(),
                   tags={'topic': row['label_txt'],
                         'date': row['day'],
                         'width': row['width'],
                         'height': row['height'],
                         'image': row['image']})
    da_custom.append(doc)
# encode everything at once with the local Executor from the hub
executor_custom.encode(da_custom, {})

differ from:

# create a DocumentArray
da_custom = DocumentArray()
# loop over the DataFrame; each row is encoded individually and appended
for idx, row in df.iterrows():
    doc = Document(text=row['prompt'].strip(),
                   tags={'topic': row['label_txt'],
                         'date': row['day'],
                         'width': row['width'],
                         'height': row['height'],
                         'image': row['image']})
    docs = DocumentArray(doc)
    executor_custom.encode(docs, {})
    doc.embedding = docs[0].embedding
    da_custom.append(doc)

The embeddings should be exactly the same for every item in the DocumentArray, but with the first method some examples come back with embeddings filled with zeros:

(screenshot: embeddings from the first method, some rows all-zero)

While with the second method, which applies exactly the same model to exactly the same text, we get:

(screenshot: embeddings from the second method, all non-zero)

As you can see, the results are clearly different even though both use the same Executor. This could greatly change results during development; any ideas?
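A quick way to spot the affected documents is to check which embedding rows are entirely zero. Below is a minimal sketch with a toy matrix standing in for the stacked embeddings (in docarray you could get the real matrix from `da_custom.embeddings`; the toy values here are made up for illustration):

```python
import numpy as np

# toy stand-in for the stacked embeddings from the first method
embeddings = np.array([[0.1, 0.2],
                       [0.0, 0.0],   # an all-zero embedding, as seen in the screenshot
                       [0.3, 0.4]])

# indices of documents whose embedding is entirely zero
zero_rows = np.where(~embeddings.any(axis=1))[0]
print(zero_rows)  # → [1]
```

Comparing the size of `zero_rows` between the two methods makes the inconsistency easy to quantify.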

-- UPDATE: I've also tried encoding in batches of 1000, and that works, so it seems the problem is encoding the whole thing at once? Though the batching logic is not that intuitive, any ideas? So far two methods have worked; the only difference is the volume being encoded with the Executor at a time.

%%time

da_custom = DocumentArray()

factor = 1000
# split the DataFrame into groups of `factor` rows and encode each group separately
for k, g in df.groupby(np.arange(len(df)) // factor):
    print(f'INDEX {k}')
    print(f'DF LEN {len(g)}')
    # build a DocumentArray (the format used for neural search)
    da_tmp = DocumentArray(Document(text=x['prompt'].strip(),
                                    tags={'topic': x['label_txt'], 'date': x['day'],
                                          'width': x['width'], 'height': x['height'],
                                          'image': x['image']}) for idx, x in g.iterrows())
    executor_custom.encode(da_tmp, {})
    time.sleep(5)
    da_custom.extend(da_tmp)
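The grouping trick above is just fixed-size chunking; the same idea can be sketched generically in plain Python, without pandas (a sketch, not the Executor's own batching):

```python
def batches(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 8000 prompts in batches of 1000 gives 8 full chunks
chunks = list(batches(list(range(8000)), 1000))
print(len(chunks), len(chunks[0]))  # → 8 1000
```

Each chunk can then be wrapped in its own DocumentArray and passed to `encode`, as in the snippet above.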

Describe how you solve it I had to encode DocumentArrays of length 1 and append them one by one to get correct embeddings.


Environment

Screenshots Uploaded

JoanFM commented 1 year ago

Can you share the code of your custom Executor?

bgonzalezfractal commented 1 year ago

I'm sorry @JoanFM, I was not available for a while; this conversation continued on Slack. For anyone facing issues with custom PyTorch fine-tuned models uploaded to the hub: remember to apply this after the model downloads.

import torch

# disable gradient tracking while computing embeddings
with torch.no_grad():
    executor.encode(da, {})

This will avoid gradient calculation when generating embeddings and you are good to go.
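As a standalone illustration of what `torch.no_grad()` does (a minimal sketch, not the Executor itself): tensors produced inside the context do not track gradients, which avoids building the autograd graph and saves memory during inference.

```python
import torch

x = torch.ones(3, requires_grad=True)

y = x * 2            # outside no_grad: result tracks gradients
with torch.no_grad():
    z = x * 2        # inside no_grad: result does not

print(y.requires_grad, z.requires_grad)  # → True False
```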

We got great results using fine-tuned models: our match consistency went from 70% to 85-90%.