jina-ai / executors

internal-only
Apache License 2.0
31 stars 12 forks source link

MongoDBIndexer: use .tolist to nparrays #88

Closed cristianmtr closed 3 years ago

cristianmtr commented 3 years ago

Quoting from Slack:

Hey everyone, my colleague just found a bug in your MongoDBIndexer. Please use .tolist() instead of list() for converting a numpy array to a python list:

https://github.com/jina-ai/executors/blob/d943ed10f282081bb4a49ad00617de71fae81dfd/jinahub/indexers/storage/MongoDBStorage/mongo_handler.py#L53

https://github.com/jina-ai/executors/blob/d943ed10f282081bb4a49ad00617de71fae81dfd/jinahub/indexers/storage/MongoDBStorage/mongo_handler.py#L69

Bad:

item['embedding'] = list(doc.embedding.flatten()) # causes errors when inserting in db, because it creates a list of numpy scalars

Fix: item['embedding'] = doc.embedding.flatten().tolist() # the right way; creates a list of python floats

Error: InvalidDocument("cannot encode object: -0.43712485, of type: <class 'numpy.float32'>")

vivek2301 commented 3 years ago

Hi @cristianmtr,

Can you help me with an example code to reproduce this bug? I'll test it and reopen the PR.

sauravgarg540 commented 3 years ago

Hi @cristianmtr, I would like to work on this issue.

cristianmtr commented 3 years ago

Hey @sauravgarg540

Thanks for the initiative!

We haven't been able to reproduce the issue actually. It was reported to us by one of our EAPs. If you can reproduce it in a test, then we'd appreciate that and the PR to fix it.

sauravgarg540 commented 3 years ago

Hey @cristianmtr , happy to contribute.

I did some debugging and it doesn't seem like a bug to me. The issue description is correct about the list of numpy.float and python float but the reason why the issue is not reproducible is that it doesn't matter if the array being stored in the DB is an array of numpy float or python float, embedding is stored as an array of Double in the MongoDb and retrieved as float from the DB.

Please let me know if I am missing something.

cristianmtr commented 3 years ago

Then I guess we need to be consistent with the data types. If you can make a PR cleaning that up, it would be appreaciated.

sauravgarg540 commented 3 years ago

PR created #160

cristianmtr commented 3 years ago

Closed by #160