ubc-systopia / Indaleko

Indaleko Project
GNU Affero General Public License v3.0

import_bulk fails to import large input files #41

Open hadisinaee opened 9 months ago

hadisinaee commented 9 months ago

When trying to use the import_bulk API from python-arango, I noticed that it fails to import all the docs. The following is my input to the function:

res = collection.import_bulk(
    documents=map(lambda x: x.to_dict(), documents),
    overwrite=self.reset_collection,
)

I have to call .to_dict on every object because they are instances of the IndalekoObjects class; converting each one to a dictionary makes it JSON serializable.

The error I get is:

Exception: Can't connect to host(s) within limit (3)

The size of the document is 827481.

fsgeek commented 9 months ago

Interesting. I haven't tried to use the bulk uploader API call and I haven't seen this issue with the arangoimport tool, even using it to upload a file of ~5 million entries to a WAN based ArangoDB instance.

Is the issue that by using a lambda, you're injecting the time to construct the dictionary into the "connect" sequence? Would it be more resilient to just build the dictionary first? Plus, one reason I avoided going down this path is concerns I had with batching entries (which the external tool already seems to handle).
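A minimal sketch of the "build first, then batch" idea, for illustration only. In Python 3, map returns a lazy iterator, so the to_dict calls only run when something iterates over it; building a list up front moves that work ahead of the request. The helper names chunked and import_in_batches, and the default batch size, are hypothetical. The only driver call assumed is collection.import_bulk(documents, overwrite=...), as used in the snippet above. Because overwrite clears the collection, the sketch passes it only on the first batch.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_in_batches(collection, objects, batch_size=10_000, reset=False) -> int:
    """Send objects to `collection.import_bulk` in fixed-size batches.

    `collection` is assumed to be a python-arango collection (or anything
    with a compatible import_bulk method). Returns the number of documents
    sent, not the number the server reports as created.
    """
    sent = 0
    for index, batch in enumerate(chunked(objects, batch_size)):
        # Build plain dictionaries before the request is issued.
        docs = [obj.to_dict() for obj in batch]
        # Only the first batch may overwrite; later batches must append.
        collection.import_bulk(docs, overwrite=bool(reset and index == 0))
        sent += len(docs)
    return sent
```

With this shape, each request carries at most batch_size documents, so a failure can be retried for one batch rather than for all 827481.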

hadisinaee commented 8 months ago

> Interesting. I haven't tried to use the bulk uploader API call and I haven't seen this issue with the arangoimport tool, even using it to upload a file of ~5 million entries to a WAN based ArangoDB instance.

Yeah, the arangoimport can handle large files, but the API seems to be tricky to use.

> Is the issue that by using a lambda, you're injecting the time to construct the dictionary into the "connect" sequence? Would it be more resilient to just build the dictionary first? Plus, one reason I avoided going down this path is concerns I had with batching entries (which the external tool already seems to handle).

Yes, it might be that. I can try building the list first and passing it to the function. If that doesn't work, I'll simply run arangoimport from my Python script. I'll give it a try.
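The fallback mentioned here could look roughly like the sketch below: write the objects to a JSON Lines file, then invoke arangoimport through subprocess. The function names, file path, and connection values are hypothetical placeholders, and the sketch assumes arangoimport is on the PATH and that the objects expose to_dict as described earlier in the thread. The flags used (--file, --type, --collection, --create-collection, --overwrite, and the --server.* options) are standard arangoimport options, but it is worth checking them against the installed version.

```python
import json
import subprocess
from typing import Iterable, List


def write_jsonl(objects: Iterable, path: str) -> int:
    """Write one JSON document per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for obj in objects:
            handle.write(json.dumps(obj.to_dict()) + "\n")
            count += 1
    return count


def build_arangoimport_command(
    path: str,
    collection: str,
    database: str,
    endpoint: str,
    username: str,
    password: str,
    overwrite: bool = False,
) -> List[str]:
    """Assemble the argument list for arangoimport (values are placeholders)."""
    command = [
        "arangoimport",
        "--file", path,
        "--type", "jsonl",
        "--collection", collection,
        "--create-collection", "true",
        "--server.endpoint", endpoint,
        "--server.database", database,
        "--server.username", username,
        "--server.password", password,
    ]
    if overwrite:
        # Mirrors reset_collection: drop existing data before importing.
        command += ["--overwrite", "true"]
    return command


def run_arangoimport(command: List[str]) -> None:
    """Run the tool and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and check=True turns a failed import into a Python exception instead of a silent non-zero exit.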

hadisinaee commented 8 months ago

I worked on this issue and tried the following methods: