man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

mongodump and mongorestore library - Blob (not pure dataframe) #967

Open fengster123 opened 2 years ago

fengster123 commented 2 years ago

Arctic Version

# 1.80.0

Arctic Store

# VersionStore

Platform and version

Spyder (Python 3.8)

Description of problem and/or code sample that reproduces the issue

Hi, I use mongodump and mongorestore to move libraries between PCs (let me know if there is an easier way). From MongoDB's point of view, each library (in this case mine is called "attribution_europe_data") consists of 5 collections: attribution_europe_data / ....ARCTIC / ....snapshots / ...version_nums / ...versions. The mongodump process produces 2 files per collection, so 10 files in total for each library.

I successfully managed to mongorestore those 10 files onto a separate PC, i.e. I can do things like print(Arctic('localhost')['attribution_europe_data'].list_symbols()).
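For reference, the dump/restore commands look roughly like this. This is only a sketch: the database name "arctic" is an assumption (it is Arctic's default, but check with `show dbs` in the mongo shell), and the hosts/paths are placeholders.

```shell
# On the old PC: dump the whole database that holds the Arctic libraries.
# This captures all 5 collections per library (the 10 files mentioned above).
mongodump --host localhost --db arctic --out ./dump

# Copy ./dump to the new PC, then restore it there.
# --drop replaces any existing collections of the same name.
mongorestore --host localhost --drop ./dump
```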

(screenshot: output of list_symbols)

Now, each symbol in my library represents a pandas DataFrame (actually they are saved as Blob, since they contain Objects) of around 5000 rows x 2000 columns. The issue is that if I read one on the new PC, e.g. "Arctic('localhost')['attribution_europe_data'].read('20220913').data" in Spyder, it freezes and eventually shows "Restarting kernel...".

(screenshot: Spyder "Restarting kernel...")

It shouldn't be a memory issue reading a DataFrame of that size: I generated a randomly-filled DataFrame of similar size on the same PC and reading it back is fine.

As a test, I used the same mongodump and mongorestore method on a smaller, simpler library, which consists of a single very simple symbol: the dictionary {'hi': 1}. The new PC (where I restored it) is able to read this library and this symbol without any issue. Similarly, when I use the same method on a pure DataFrame (as opposed to a Blob), it works as well!

So do you think the mongodump and mongorestore process corrupts Blob objects?
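One way I can think of to test the corruption theory is to compare the raw segment bytes in the library's main collection on both PCs, bypassing Arctic entirely. This is only a sketch: it assumes the segment documents carry "segment" and "data" fields (worth verifying in your own collection, as field names may differ by Arctic version), and the commented-out usage needs pymongo. The digest helper itself is plain Python:

```python
import hashlib

def segment_digest(docs):
    """Combined SHA-1 over (segment, bytes) pairs, order-independent.

    Each doc is a mapping with a "segment" number and binary "data";
    sorting makes the digest independent of cursor order, so the same
    stored bytes give the same digest on both PCs.
    """
    h = hashlib.sha1()
    for seg, data in sorted((d["segment"], bytes(d["data"])) for d in docs):
        h.update(str(seg).encode())
        h.update(data)
    return h.hexdigest()

# Assumed usage on each PC (field and collection names are assumptions):
# from pymongo import MongoClient
# coll = MongoClient("localhost")["arctic"]["attribution_europe_data"]
# docs = coll.find({"symbol": "20220913"}, {"segment": 1, "data": 1})
# print(segment_digest(docs))
```

If the digests match on both PCs, the bytes survived the dump/restore intact and the problem is on the read path rather than corruption.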

Also, what do you guys normally use to transfer Arctic libraries from one PC to another? Surely there is a simpler way than mongodump and mongorestore?
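One alternative I'm considering stays entirely inside Arctic's public API (list_symbols / read / write): point two Arctic instances at the two machines and copy symbol by symbol. A sketch, assuming both mongods are reachable from one machine; since each symbol is re-serialized on write, this would also sidestep any byte-level issue in the dump:

```python
def copy_library(src, dst):
    """Copy every symbol from one Arctic library to another.

    `src` and `dst` are VersionStore-like libraries exposing
    list_symbols(), read(symbol) and write(symbol, data).
    Only the latest version of each symbol is copied.
    """
    copied = []
    for symbol in src.list_symbols():
        item = src.read(symbol)   # VersionedItem; payload is in .data
        dst.write(symbol, item.data)
        copied.append(symbol)
    return copied

# Assumed usage (hostnames are placeholders):
# from arctic import Arctic
# src = Arctic("old-pc-host")["attribution_europe_data"]
# dst = Arctic("localhost")["attribution_europe_data"]
# copy_library(src, dst)
```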

============== Update after further investigation:

1) if the symbol is a DataFrame that is NOT saved as a Blob, it works
2) if the symbol is a dict, say {'hi': 1}, it works
3) if the symbol is a Blob, it DOES NOT work (i.e. reading that symbol from the restored library on the new PC fails)
4) if the symbol is a dict wrapped around a pure DataFrame, e.g. {'hi': pd.DataFrame(np.random.rand(2, 2))}, it works
5) if the symbol is a dict wrapped around a Blob, e.g. {'hi': some_blob}, it DOES NOT work

I have included what the symbol looks like on the old PC, and the error it throws on the new PC, for the case where the symbol is a dict wrapped around a Blob:

(old PC) (screenshot: the symbol's contents)

(new PC) (screenshot: the error raised on read)