man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Migrating TickStore to ArcticDB #460

Closed: markeasec closed this issue 1 year ago

markeasec commented 1 year ago

I'm raising this issue here at the suggestion of @mehertz since the arctic repo is not actively monitored / maintained.

Arctic Version

1.80.0

Arctic Store

TickStore

Platform and version

RHEL 7

Description of problem and/or code sample that reproduces the issue

Hello, I have a few TB of tick data in an Arctic TickStore that I want to migrate to the new ArcticDB.

I believe the only publicly available way to do this is to read all the data out of TickStore and write it to ArcticDB. Is this correct?

If so, I was wondering if there is a recommended approach for that. The only way I could think of was to read it in time chunks, say 1 hour at a time, and then write each chunk to ArcticDB. Is there a way to instead iterate over the underlying MongoDB documents, read one at a time, and write the resulting dataframe to ArcticDB? I looked through tickstore.py and couldn't see any methods that would support that, but maybe I missed something, or maybe one of the existing methods could be modified to accomplish this?

My reasons for preferring a document-based approach over a time-chunk approach are: (A) it gives deterministic data sizes in the read/write process, so there is no risk of running out of memory during the job; (B) it seems cleaner to me, since I worry about ticks at the very edge of a time window being read twice, written twice and thus duplicated. Thanks in advance for any help you can provide.
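For reference, the time-chunk approach described above can be sketched roughly as follows. This is only a minimal illustration, not something taken from Arctic or ArcticDB: `half_open_windows`, `migrate`, `read_window` and `write_chunk` are hypothetical names, and the two callables stand in for whatever actually reads a window from TickStore and writes a chunk to ArcticDB. The point is that consecutive windows are half-open, `[lo, hi)`, so a tick exactly on a boundary falls into one window only.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, Iterator, Tuple

Window = Tuple[datetime, datetime]


def half_open_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[Window]:
    """Yield consecutive [lo, hi) windows that together cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter than step
        yield lo, hi
        lo = hi  # the next window starts exactly where this one ended


def migrate(
    read_window: Callable[[datetime, datetime], Any],  # hypothetical: returns rows with lo <= t < hi
    write_chunk: Callable[[Any], None],                # hypothetical: appends one chunk to the target
    start: datetime,
    end: datetime,
    step: timedelta,
) -> int:
    """Copy data window by window and return the number of rows written."""
    total = 0
    for lo, hi in half_open_windows(start, end, step):
        chunk = read_window(lo, hi)
        if len(chunk) > 0:  # len() also works for a pandas DataFrame
            write_chunk(chunk)
            total += len(chunk)
    return total
```

In a real job, `read_window` would wrap the TickStore read for that range and `write_chunk` would wrap the ArcticDB write or append. A real reader may raise instead of returning an empty result for a window with no data, in which case the wrapper would need to handle that. Note that fixed time windows still do not bound memory use: a busy hour can be much larger than a quiet one, which is the concern in (A).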

mehertz commented 1 year ago

Thanks for raising @markeasec. As mentioned in the Arctic repository, we don't currently have a good way to do this that we can offer but your thoughts as to why we should are very reasonable.

I've prioritised this and we'll update this ticket when we make progress, but I can't offer a timeline right now, so I wouldn't advise waiting for this functionality if you can avoid it.

markeasec commented 1 year ago

Thanks. Can you clarify if there is any danger of data in Arctic (not ArcticDB) being read twice due to being 'at' the start/end of a window? Or is the left-hand side of a window always inclusive and the right-hand side always exclusive? If there's no danger of duplicating data with a read/write approach, I will probably just bite the bullet, spin up a huge box and do it that way.

qc00 commented 1 year ago

You will have to ask that on the https://github.com/man-group/arctic repo. Different teams maintain these two code bases.

AFAIK, you can use this DateRange type to specify whether each end of the range is open or closed.
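To illustrate why the open/closed choice matters for the duplication question, here is a small sketch in plain Python. It does not call Arctic at all: `in_window` is a hypothetical helper that just mimics the two interval conventions. With both ends closed, a tick sitting exactly on the shared boundary of two adjacent windows matches both; with a closed start and open end it matches exactly one. How Arctic's DateRange expresses this choice should be checked against the `arctic.date` module rather than assumed from this example.

```python
from datetime import datetime


def in_window(t, lo, hi, include_lo=True, include_hi=False):
    """Return True if t lies in the window, honouring open/closed ends."""
    after_lo = t >= lo if include_lo else t > lo
    before_hi = t <= hi if include_hi else t < hi
    return after_lo and before_hi


nine = datetime(2023, 1, 1, 9)
ten = datetime(2023, 1, 1, 10)
eleven = datetime(2023, 1, 1, 11)

# Two adjacent one-hour windows sharing the 10:00 boundary.
windows = [(nine, ten), (ten, eleven)]
boundary_tick = ten

# Both ends closed: the boundary tick is matched by both windows.
closed_closed_hits = sum(
    in_window(boundary_tick, lo, hi, include_lo=True, include_hi=True)
    for lo, hi in windows
)

# Closed start, open end: the boundary tick is matched by the second window only.
closed_open_hits = sum(
    in_window(boundary_tick, lo, hi, include_lo=True, include_hi=False)
    for lo, hi in windows
)

print(closed_closed_hits, closed_open_hits)
```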

markeasec commented 1 year ago

Thanks for the pointer about DateRange, I will look into that. I had actually originally opened it there and was instructed to raise it here instead.

DennyZen commented 7 months ago

@markeasec Hey mate, how did your TickStore migration go? Did it work out? I'm thinking about migrating as well. See https://github.com/man-group/arctic/issues/1026