man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

MemoryError when saving a dataframe with large strings to TickStore #810

Closed alanbogossian closed 5 years ago

alanbogossian commented 5 years ago

Arctic Version

1.79.2

Arctic Store

TickStore

Platform and version

Python 3.6.7, Linux Mint 19 Cinnamon 64-bit

Description of problem and/or code sample that reproduces the issue

Hi, I'm trying to save the following data: https://drive.google.com/file/d/1dWWBNvx6vjyNK4kjZTVL4-YM0fmWxT5b/view?usp=sharing

to TickStore, code: https://pastebin.com/jEqXxq2t

and getting a MemoryError, see the stack traces: https://pastebin.com/Uy4pYAfH

I'm quite new to arctic, so I might be doing something wrong; I would appreciate it if you could guide me through this.

Side question: given the nature of my data (two columns: a timestamp and a long string/JSON message), what is the best way to store it with arctic?

Thanks, Alan

shashank88 commented 5 years ago

TickStore is probably not what you want if you are storing strings in your dataset. VersionStore (which is the default) or ChunkStore should be more suitable. I can take a further look at your data later.
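
A minimal sketch of what those writes might look like (the host, library names, and the tiny stand-in frame are placeholders, not code from the thread; ChunkStore's DateChunker expects the index level to be named 'date'):

import arctic
import pandas as pd

# a tiny stand-in for the real frame: tz-aware timestamps + raw JSON strings
df = pd.DataFrame(
    {'response': ['{"jsonrpc": "2.0"}', '{"jsonrpc": "2.0"}']},
    index=pd.to_datetime(['2019-08-03 10:05:36', '2019-08-03 10:05:37'], utc=True))
df.index.name = 'date'

store = arctic.Arctic('localhost')

# VersionStore (the default library type)
store.initialize_library('test.version', lib_type=arctic.VERSION_STORE)
store['test.version'].write('raw_messages', df)

# ChunkStore, chunked by calendar day
store.initialize_library('test.chunk', lib_type=arctic.CHUNK_STORE)
store['test.chunk'].write('raw_messages', df, chunk_size='D')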

alanbogossian commented 5 years ago

Thanks for your reply. Please let me know if you also have the MemoryError and if you find anything wrong with the data.

alanbogossian commented 5 years ago

Hi @shashank88

Just to give a bit more background: I am recording tick by tick market data and saving to arctic TickStore every minute.

I then read (issue #301) that it is possible to achieve some compression by reading back the data for an entire day and saving it again into Arctic.

This is what I am trying to do here (as sketched below), and I am getting the MemoryError on that last step. I would like to confirm whether ChunkStore and VersionStore are appropriate for this.

It would also be great if you could check why we are having this memory issue. I've tried saving slices of my dataframe, but the error does not happen consistently on the same row; it appears at random, so I was not able to isolate the row causing it.

Thanks!
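
For reference, a hedged sketch of that daily re-write workflow (the pastebin code is not reproduced here; host, library name, and symbol are placeholders, with the symbol guessed from the CSV name later in the thread):

from arctic import Arctic
from arctic.date import DateRange
import pandas as pd

store = Arctic('localhost')
lib = store['tickdata']  # assumes an existing TICK_STORE library

# read back one full day of minute-by-minute writes...
day = DateRange(pd.Timestamp('2019-08-03', tz='UTC'),
                pd.Timestamp('2019-08-04', tz='UTC'))
df = lib.read('bitflyer_FX_BTC_JPY', date_range=day)

# ...then drop the small per-minute buckets and re-write the day in one
# pass, letting TickStore build larger, better-compressed buckets
lib.delete('bitflyer_FX_BTC_JPY', date_range=day)
lib.write('bitflyer_FX_BTC_JPY', df)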

alanbogossian commented 5 years ago

Hi @shashank88,

I've tried storing with VERSION_STORE (reading all the minute inputs and saving the whole day at once), and the on-disk size is actually worse than TICK_STORE with input saved every minute.

I've then tried storing with CHUNK_STORE, and this seems even worse: either no disk savings on some data (which I have not shared here), or the same MemoryError on the data shared in the first post, although the message this time is slightly different:

  File "/home/alan/test.py", line 68, in __init__
    self.store_lib_chunk_symb.write(symbol_, df)
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/chunkstore/chunkstore.py", line 357, in write
    data = self.serializer.serialize(record)
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 188, in serialize
    ret = self.converter.docify(df)
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 124, in docify
    raise e
  File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 119, in docify
    arrays.append(arr.tostring())
MemoryError

bmoscon commented 5 years ago

You don't have enough memory for the operation. An entire day of data is likely very large, and pandas is very memory-intensive.
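
One quick way to check this (a suggestion, not from the thread) is to measure the frame's real in-memory footprint before writing; the CSV name is taken from the session further down:

import pandas as pd

df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv')
# deep=True counts the actual string payloads, not just pointer sizes
print(df.memory_usage(deep=True).sum() / 2**20, 'MiB')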

alanbogossian commented 5 years ago

Hi @bmoscon, thanks for your reply

The test I did (data provided in the first message) was actually for just two hours of recording.

Saving to VERSION_STORE works without issue, and I've had no issue saving to CSV either. The dataframe causing the problem has fewer than 60,000 rows. The values are large strings, though (JSON messages): these are raw messages coming from the exchanges.

bmoscon commented 5 years ago

If you want to store that in arctic, you should probably parse the dictionary and store the data in columns. Strings can be problematic, especially very large ones like the ones you have here. If you look at the code, version store operates in a wholly different manner from tickstore and chunkstore, so it's not surprising that one works and the others don't.
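
A minimal sketch of that suggestion, assuming each row of the response column holds one JSON document (pd.json_normalize needs pandas 1.0+; older versions expose it as pandas.io.json.json_normalize):

import json
import pandas as pd

def parse_messages(df):
    # one dict per raw message
    records = [json.loads(raw) for raw in df['response']]
    parsed = pd.json_normalize(records)  # one column per JSON field
    parsed.index = df.index              # keep the original timestamps
    return parsed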

alanbogossian commented 5 years ago

Thanks for your reply.

Candid question (I'm a real novice in this area): why is storing a string problematic?

I thought about storing the parsed message; however, the raw message may contain data that I currently ignore but might need in the future, so I thought I should store the raw message. Also, if I ever find an issue with my parsing function, it is safer to have the raw data stored. Would you recommend against using arctic for storing raw messages? If so, what should I use? Or should I avoid storing raw messages altogether?

bmoscon commented 5 years ago

If you want to store raw data, I'd recommend something else, like redis or memcached.
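
For illustration only (the key names and host are hypothetical, not something from the thread), raw messages could be parked in Redis as a sorted set scored by the epoch timestamp, so time-range reads stay cheap:

import redis

r = redis.Redis(host='localhost', port=6379)

def store_raw(key, epoch_ts, raw_message):
    # redis-py 3.x: zadd takes a {member: score} mapping; note that
    # identical raw strings would collapse into a single member
    r.zadd(key, {raw_message: epoch_ts})

def read_range(key, start_ts, end_ts):
    return r.zrangebyscore(key, start_ts, end_ts)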

alanbogossian commented 5 years ago

Thanks, Bryant. We want to store historical data over several years, so we should probably not use redis or memcached?

By the way, have you found any issue with the data I tried to save? It is still not clear to me why we got this error message in the first place. I understand your point that we should not store raw messages in arctic, but I would still like to know what caused the error.

bmoscon commented 5 years ago

It works for me:


In [1]: import arctic

In [2]: import pandas as pd

In [3]: df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv')

In [4]: df.index = pd.to_datetime(df.index, utc=True)

In [5]: a = arctic.Arctic('127.0.0.1')

In [7]: a.initialize_library('temp-test', arctic.TICK_STORE)

In [8]: lib = a['temp-test']

In [9]: lib.write('testdata', df)
UserWarning: Discarding nonzero nanoseconds in conversion
UserWarning: Discarding nonzero nanoseconds in conversion
  bucket, initial_image = TickStore._pandas_to_bucket(x[i:i + self._chunk_size], symbol, initial_image)
NB treating all values as 'exists' - no longer sparse
FutureWarning: The 'convert_datetime64' parameter is deprecated and will be removed in a future version
  recs = df.to_records(convert_datetime64=False)
FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.
  array = TickStore._ensure_supported_dtypes(recs[col])

In [10]: lib.read('testdata')
 FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(dtype, int):
Out[10]:
                                                 Unnamed: 0                                           response
1969-12-31 19:00:00-05:00  2019-08-03 10:05:36.666000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:36.777000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:36.880000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:37.056000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 10:05:37.225000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
...                                                     ...                                                ...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:58.734000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:58.855000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:58.958000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:59.066000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00  2019-08-03 12:14:59.185000+08:00  {"jsonrpc": "2.0", "method": "channelMessage",...

[54679 rows x 2 columns]

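A side note on the session above (an editorial observation, not something confirmed in the thread): every index entry reads 1969-12-31 19:00:00-05:00, which is what pd.to_datetime produces when applied to the default integer RangeIndex (interpreted as nanoseconds since the epoch) rather than to the timestamp column. A sketch that parses the first CSV column as the index instead:

import pandas as pd

df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv')
# parse the '2019-08-03 10:05:36.666000+08:00' strings as a tz-aware index
df.index = pd.to_datetime(df.iloc[:, 0], utc=True)
df = df.drop(columns=df.columns[0])
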
bmoscon commented 5 years ago

Closing this as it's not reproducible.