TickStore is probably not what you want if you are storing strings in your dataset. VersionStore (which is the default) or ChunkStore should be more suitable. I can take a further look at your data later.
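For reference, a minimal sketch of setting up one library per store type, assuming a MongoDB instance on localhost (host, library, and symbol names are placeholders):

import arctic
import pandas as pd

# Tiny placeholder frame; in the real case this is the recorded tick data.
df = pd.DataFrame({'price': [1.0, 2.0]},
                  index=pd.date_range('2019-08-03', periods=2, tz='UTC'))

store = arctic.Arctic('127.0.0.1')   # assumes MongoDB on localhost
store.initialize_library('demo.version', arctic.VERSION_STORE)  # the default
store.initialize_library('demo.chunk', arctic.CHUNK_STORE)

lib = store['demo.version']
lib.write('my_symbol', df)
print(lib.read('my_symbol').data)    # VersionStore reads return a VersionedItem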
Thanks for your reply. Please let me know if you also have the MemoryError and if you find anything wrong with the data.
Hi @shashank88
Just to give a bit more background: I am recording tick-by-tick market data and saving it to an Arctic TickStore every minute.
I then read (issue 301) that it is possible to achieve some compression by reading back the data for the entire day and saving it again into Arctic.
This is what I am trying to do here, roughly as sketched below, and I am getting the MemoryError on that last step. I would like to confirm whether ChunkStore and VersionStore are appropriate for that.
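The re-save step looks roughly like this (a sketch with placeholder library and symbol names, not the exact code from the pastebin):

import datetime as dt
import arctic
from arctic.date import DateRange

store = arctic.Arctic('127.0.0.1')
tick_lib = store['ticks.intraday']                 # placeholder library name

# Read back the ticks written minute-by-minute, delete them, then rewrite the
# whole day in one call so TickStore can build larger, better-packed buckets.
day = DateRange(dt.datetime(2019, 8, 3), dt.datetime(2019, 8, 4))
df = tick_lib.read('FX_BTC_JPY', date_range=day)   # placeholder symbol
tick_lib.delete('FX_BTC_JPY', date_range=day)
tick_lib.write('FX_BTC_JPY', df)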
It would also be great if you could look into why we are hitting this memory issue. I've tried saving slices of my data frame, but the error does not happen consistently on the same row; it appears at random, so I was not able to isolate a single offending row.
Thanks!
Hi @shashank88,
I've tried storing with VERSION_STORE (reading all the minute inputs and saving the whole day at once), and the disk usage is actually worse than TICK_STORE with input saved every minute.
I've then tried CHUNK_STORE, and this seems to be even worse: either no disk savings on some data (which I have not shared here), or the same MemoryError on the data shared in the first post, although the message this time is slightly different:
File "/home/alan/test.py", line 68, in __init__
self.store_lib_chunk_symb.write(symbol_, df)
File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/chunkstore/chunkstore.py", line 357, in write
data = self.serializer.serialize(record)
File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 188, in serialize
ret = self.converter.docify(df)
File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 124, in docify
raise e
File "/home/alan/env/py36/lib/python3.6/site-packages/arctic/serialization/numpy_arrays.py", line 119, in docify
arrays.append(arr.tostring())
MemoryError
You don’t have enough memory to do the operation. An entire day of data is likely very large, and pandas is very memory intensive.
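One quick sanity check is to measure the frame's real in-memory footprint before writing; without deep=True, pandas counts only the pointers of object (string) columns, not the payloads:

import pandas as pd

df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv')
per_col = df.memory_usage(deep=True)   # true per-column byte counts
print(per_col)
print('total: %.1f MB' % (per_col.sum() / 1e6))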
Hi @bmoscon, thanks for your reply
The tests I've done (data provided in the first message) were actually on just two hours of recording.
Saving to VERSION_STORE works without issue, and I've also had no issue saving to CSV. The data frame that fails has fewer than 60,000 rows; the values are large strings, though (JSON payloads): these are the raw messages coming from the exchanges.
If you want to store that in Arctic you should probably parse the dictionary and store the data in columns. Strings can be problematic, especially very large ones like you have here. If you look at the code, VersionStore operates in a wholly different manner than TickStore and ChunkStore, so it's not surprising that one works and the others don't.
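A hedged sketch of that parse-into-columns approach (the 'response' column name is taken from the output further down; on pandas < 1.0, import json_normalize from pandas.io.json instead):

import json
import pandas as pd

df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv',
                 index_col=0, parse_dates=True)

raw = df['response'].map(json.loads)       # one dict per raw message
parsed = pd.json_normalize(raw.tolist())   # nested keys become dotted columns
parsed.index = df.index                    # keep the original timestamps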
Thanks for your reply.
Candid question (I'm quite new to this area): why is storing a string problematic?
I thought about storing the parsed message; however, the raw message may contain data that I currently ignore but might need in the future, so I thought I should store it whole. I also figured that if I ever found a bug in my parsing function, it would be safer to have kept the raw data. So would you recommend against Arctic for storing raw messages? What should I use instead? Or should I avoid storing raw messages altogether?
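For intuition on the string question: the traceback above ends in arr.tostring() inside the column serializer, and a NumPy string array is fixed-width, padded to its longest element, so a few very long messages inflate every row when the column is converted. A small demonstration:

import numpy as np

msgs = ['short'] * 999 + ['x' * 100_000]    # one long message among 1,000
arr = np.array(msgs)                        # dtype becomes '<U100000'
print(arr.dtype, '%.0f MB' % (arr.nbytes / 1e6))   # ~400 MB for 1,000 rows
# Scale this to ~60,000 rows of multi-kilobyte JSON and the serializer's
# tostring() copy can plausibly exhaust RAM.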
If you want to store raw data, I'd recommend something else, like Redis or memcached.
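A minimal sketch of the Redis route, assuming redis-py >= 3.0 and a local server. A sorted set scored by epoch-milliseconds keeps messages time-ordered and queryable by range (note that identical payloads are deduplicated by the set; prefix a sequence number if that matters):

import time
import redis

r = redis.Redis(host='localhost', port=6379)

def store_raw(symbol, message, ts_ms=None):
    # Append one raw message under the symbol's sorted set, scored by time.
    score = ts_ms if ts_ms is not None else int(time.time() * 1000)
    r.zadd('raw:%s' % symbol, {message: score})

# Later: fetch everything between two epoch-millisecond timestamps.
msgs = r.zrangebyscore('raw:FX_BTC_JPY', 1564797930000, 1564805700000)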
Thanks Bryant. We want to store historical data spanning several years, so we should probably not use Redis or memcached?
By the way, have you found any issue with the data I tried to save? It is still not clear to me why we got this error message in the first place. I understand your point that we should not store raw messages in Arctic, but I would still be interested to know what caused the error.
it works for me
In [1]: import arctic
In [2]: import pandas as pd
In [3]: df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv')
In [4]: df.index = pd.to_datetime(df.index, utc=True)
In [5]: a = arctic.Arctic('127.0.0.1')
In [7]: a.initialize_library('temp-test', arctic.TICK_STORE)
In [8]: lib = a['temp-test']
In [9]: lib.write('testdata', df)
UserWarning: Discarding nonzero nanoseconds in conversion
UserWarning: Discarding nonzero nanoseconds in conversion
bucket, initial_image = TickStore._pandas_to_bucket(x[i:i + self._chunk_size], symbol, initial_image)
NB treating all values as 'exists' - no longer sparse
FutureWarning: The 'convert_datetime64' parameter is deprecated and will be removed in a future version
recs = df.to_records(convert_datetime64=False)
FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.
array = TickStore._ensure_supported_dtypes(recs[col])
In [10]: lib.read('testdata')
FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
if np.issubdtype(dtype, int):
Out[10]:
Unnamed: 0 response
1969-12-31 19:00:00-05:00 2019-08-03 10:05:36.666000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:36.777000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:36.880000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.056000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.225000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.273000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.535000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.622000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.731000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:37.839000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.122000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.345000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.515000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.628000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.763000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.844000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:38.951000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:39.080000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:39.216000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:39.305000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:39.423000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:39.642000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:39.909000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.013000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.137000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.357000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.476000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.593000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.704000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 10:05:40.847000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
... ... ...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:55.696000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:55.809000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:55.988000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.056000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.179000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.272000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.384000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.499000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.646000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:56.826000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.055000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.128000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.235000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.368000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.509000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.587000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.711000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.798000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:57.923000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.032000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.143000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.254000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.407000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.495000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.598000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.734000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.855000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:58.958000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:59.066000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
1969-12-31 19:00:00-05:00 2019-08-03 12:14:59.185000+08:00 {"jsonrpc": "2.0", "method": "channelMessage",...
[54679 rows x 2 columns]
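Aside: the 1969-12-31 values in the first index level above are the Unix epoch, which suggests to_datetime() was applied to the CSV's default integer RangeIndex. Reading the first column as the index should recover the real 2019 timestamps (a sketch, untested against this exact file):

import pandas as pd

df = pd.read_csv('tick_store_bitflyer_FX_BTC_JPY_lightning_board.csv',
                 index_col=0)
df.index = pd.to_datetime(df.index, utc=True)   # TickStore needs a tz-aware index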
Closing this as it’s not reproducible.
Arctic Version: 1.79.2
Arctic Store: TickStore
Platform and version: Python 3.6.7, Linux Mint 19 Cinnamon 64-bit
Description of problem and/or code sample that reproduces the issue
Hi, I'm trying to save the following data: https://drive.google.com/file/d/1dWWBNvx6vjyNK4kjZTVL4-YM0fmWxT5b/view?usp=sharing
to TickStore, code: https://pastebin.com/jEqXxq2t
and I get a MemoryError; see the stack traces: https://pastebin.com/Uy4pYAfH
I'm quite new to Arctic, so I might be doing something wrong; I would appreciate it if you could guide me on this.
Side question: considering the nature of my data (two columns: a timestamp and a long JSON string), what is the best way to store it using Arctic?
Thanks, Alan