Open KonstantinLitvin opened 4 years ago
Have you found a clever way of doing this? I've been thinking about this same problem since most of the time we are not refreshing the entire series of data, and mainly updating from last update until today. And maybe some symbols don't need updating altogether.
I would think that you need to save the last updated date in the metadata.
a chunksize of 1 year seems like a bad idea unless you frequently are reading/writing data of that size
I think @KonstantinLitvin's issue is similar to the one in #610
> a chunksize of 1 year seems like a bad idea unless you frequently are reading/writing data of that size
Yes, I usually read 1-3 years of daily (or weekly) data, and reading with a one-year chunk size is quite fast compared with a 10-year or 1-month chunk size.
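To put rough numbers on that trade-off, here is a small sketch that counts how many chunks a read has to touch. It assumes chunks are aligned to calendar months or years and that read cost grows with the number of chunks touched, which is a simplification. `chunks_touched` is a hypothetical helper for illustration, not part of Arctic.

```python
from datetime import date


def chunks_touched(start, end, chunk_size):
    """Count calendar chunks ('Y' or 'M') overlapping [start, end], inclusive."""
    if chunk_size == "Y":
        return end.year - start.year + 1
    if chunk_size == "M":
        return (end.year - start.year) * 12 + (end.month - start.month) + 1
    raise ValueError(f"unsupported chunk size: {chunk_size!r}")


# A three-year read, which is the upper end of the range mentioned above.
start, end = date(2018, 1, 1), date(2020, 12, 31)
yearly = chunks_touched(start, end, "Y")    # 3 chunks
monthly = chunks_touched(start, end, "M")   # 36 chunks
print(yearly, monthly)
```

Under these assumptions a three-year read touches 3 yearly chunks versus 36 monthly ones. The flip side is that a weekly append lands in one chunk either way, but that chunk holds far more data when it spans a year.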
> Have you found a clever way of doing this? I've been thinking about this same problem since most of the time we are not refreshing the entire series of data, and mainly updating from last update until today. And maybe some symbols don't need updating altogether.
> I would think that you need to save the last updated date in the metadata.
Yes, I use metadata for this purpose:
def append(self, symbol, data_frame, metadata=None):
    metadata = {} if metadata is None else metadata
    metadata.update(self.read_metadata(symbol))
    # Prefer the cached last index; fall back to looking it up from the store.
    last_index = metadata.get('last_index')
    if last_index is None:
        last_index = self.get_last_index(symbol)
    overlaps = False
    if last_index in data_frame.index:
        # Keep only rows from the last stored index onwards.
        data_frame = data_frame.loc[last_index:]
        overlaps = True
    if not data_frame.empty:
        if overlaps:
            # Drop the first row, which is already stored.
            data_frame = data_frame.iloc[1:]
            if data_frame.empty:
                logger.info("no new data")
                return
        len_before_update = self.get_length(symbol)
        len_chunk = len(data_frame)
        self.library.append(symbol, data_frame)
        len_after_update = self.get_length(symbol)
        # Sanity check: the symbol grew by exactly the number of appended rows.
        assert len_before_update + len_chunk == len_after_update
        logger.info(f"{len_chunk} rows were updated")
        metadata.update({'last_index': self._get_last_index(data_frame)})
        self.write_metadata(symbol, metadata)
    else:
        logger.info("no new data")
    if self.duplicates_test(symbol):
        logger.warning(f'found duplicates; library: {self.library_name}, symbol: {symbol}')
Is there any way to get the index of the last element in ChunkStore other than reading the whole last chunk? I'd like to implement functionality to append new data to ChunkStore without producing copies of the same elements. Basically, I have a new data_frame that overlaps with the data frame already in ChunkStore, and I need to cut it so that it starts from the end of the stored one. I've tried writing metadata with the last date_time_index, but maybe there is a more elegant way to do that? I also thought about using update(...) instead of append(...), but I don't know if it's a good idea to rewrite a whole '1Y' chunk when I only need to add around one week of data.