man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

Fix missing column handling in numpy serializer. #860

Closed BaiBaiHi closed 4 years ago

BaiBaiHi commented 4 years ago

A bug in the numpy serializer's objify function breaks DataFrame construction when the columns argument includes a column that does not exist in the stored document.

import numpy as np
import pandas as pd
from arctic.serialization.numpy_arrays import FrameConverter, FrametoArraySerializer

f = FrameConverter()
df = pd.DataFrame(data={'one': ['a', 'b', 'c', np.nan]})
res = f.objify(f.docify(df), columns=['one', 'two'])
Traceback (most recent call last):
  File "/home/abai/pyenvs/arctic/lib/python3.6/site-packages/ipython-7.7.0-py3.6.egg/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-9e264fac0a78>", line 1, in <module>
    res = f.objify(f.docify(df), columns=['one', 'two'])
  File "/home/abai/code/arctic/arctic/serialization/numpy_arrays.py", line 166, in objify
    return pd.DataFrame(data, columns=cols, copy=True)[cols]
  File "/home/abai/pyenvs/arctic/lib/python3.6/site-packages/pandas-0.22.0+ahl1-py3.6-linux-x86_64.egg/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/abai/pyenvs/arctic/lib/python3.6/site-packages/pandas-0.22.0+ahl1-py3.6-linux-x86_64.egg/pandas/core/frame.py", line 419, in _init_dict
    extract_index(list(data.values()))
  File "/home/abai/pyenvs/arctic/lib/python3.6/site-packages/pandas-0.22.0+ahl1-py3.6-linux-x86_64.egg/pandas/core/frame.py", line 6218, in extract_index
    raise ValueError('arrays must all be same length')
ValueError: arrays must all be same length

The intended behavior appears to be to return a column of NaNs for the missing column. Passing the NaN as a scalar instead of wrapping it in a list lets pandas broadcast it to the full length during DataFrame construction.
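For illustration, the same behavior can be reproduced with pandas alone, without arctic. This is a minimal sketch that assumes the missing column is currently filled with a one-element list `[np.nan]` (inferred from the description above, not copied from the arctic source). The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

cols = ['one', 'two']
# Stand-in for the decoded data of the column that does exist (4 rows).
present = np.array(['a', 'b', 'c', None], dtype=object)

# Missing column as a one-element list: lengths 4 and 1 disagree, so pandas raises.
try:
    pd.DataFrame({'one': present, 'two': [np.nan]}, columns=cols)
    list_form_failed = False
except ValueError:
    list_form_failed = True

# Missing column as a scalar: pandas broadcasts the NaN to every row.
res = pd.DataFrame({'one': present, 'two': np.nan}, columns=cols)[cols]
```

With the scalar form, `res` has four rows and `two` is entirely NaN, which matches the intended output described above.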

T-Santos commented 4 years ago

@bmoscon Any thoughts on this proposed fix? We've seen this come up a few times now. When the schema changes over time, queries for historical dates blow up if lib.read is called with a list of columns and one or more of those columns exist only in more recent data.
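Until a fix lands, one possible caller-side workaround is to read without the column list and align the columns afterwards. The sketch below uses only pandas. `with_requested_columns` is a hypothetical helper, not part of arctic, and `old` stands in for whatever a historical read returns.

```python
import pandas as pd

def with_requested_columns(df, columns):
    """Hypothetical helper: keep only `columns`, adding all-NaN columns for absent ones."""
    return df.reindex(columns=columns)

# Stand-in for historical data written before column 'two' existed.
old = pd.DataFrame({'one': [1.0, 2.0]},
                   index=pd.date_range('2015-01-01', periods=2))

wanted = ['one', 'two']
aligned = with_requested_columns(old, wanted)
```

Here `aligned` keeps the original index and gains a `two` column filled with NaN. The trade-off is that every stored column is read before the selection is applied.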