kszucs / pandahouse

Pandas interface for Clickhouse database
MIT License
228 stars 70 forks source link

DateTime64(ns) KeyError, when ns is set in table definition #28

Open Lytmbot opened 4 years ago

Lytmbot commented 4 years ago

Hey, thanks for your great work, this module has been really helpful!

using clickhouse-server version: 20.3.8.53

I have a small problem with DateTime64(ns) where the size of ns has been explicitly set in the table definition.

With a clickhouse table as follows:

 CREATE TABLE db_name.tbl_name (
    Timestamp DateTime64(6) CODEC(Delta(8), LZ4),
    SomeData Float32 CODEC(LZ4)
) 
ENGINE = MergeTree() PARTITION BY toYYYYMMDD(Timestamp)
ORDER BY intHash64(toInt64(Timestamp)) 
SAMPLE BY intHash64(toInt64(Timestamp))

A query:

query = "SELECT DISTINCT * FROM db_name.tbl_name"
connection = {
    'host': server_name,
    'database': db_name,
    'user': 'default'
}    
pandahouse.read_clickhouse(query=query, index=False, connection=connection)

Results in:

pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

pandahouse/convert.py in to_dataframe(lines, **kwargs)
     65     dtypes, parse_dates, converters = {}, [], {}
     66     for name, chtype in zip(names, types):
---> 67         dtype = CH2PD[chtype]
     68         if dtype == 'object':
     69             converters[name] = decode_escapes

KeyError: 'DateTime64(6)

If I understand correctly, the mapping defined earlier in the file convert.py

MAPPING = {'object': 'String',
           'uint64': 'UInt64',
           'uint32': 'UInt32',
           'uint16': 'UInt16',
           'uint8': 'UInt8',
           'float64': 'Float64',
           'float32': 'Float32',
           'int64': 'Int64',
           'int32': 'Int32',
           'int16': 'Int16',
           'int8': 'Int8',
           'datetime64[D]': 'Date',
           'datetime64[ns]': 'DateTime'}

does not cover the DateTime64(6) case. Or by extension any other DateTime(ns) case?

I would be happy to contribute a solution with a little guidance.

Thanks

Lytmbot commented 4 years ago

not sure if this is acceptable, but a solution could be as simple as:

def to_dataframe(lines, **kwargs):
    names = lines.readline().decode('utf-8').strip().split('\t')
    types = lines.readline().decode('utf-8').strip().split('\t')

    dtypes, parse_dates, converters = {}, [], {}
    for name, chtype in zip(names, types):

        if chtype.startswith('DateTime64'):
            precs = int(chtype.replace('DateTime64(', '').replace(')', ''))
            chtype = 'DateTime'

        dtype = CH2PD[chtype]
        if dtype == 'object':
            converters[name] = decode_escapes
        elif dtype.startswith('datetime'):
            parse_dates.append(name)
        else:
            dtypes[name] = dtype

    return pd.read_table(lines, header=None, names=names, dtype=dtypes,
                         parse_dates=parse_dates, converters=converters,
                         na_values=set(), keep_default_na=False, **kwargs)
kszucs commented 4 years ago

Thanks for the bug report and the patch.

I'm trying to allocate some time to create a new release in the upcoming week.

kdkavanagh commented 3 years ago

Hi @kszucs have you had a chance to evaluate this? This issue is preventing us from meaningfully using pandahouse unfortunately. If @Lytmbot's proposed soln is agreeable, I'd be happy to submit a PR provided you can cut another release.

kszucs commented 3 years ago

Could you please submit a PR including unittests?

My mid-term plan is to use the newly added arrow and parquet clickhouse output formats, but their type support is incomplete so far.