vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] category_count gives wrong answers #2373

Open zincopper opened 1 year ago

zincopper commented 1 year ago

category_labels also gives wrong answers. Both functions only consider the first chunk of a ChunkedArray.

zincopper commented 1 year ago

Here is how I get the data in:

from pyarrow import csv
import vaex

def skip_error(row):
    # Log each row that fails to parse, then tell Arrow to skip it.
    print('skip_error row:', row)
    return 'skip'

read_options = csv.ReadOptions(column_names=['room_id', 'uid', 'gift_id', 'yuchi_amt', 'dateline'])
parse_options = csv.ParseOptions(invalid_row_handler=skip_error)
# auto_dict_encode makes Arrow dictionary-encode string columns.
convert_options = csv.ConvertOptions(include_missing_columns=True,
                                     auto_dict_encode=True, auto_dict_max_cardinality=800_000_000)

data = vaex.from_csv_arrow(file_path,
                           read_options=read_options, parse_options=parse_options, convert_options=convert_options)