vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Joining on a column with NaN values gives IndexError sporadically #2077

Closed: emilykl closed this issue 2 years ago

emilykl commented 2 years ago

Description

Calling join() on a Vaex DataFrame whose join column contains a NaN value raises an IndexError intermittently. The following minimal example fails about 75% of the time on my machine:

import vaex
import pandas as pd
import numpy as np

df_pandas_1 = pd.DataFrame({
    "id": ["a", "b", "c", np.nan],
    "count_1": [1, 2, 3, 4],
})
df_pandas_2 = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "count_2": [5, 6, 7, 8],
})

df_1 = vaex.from_pandas(df_pandas_1)
df_2 = vaex.from_pandas(df_pandas_2)

df_join = df_1.join(df_2, on="id")

print(df_join.to_pandas_df())

Resulting stack trace:

Traceback (most recent call last):
  File "/Users/ekl/code/sandbox/join_bug_minimal.py", line 20, in <module>
    print(df_join.to_pandas_df())
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 3336, in to_pandas_df
    return create_pdf(self.to_dict(column_names=column_names, selection=selection, parallel=parallel, array_type=array_type))
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 3251, in to_dict
    return dict(list(zip(column_names, [array_types.convert(chunk, array_type) for chunk in self.evaluate(column_names, selection=selection, parallel=parallel)])))
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 3090, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size, progress=progress)
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 6441, in _evaluate_implementation
    arrays[expression] = arrays[expression][0:end-start]  # materialize fancy columns (lazy, indexed)
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 582, in __getitem__
    for chunk_start, chunk_end, chunks in ds.chunk_iterator([self.name]):
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 913, in chunk_iterator
    yield from self._default_chunk_iterator(self._columns, columns, chunk_size, reverse=reverse)
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 508, in _default_chunk_iterator
    yield i1, i2, reader()
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 499, in reader
    chunks = {k: array_map[k][i1:i2] for k in columns}
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 499, in <dictcomp>
    chunks = {k: array_map[k][i1:i2] for k in columns}
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/column.py", line 382, in __getitem__
    ar = ar_unfiltered[take_indices]
IndexError: index -110 is out of bounds for axis 0 with size 4

(The index referenced in the last line is different with each run of the code)

maartenbreddels commented 2 years ago

Many thanks for opening this issue. We had heard about join issues that gave similar stack traces, but could never reproduce them. Thanks to this issue we can now fix it (in #2079)!
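
For anyone on a Vaex version that does not include the fix, one possible workaround is to set aside the rows whose join key is missing before handing the data to Vaex. This is only a sketch and was not proposed in this thread: the helper names `split_missing_keys` and `join_skipping_missing_keys` are hypothetical, and it assumes the missing key is what triggers the error, as the minimal example suggests.

```python
import numpy as np
import pandas as pd


def split_missing_keys(df: pd.DataFrame, key: str):
    """Return (rows with a usable join key, rows whose key is NaN/None)."""
    missing = df[key].isna()
    usable = df[~missing].reset_index(drop=True)
    dropped = df[missing].reset_index(drop=True)
    return usable, dropped


def join_skipping_missing_keys(left: pd.DataFrame, right: pd.DataFrame, key: str):
    """Join in Vaex using only the left rows that have a usable key (hypothetical helper)."""
    import vaex  # imported lazily so the pandas-only helper works without Vaex installed

    left_ok, _left_missing = split_missing_keys(left, key)
    return vaex.from_pandas(left_ok).join(vaex.from_pandas(right), on=key)


# Same left-hand frame as in the report above.
df_pandas_1 = pd.DataFrame({
    "id": ["a", "b", "c", np.nan],
    "count_1": [1, 2, 3, 4],
})
left_ok, left_missing = split_missing_keys(df_pandas_1, "id")
print(left_ok)       # the three rows with keys "a", "b", "c"
print(left_missing)  # the single row whose "id" is NaN
```

Note that this changes the result: the row with the missing key no longer appears in the joined DataFrame. If that row is needed, it can be appended back on the pandas side after the join, with its right-hand columns left empty.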