Description
Calling join() on a Vaex DataFrame whose join column contains a NaN value raises an IndexError intermittently rather than on every run. The following minimal example fails about 75% of the time on my machine:
Resulting stack trace:
Traceback (most recent call last):
  File "/Users/ekl/code/sandbox/join_bug_minimal.py", line 20, in <module>
    print(df_join.to_pandas_df())
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 3336, in to_pandas_df
    return create_pdf(self.to_dict(column_names=column_names, selection=selection, parallel=parallel, array_type=array_type))
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 3251, in to_dict
    return dict(list(zip(column_names, [array_types.convert(chunk, array_type) for chunk in self.evaluate(column_names, selection=selection, parallel=parallel)])))
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 3090, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size, progress=progress)
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataframe.py", line 6441, in _evaluate_implementation
    arrays[expression] = arrays[expression][0:end-start] # materialize fancy columns (lazy, indexed)
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 582, in __getitem__
    for chunk_start, chunk_end, chunks in ds.chunk_iterator([self.name]):
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 913, in chunk_iterator
    yield from self._default_chunk_iterator(self._columns, columns, chunk_size, reverse=reverse)
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 508, in _default_chunk_iterator
    yield i1, i2, reader()
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 499, in reader
    chunks = {k: array_map[k][i1:i2] for k in columns}
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/dataset.py", line 499, in <dictcomp>
    chunks = {k: array_map[k][i1:i2] for k in columns}
  File "/Users/ekl/opt/miniconda3/envs/sandbox/lib/python3.9/site-packages/vaex/column.py", line 382, in __getitem__
    ar = ar_unfiltered[take_indices]
IndexError: index -110 is out of bounds for axis 0 with size 4
(The index reported in the last line differs on each run of the code.)
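As a purely illustrative aside, here is a plain-Python sketch of why NaN keys need explicit handling in a join. It is not Vaex's implementation, and it makes no claim about the actual root cause, which this report does not describe. It relies on two facts: NaN never compares equal to itself, so a NaN key finds no match, and the row that stays unmatched has to be masked instead of being used as a take index. All names (match_rows, take, left_ids, right_ids, right_vals) are hypothetical.

```python
import math


def is_nan(key):
    """True only for float NaN values."""
    return isinstance(key, float) and math.isnan(key)


def match_rows(left_keys, right_keys):
    """For each left key, return the row of the first equal right key, or None."""
    first_seen = {}
    for row, key in enumerate(right_keys):
        if is_nan(key):
            continue  # NaN != NaN, so a NaN can never act as a join key
        first_seen.setdefault(key, row)
    # Unmatched rows (including NaN keys) are marked None, not a sentinel index.
    return [None if is_nan(key) else first_seen.get(key) for key in left_keys]


def take(values, indices):
    """Gather values by index; None stays missing, bad indices raise."""
    out = []
    for i in indices:
        if i is None:
            out.append(None)
        elif 0 <= i < len(values):
            out.append(values[i])
        else:
            raise IndexError(f"index {i} is out of bounds for size {len(values)}")
    return out


# Hypothetical data: both sides contain a NaN in the key column.
left_ids = [1.0, float("nan"), 3.0]
right_ids = [3.0, 1.0, float("nan")]
right_vals = ["c", "a", "n"]

take_indices = match_rows(left_ids, right_ids)
joined = take(right_vals, take_indices)
print(take_indices, joined)
```

In this sketch the NaN row on the left ends up with a missing value, because the unmatched row is carried as None and never reaches the indexing step.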
Software information
Vaex version (import vaex; vaex.__version__): 4.9.1
Vaex was installed via: pip (but within a miniconda environment)
Python version: 3.9.0
OS: Mac OS 11.2.2
Additional information
Joining in the reverse direction does not raise an error: df_join = df_2.join(df_1, on="id") works fine.
Many thanks for opening this issue. We had heard about join issues that gave similar stack traces, but could never reproduce them before. Thanks to this issue, we can now fix it (in #2079)!