[BUG-REPORT] AttributeError: 'bool' object has no attribute 'null_count' during df.filter on simple inequality

Description: I am working with a copy of the VAERS dataset, converted from CSV to HDF5 with latin1 encoding.

Data sanity is not guaranteed in our use case. Specifically, missing values may exist in a column being queried (these are masked by default through Vaex's use of the Arrow ChunkedArray interface under the hood).

My program makes extensive use of vaex expressions. During testing I have run into several inconsistencies that are not explained by the documentation or by a cursory review of the code.

The inequality operator, used on some columns, yields a column-not-found error: KeyError: 'Unknown variables or column: "DIED != \'Y\'"'. The equality operator on the same columns does not yield this error. This occurs whether or not the table is joined. A second error is also raised: 'bool' object has no attribute 'null_count'.
The dataset that triggers this error is the result of using open_many, i.e. it is a combination of several HDF5 files. There is nothing special about the data in the column: each value is either missing or a string (in the case of DIED above, the string 'Y').

To work around this, I use str_contains with the negation operator, which does not trigger the boolean error. However, this is inefficient, and I would prefer to use the inequality operator rather than make assumptions about the state of the data now or in the future.

Additionally, as several users have pointed out, the bitwise operators are used for or/and (using the logical or/and yields an error relating to the AST). However, these do not work as advertised and seem to return inconsistent results. The attached image shows the result set size for two traitlets evaluated individually and for their combination using or. Since the legacy column ER_VISIT alone matches just over 194K rows, the combined result should contain at least that many records. Instead, it returns 181K records. In fact, the data in the two columns are mutually exclusive, so the correct count for the combination is the sum of the two individual counts.
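The expectation above follows from basic set counting: the number of rows matching "A or B" can never be smaller than the larger of the two individual counts, and for mutually exclusive conditions it equals their sum. A minimal sketch of that sanity check follows. The function names are hypothetical, the ER_VISIT and total-row figures are the rounded numbers from this report, and the count for the second column is a made-up placeholder because the report does not state it.

```python
def union_count_bounds(count_a, count_b, total_rows):
    """Return the (lowest, highest) possible row count for 'A or B'."""
    lowest = max(count_a, count_b)                 # B adds nothing new
    highest = min(count_a + count_b, total_rows)   # A and B never overlap
    return lowest, highest


def is_plausible_or_count(count_or, count_a, count_b, total_rows):
    """Check an observed 'A or B' count against the possible range."""
    lowest, highest = union_count_bounds(count_a, count_b, total_rows)
    return lowest <= count_or <= highest


TOTAL_ROWS = 2_300_000      # approximate dataset size from the report
ER_VISIT_COUNT = 194_000    # "just over 194K" from the report
OTHER_COUNT = 50_000        # placeholder: not given in the report
OBSERVED_OR_COUNT = 181_000 # count returned for the combined filter

bounds = union_count_bounds(ER_VISIT_COUNT, OTHER_COUNT, TOTAL_ROWS)
plausible = is_plausible_or_count(
    OBSERVED_OR_COUNT, ER_VISIT_COUNT, OTHER_COUNT, TOTAL_ROWS
)
# For mutually exclusive conditions the expected count is simply the sum.
expected_if_exclusive = ER_VISIT_COUNT + OTHER_COUNT
```

Whatever the second count actually is, an "or" result of 181K falls below the 194K lower bound, so the combined filter cannot be computing a plain union.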

To work around this I use two methods:

1. Iteratively apply filter() on traitlets known to work (e.g. simple equalities or regex search) with the relevant and/or mode, reorganizing the traitlets as required to preserve the boolean logic.
2. Evaluate the traitlets and return the boolean result for all 2.3 million rows in the dataset, enumerate the result set and filter down to True values only to obtain the row indices, take the union (or the intersection, for and) of the index sets, then use df.take to take the required indices. However, this is reasonably slow (4.6 seconds with a list comprehension, and unrolling it into a for loop shaves off half a second at most), which makes the output unusable in a GUI without computing the result on a thread and then delayed-updating the GUI on the main thread.
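The second workaround can probably be made faster by combining the boolean results as NumPy arrays instead of enumerating them in Python. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the small hand-written arrays stand in for what df.evaluate would return for each traitlet, and the masks are assumed to contain no missing entries (missing values would need to be filled with False first).

```python
import numpy as np


def rows_matching_any(*masks):
    """Row indices where at least one mask is True (the 'or' case)."""
    combined = np.zeros(len(masks[0]), dtype=bool)
    for mask in masks:
        combined |= np.asarray(mask, dtype=bool)
    return np.flatnonzero(combined)


def rows_matching_all(*masks):
    """Row indices where every mask is True (the 'and' case)."""
    combined = np.ones(len(masks[0]), dtype=bool)
    for mask in masks:
        combined &= np.asarray(mask, dtype=bool)
    return np.flatnonzero(combined)


# Stand-ins for the evaluated traitlets; with a real DataFrame these
# would be the boolean arrays returned by df.evaluate(...).
mask_er_visit = np.array([True, False, False, True, False])
mask_other = np.array([False, True, False, False, False])

any_indices = rows_matching_any(mask_er_visit, mask_other)
all_indices = rows_matching_all(mask_er_visit, mask_other)

# The index arrays could then be passed to df.take(any_indices).
```

Because the combination happens in vectorized NumPy operations, the per-row Python loop and the set union disappear. Whether that is fast enough for interactive GUI use on 2.3 million rows would still need to be measured.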

I searched the existing issues but was unable to find any indication of whether the inequality issue above has been fixed. Installing an updated version on our Python 3.9 environment is non-trivial, so I would like to determine whether either of these problems has been encountered or resolved previously.

I do not believe the issue is dataset-specific, although the error message seems to suggest it relates to the missing data. However, filling missing values in the column in question does not resolve the issue, and I am concerned I may run into it again on future datasets.

Software information: {'vaex-core': '4.14.0', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.13.0', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.2', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}

Vaex was installed via: Mamba in a conda environment
OS: Windows 10

EDIT: Adding stack trace from the null_count error:

File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\numpy_dispatch.py:74, in closure.<locals>.operator(a, b)
     72 result_data = op['op'](a_data, b_data)
     73 if isinstance(a, NumpyDispatch):
---> 74     result_data = a.add_missing(result_data)
     75 if isinstance(b, NumpyDispatch):
     76     result_data = b.add_missing(result_data)

File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\numpy_dispatch.py:32, in NumpyDispatch.add_missing(self, ar)
     29         # else: both numpy, handled by numpy
     30     else:
     31         if isinstance(self._array, vaex.array_types.supported_arrow_array_types):
---> 32             ar = combine_missing(ar, self._array)
     33         # else: was numpy, handled by numpy
     34     return ar

File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\utils.py:50, in combine_missing(a, b)
     48 def combine_missing(a, b):
     49     # return a copy of a with missing values of a and b combined
---> 50     if a.null_count > 0 or b.null_count > 0:
     51         a, b = vaex.arrow.convert.align(a, b)
     52         if isinstance(a, pa.ChunkedArray):
     53             # divide and conquer

AttributeError: 'bool' object has no attribute 'null_count'
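Reading the trace, the first argument to combine_missing is the result of op['op'](a_data, b_data), so the message suggests the comparison returned a single Python bool rather than an element-wise array. Why it did so is not established here. The sketch below only reproduces the failure mode in isolation: combine_missing is reduced to the one check visible on line 50 of the trace, and FakeArrowArray is a hypothetical stand-in for an Arrow array, not a vaex or pyarrow class.

```python
def combine_missing(a, b):
    # Same first check as line 50 of vaex/arrow/utils.py in the trace;
    # the real merging of null masks is omitted in this sketch.
    if a.null_count > 0 or b.null_count > 0:
        raise NotImplementedError("null-mask merging omitted in this sketch")
    return a


class FakeArrowArray:
    """Hypothetical stand-in exposing only the null_count attribute."""

    def __init__(self, null_count):
        self.null_count = null_count


column = FakeArrowArray(null_count=3)

# Assumption: the comparison collapsed to one scalar instead of
# producing one boolean per row.
comparison_result = True

try:
    combine_missing(comparison_result, column)
    message = None
except AttributeError as exc:
    message = str(exc)

print(message)
```

If this reading is right, the missing values in the column only determine that combine_missing is reached at all. The underlying problem would be the scalar comparison result, which would be consistent with filling the missing values not helping.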

vaexio/vaex issue #2327