vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] AttributeError: 'bool' object has no attribute 'null_count' during df.filter on simple inequality #2327

Open leprechaunt33 opened 1 year ago

leprechaunt33 commented 1 year ago

Description: I am working with a copy of the VAERS dataset, converted from CSV to HDF5 with latin1 encoding.

Data sanity is not guaranteed in our use case; specifically, missing values may exist in a column being queried (these are masked by default by vaex's use of the Arrow/ChunkedArray interface under the hood).

My program makes extensive use of vaex expressions. During testing I have run into several inconsistencies that do not seem to be explained by the documentation or by a cursory review of the code.

  1. The inequality operator used on some columns yields a column-not-found error: KeyError: 'Unknown variables or column: "DIED != \'Y\'"'. Using the equality operator does not yield the same error. This occurs regardless of whether the table is joined or not. A second error is also raised: AttributeError: 'bool' object has no attribute 'null_count'.
  2. The dataset that causes this error is the result of using open_many, i.e. it is a combination of several HDF5 files. There is nothing special about the data in the column; the value is either missing or a string (in the case of DIED above, the string 'Y').

To work around this, I use str_contains with the negation operator, which does not result in the boolean operation error. However, this is inefficient and I would like to be able to use the inequality operator rather than making assumptions about the state of the data now or in the future.
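The concern about relying on a negated substring match can be illustrated without vaex. The sketch below uses plain Python lists, with None standing in for a missing value. The sample values, the helper names, and the choice to treat a missing value as "not equal" are all assumptions made for illustration; they do not describe vaex's own semantics.

```python
# Plain-Python illustration; None stands in for a missing value.
# The sample data is hypothetical and not taken from VAERS.
died = ["Y", None, "N", "YES", "Y"]


def not_equal(value, target):
    """Strict inequality; a missing value counts as 'not equal' (assumption)."""
    return value is None or value != target


def not_contains(value, pattern):
    """Negated substring match; a missing value counts as 'no match' (assumption)."""
    return value is None or pattern not in value


by_inequality = [v for v in died if not_equal(v, "Y")]
by_negated_contains = [v for v in died if not_contains(v, "Y")]

print(by_inequality)        # [None, 'N', 'YES']
print(by_negated_contains)  # [None, 'N']
```

The two filters agree only while every non-missing value is exactly 'Y' or contains no 'Y' at all. A value such as 'YES' is kept by the inequality but dropped by the negated substring match, which is the kind of assumption about the data that the workaround depends on.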

Additionally, as several users have pointed out, the bitwise operators are used for or/and (using the logical or/and yields an error relating to the AST). However, these do not work as advertised and seem to return inconsistent results. The image below shows the result set size for two traitlets and for their combination using or. Given that the legacy column ER_VISIT alone matches just over 194K rows, that should be the minimum record count returned by the combination. Instead, it returns 181K records. In fact, the data in the two columns is mutually exclusive, so the correct count for the combination should be the sum of the two individual counts.

image
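The expectation about the combined count follows from inclusion-exclusion: for two filters A and B, the count for A or B lies between max(|A|, |B|) and |A| + |B|, and equals |A| + |B| - |A and B|. A minimal sketch of that sanity check in plain Python follows. The function name is hypothetical, the 194K and 181K figures are the rounded values quoted above, and the 5,000 used for the second column is an invented placeholder because its actual count is not stated.

```python
def check_or_count(n_a, n_b, n_or, n_and=None):
    """Return True if n_or is a plausible row count for the filter (A | B)."""
    lower = max(n_a, n_b)   # the union is at least as large as either filter
    upper = n_a + n_b       # and at most their sum (reached when disjoint)
    if not lower <= n_or <= upper:
        return False
    if n_and is not None:
        # With the overlap known, inclusion-exclusion fixes the exact count.
        return n_or == n_a + n_b - n_and
    return True


# Small hypothetical masks for two mutually exclusive filters.
a = [True, False, False, True, False]
b = [False, True, False, False, False]
n_a, n_b = sum(a), sum(b)
n_or = sum(x or y for x, y in zip(a, b))
n_and = sum(x and y for x, y in zip(a, b))

print(check_or_count(n_a, n_b, n_or, n_and))       # True
# Rounded figures from the report; 5_000 is a placeholder, not a real count.
print(check_or_count(194_000, 5_000, 181_000))     # False
```

Whatever the second column's count is, a combined result below 194K fails the lower bound, which is why the 181K figure looks wrong.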

To work around this I use two methods:

Searching the existing issues, I was unable to find any indication of whether the inequality issue above has been fixed, and installing an updated version on our Python 3.9 environment is non-trivial, so I'd like to determine whether either of these problems has been encountered or resolved previously.

I do not believe the issue is data-set specific, although the error message seems to suggest it relates to the missing data. However, filling missing values on the column in question does not resolve the issue, and I am concerned I may run into this issue on future datasets.

Software information: {'vaex-core': '4.14.0', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.13.0', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.2', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}

EDIT: Adding the stack trace from the null_count error:

    File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\numpy_dispatch.py:74, in closure.<locals>.operator(a, b)
         72 result_data = op['op'](a_data, b_data)
         73 if isinstance(a, NumpyDispatch):
    ---> 74     result_data = a.add_missing(result_data)
         75 if isinstance(b, NumpyDispatch):
         76     result_data = b.add_missing(result_data)

    File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\numpy_dispatch.py:32, in NumpyDispatch.add_missing(self, ar)
         29     # else: both numpy, handled by numpy
         30 else:
         31     if isinstance(self._array, vaex.array_types.supported_arrow_array_types):
    ---> 32         ar = combine_missing(ar, self._array)
         33     # else: was numpy, handled by numpy
         34 return ar

    File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\utils.py:50, in combine_missing(a, b)
         48 def combine_missing(a, b):
         49     # return a copy of a with missing values of a and b combined
    ---> 50     if a.null_count > 0 or b.null_count > 0:
         51         a, b = vaex.arrow.convert.align(a, b)
         52         if isinstance(a, pa.ChunkedArray):
         53             # divide and conquer

    AttributeError: 'bool' object has no attribute 'null_count'
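Read together, the frames suggest that the comparison produced a single Python bool rather than an array, and that this bool was then passed to combine_missing as its first argument. Why the comparison returned a scalar is not established here. The sketch below reproduces only the final step in plain Python: ArrowLikeArray and combine_missing_sketch are hypothetical stand-ins that mimic the null_count check on line 50 of the trace, not vaex's or pyarrow's actual classes.

```python
class ArrowLikeArray:
    """Hypothetical stand-in for an Arrow array; it only exposes null_count."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def null_count(self):
        return sum(v is None for v in self.values)


def combine_missing_sketch(a, b):
    """Mimics only the first check of combine_missing shown in the traceback."""
    return a.null_count > 0 or b.null_count > 0


column = ArrowLikeArray(["Y", None, "Y"])

# Expected case: the comparison result is array-like, so the check works.
elementwise_result = ArrowLikeArray([False, None, False])
needs_merge = combine_missing_sketch(elementwise_result, column)

# Failing case: a scalar bool arrives where an array was expected.
try:
    combine_missing_sketch(True, column)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(needs_merge)
print(error_message)
```

If this reading is right, the interesting question for a fix is what the underlying comparison returned for this column, since the failure in combine_missing is only the place where the scalar is first noticed.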

leprechaunt33 commented 1 year ago

Additionally, while it seems unlikely that the data source is the primary issue, I include below code used to join the tables together in case replicating the issue proves problematic.

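    seenv = dict()  # VAERS_ID values encountered so far

    def seenid(x):
        # Return True only the first time a given id is seen.
        if x in seenv:
            return False
        else:
            seenv[x] = 1
            return True

    self.update_status("Getting unique ids from VAERSVAX table")
    # Keep only the first row for each VAERS_ID in the 'vax' table.
    vaxdf2 = self.df['vax'][self.df['vax'].apply(seenid, arguments=[self.df['vax'].VAERS_ID])]
    self.update_progress(400)
    self.update_status("Joining with DATA frame...")
    # Left-join the de-duplicated 'vax' rows onto the 'data' table by VAERS_ID.
    dfdata = self.df['data'].join(vaxdf2, on='VAERS_ID', how='left', allow_duplication=True)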

(This removes duplicates on the key column which appear in the second table to ensure the joined result is sane. The duplicate data is seemingly an artifact which does not add to the dataset)
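The first-occurrence filter described above can be tried in isolation. This is a minimal plain-Python sketch of the same idea, assuming keys are visited once and in order; make_first_seen_filter and the sample ids are hypothetical and not part of the original program.

```python
def make_first_seen_filter():
    """Build a predicate that is True only the first time it sees each key."""
    seen = set()

    def first_seen(key):
        if key in seen:
            return False
        seen.add(key)
        return True

    return first_seen


vaers_ids = [101, 102, 101, 103, 102]  # hypothetical key column with duplicates

first_seen = make_first_seen_filter()
mask = [first_seen(k) for k in vaers_ids]
unique_rows = [k for k, keep in zip(vaers_ids, mask) if keep]

print(mask)         # [True, True, False, True, False]
print(unique_rows)  # [101, 102, 103]

# The predicate is stateful: a second pass over the same keys keeps nothing.
second_pass = [first_seen(k) for k in vaers_ids]
print(second_pass)  # [False, False, False, False, False]
```

Because the predicate remembers every key it has seen, its result depends on how many times and in what order it is called. Wrapping the state in a factory, as here, makes it easy to start from a clean filter for each pass.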