Open leprechaunt33 opened 1 year ago
Additionally, while it seems unlikely that the data source is the primary issue, I include below the code used to join the tables together, in case replicating the issue proves problematic.
seenv = dict()
def seenid(x):
    if x in seenv:
        return False
    else:
        seenv[x] = 1
        return True
self.update_status("Getting unique ids from VAERSVAX table")
vaxdf2 = self.df['vax'][self.df['vax'].apply(seenid, arguments=[self.df['vax'].VAERS_ID])]
self.update_progress(400)
self.update_status("Joining with DATA frame...")
dfdata = self.df['data'].join(vaxdf2, on='VAERS_ID', how='left', allow_duplication=True)
(This removes rows with duplicate key values from the second table so that the joined result is sane. The duplicate data appears to be an artifact that adds nothing to the dataset.)
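For anyone trying to replicate this without the VAERS files, the first-occurrence filtering can be sketched in plain Python. This is only an illustration of the intent, not the code from my program: the helper name `first_occurrence_mask` and the sample ids are hypothetical, and no vaex calls are involved.

```python
def first_occurrence_mask(keys):
    """Return one boolean per key: True only the first time that key is seen."""
    seen = set()
    mask = []
    for key in keys:
        mask.append(key not in seen)
        seen.add(key)
    return mask


# Hypothetical key values with repeats, standing in for the VAERS_ID column.
vaers_ids = [101, 102, 102, 103, 101]

mask = first_occurrence_mask(vaers_ids)
unique_ids = [v for v, keep in zip(vaers_ids, mask) if keep]

print(mask)
print(unique_ids)
```

Because the mask depends on state that builds up across rows, it relies on the rows being visited once each, in order.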
Description
I am working with a copy of the VAERS dataset, converted from CSV to HDF5 with encoding latin1.
Data sanity is not guaranteed in our use case. Specifically, missing values may exist in a column being queried (these are masked by default through vaex's use of the Arrow/ChunkedArray interface under the hood).
My program makes extensive use of vaex expressions. In the process of testing I have run into several inconsistencies that are not explained by the documentation or by a cursory review of the code.
To work around this, I use str_contains with the negation operator, which does not result in the boolean operation error. However, this is inefficient and I would like to be able to use the inequality operator rather than making assumptions about the state of the data now or in the future.
Additionally, as several users have pointed out, the bitwise operators are used for or/and (use of the logical or/and yields an error relating to the AST). However, these do not work as advertised and seem to return inconsistent results. The image below shows the result set size for two traitlets and for their combination using or. Given that the legacy column ER_VISIT alone has just over 194K rows, that should be the minimum record count returned. Instead, the combination returns 181K records. In fact, the two columns' data are mutually exclusive, so the correct result of combining the two should be the sum of their individual counts below.
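The expectation above can be written down as a small check. This is a plain-Python sketch with hypothetical toy masks, not output from vaex. It assumes a missing value should count as "no match", which is my intended meaning and not a claim about how vaex evaluates the expression.

```python
def count_true(mask):
    """Count rows where the mask is exactly True; missing values (None) do not count."""
    return sum(1 for m in mask if m is True)


def or_masks(a, b):
    """Row-wise OR in which a missing value (None) is treated as False."""
    return [(x is True) or (y is True) for x, y in zip(a, b)]


def and_masks(a, b):
    """Row-wise AND in which a missing value (None) is treated as False."""
    return [(x is True) and (y is True) for x, y in zip(a, b)]


# Hypothetical toy masks: no row is True in both, and some rows are missing.
legacy = [True, None, False, True, None, False]
current = [None, True, None, False, False, True]

n_legacy = count_true(legacy)
n_current = count_true(current)
n_overlap = count_true(and_masks(legacy, current))
n_or = count_true(or_masks(legacy, current))

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
print(n_legacy, n_current, n_overlap, n_or)
```

Under that assumption the combined count can never be lower than either individual count, and with mutually exclusive masks it equals their sum. A combined count below the larger individual count is the inconsistency I am reporting.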
To work around this I use two methods:
I was unable to find any indication of whether the inequality issue above has been fixed by searching the issues, and installing an updated version on our Python 3.9 environment is non-trivial, so I would like to determine whether either of these problems has been encountered or resolved previously.
I do not believe the issue is data-set specific, although the error message seems to suggest it relates to the missing data. However, filling missing values on the column in question does not resolve the issue, and I am concerned I may run into this issue on future datasets.
Software information
{'vaex-core': '4.14.0', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.13.0', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.2', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}
EDIT Adding stack trace from the null_count error:
File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\numpy_dispatch.py:74, in closure.<locals>.operator(a, b)
72 result_data = op['op'](a_data, b_data)
73 if isinstance(a, NumpyDispatch):
---> 74 result_data = a.add_missing(result_data)
75 if isinstance(b, NumpyDispatch):
76 result_data = b.add_missing(result_data)
File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\numpy_dispatch.py:32, in NumpyDispatch.add_missing(self, ar)
29 # else: both numpy, handled by numpy
30 else:
31 if isinstance(self._array, vaex.array_types.supported_arrow_array_types):
---> 32 ar = combine_missing(ar, self._array)
33 # else: was numpy, handled by numpy
34 return ar
File ~\miniconda3\envs\py39\lib\site-packages\vaex\arrow\utils.py:50, in combine_missing(a, b)
48 def combine_missing(a, b):
49 # return a copy of a with missing values of a and b combined
---> 50 if a.null_count > 0 or b.null_count > 0:
51 a, b = vaex.arrow.convert.align(a, b)
52 if isinstance(a, pa.ChunkedArray):
53 # divide and conquer
AttributeError: 'bool' object has no attribute 'null_count'
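Reading the trace, the first argument to `combine_missing` is `result_data`, the return value of `op['op'](a_data, b_data)`. The message therefore suggests the comparison produced a single Python bool rather than an array. The sketch below only reproduces the message under that assumption. `combine_missing_sketch` and `FakeArrowArray` are hypothetical stand-ins, not vaex code, and the sketch does not explain why the comparison collapsed to a scalar.

```python
class FakeArrowArray:
    """Hypothetical stand-in for an Arrow array; only null_count is modelled."""
    null_count = 1


def combine_missing_sketch(a, b):
    """Mirror only the attribute access shown at utils.py:50 in the trace."""
    return a.null_count > 0 or b.null_count > 0


try:
    # Assumption: the elementwise comparison returned a plain bool instead of an array.
    combine_missing_sketch(True, FakeArrowArray())
except AttributeError as exc:
    message = str(exc)

print(message)
```

If that reading is right, filling missing values on the column would not help, which matches what I observed: the failure comes from the comparison result itself, not from the nulls in the column.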