vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] vaex.concat(resolver="strict") incorrectly cares about hidden column mismatches #2310

Open NickCrews opened 1 year ago

NickCrews commented 1 year ago

This works (as expected):

import vaex

df1 = vaex.from_arrays(x=[1, 2, 3])
df2 = vaex.from_arrays(x=[4, 5, 6])
vaex.concat([df1, df2], resolver="strict")

This also works (as expected) with the default resolver, even after reassigning "x":

df1 = vaex.from_arrays(x=[1, 2, 3])
df2 = vaex.from_arrays(x=[4, 5, 6])
df2["x"] = df2["x"] + 10
vaex.concat([df1, df2])

But if I make that strict, it fails:

df1 = vaex.from_arrays(x=[1, 2, 3])
df2 = vaex.from_arrays(x=[4, 5, 6])
df2["x"] = df2["x"] + 10
vaex.concat([df1, df2], resolver="strict")

Although both DataFrames expose the same single public column, "x", df2 now also carries a hidden column "__x" (presumably the original data that vaex kept when "x" was reassigned), and strict mode rejects the mismatch:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[56], line 4
      2 df2 = vaex.from_arrays(x=[4, 5, 6])
      3 df2["x"] = df2["x"] + 10
----> 4 vaex.concat([df1, df2], resolver="strict")

File ~/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.10/site-packages/vaex/__init__.py:827, in concat(dfs, resolver)
    822 '''Concatenate a list of DataFrames.
    823 
    824 :param resolver: How to resolve schema conflicts, see :meth:`DataFrame.concat`.
    825 '''
    826 df, *tail = dfs
--> 827 return df.concat(*tail, resolver=resolver)

File ~/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.10/site-packages/vaex/dataframe.py:6303, in DataFrameLocal.concat(self, resolver, *others)
   6301 first, *tail = dfs
   6302 # concatenate all datasets
-> 6303 dataset = first.dataset.concat(*[df.dataset for df in tail], resolver=resolver)
   6304 df_concat = vaex.dataframe.DataFrameLocal(dataset)
   6306 for name in list(first.virtual_columns.keys()):

File ~/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.10/site-packages/vaex/dataset.py:448, in Dataset.concat(self, resolver, *others)
    446     else:
    447         datasets.extend([other])
--> 448 return DatasetConcatenated(datasets, resolver=resolver)

File ~/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.10/site-packages/vaex/dataset.py:699, in DatasetConcatenated.__init__(self, datasets, resolver)
    697         r = set(datasets[0])
    698         diff = l ^ r
--> 699         raise NameError(f'Concatenating datasets with different names: {l} and {r} (difference: {diff})')
    700 self._schema = datasets[0].schema()
    701 self._shapes = datasets[0].shapes()

NameError: Concatenating datasets with different names: {'__x', 'x'} and {'x'} (difference: {'__x'})

I would expect "strict" mode to only care about the public columns matching.
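
To make the expectation concrete, here is a minimal sketch of the comparison I have in mind. It is plain Python, not vaex code: `public_names` and `check_strict` are hypothetical helpers, and it assumes hidden columns are exactly those whose names start with a double underscore, as "__x" does in the error above.

```python
def public_names(names):
    """Return the column names that are not hidden (no leading "__")."""
    return {name for name in names if not name.startswith("__")}


def check_strict(schemas):
    """Raise NameError if the public column names differ between schemas.

    `schemas` is a list of iterables of column names, one per DataFrame.
    Hidden names are dropped before the comparison.
    """
    first, *rest = [public_names(schema) for schema in schemas]
    for other in rest:
        if other != first:
            diff = first ^ other  # names present in only one of the two
            raise NameError(
                "Concatenating datasets with different names: "
                f"{sorted(other)} and {sorted(first)} "
                f"(difference: {sorted(diff)})"
            )


# The two name sets from the error message above: "__x" is ignored,
# so this call returns without raising.
check_strict([{"x"}, {"__x", "x"}])
```

A real mismatch in public names, such as `check_strict([{"x"}, {"y"}])`, would still raise `NameError` under this sketch, so strict mode would keep its purpose.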