vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Export to parquet fails when first row of virtual column is None #2260

Open · vladmihaisima opened this issue 1 year ago

vladmihaisima commented 1 year ago

Description: When trying to export a vaex dataframe that contains a column whose first row is None, the export fails. This does not happen if the None is in any other row. A workaround is to sort the dataframe on that column, which places the None values at the end.

import pandas as pd
import vaex as vx
df = vx.from_pandas(pd.DataFrame(data={'col1':['chr1','chr2'],'col2':[3,4]}))
# WORKS (first row a string)
df['REF'] = df.apply(lambda c, p: "c" if p == 3 else None, arguments=[df.col1, df.col2])
df.export('works.parquet')
# FAILS (first row a None)
df['REF'] = df.apply(lambda c, p: "c" if p == 4 else None, arguments=[df.col1, df.col2])
df.export('fails.parquet')

Exception reported:

  File "<stdin>", line 1, in <module>
  File "/home/vsima/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 6731, in export
    self.export_parquet(path, progress=progress, parallel=parallel, chunk_size=chunk_size, fs_options=fs_options, fs=fs)
  File "/home/vsima/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 6817, in export_parquet
    self.export_arrow(writer, progress=progress, chunk_size=chunk_size, parallel=parallel, reduce_large=True)
  File "/home/vsima/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 6775, in export_arrow
    write(to)
  File "/home/vsima/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 6758, in write
    writer.write_table(table)
  File "/home/vsima/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1052, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
col1: string
col2: int64
REF: string vs.
file:
col1: string
col2: int64
REF: null

Software information

Additional information: This does not happen with CSV, but it does happen with HDF5 (although the error is different for HDF5).
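
For reference, the sort workaround mentioned in the description would look roughly like this (an illustrative sketch; it relies on sorting placing the None values at the end, as described above):

# Workaround sketch: sort on the virtual column so the None row is no longer first,
# then export.
df['REF'] = df.apply(lambda c, p: "c" if p == 4 else None, arguments=[df.col1, df.col2])
df_sorted = df.sort('REF')
df_sorted.export('workaround.parquet')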

JovanVeljanoski commented 1 year ago

Hi,

Thanks for the report. I cannot classify this as a bug, simply because what you are trying to do is not officially supported. In essence, if you check df.dtypes after creating the REF column in the failing example, you will see that you are creating a column of dtype object, which is not supported. Vaex only supports a single type per expression / column.

It might be confusing why an object dtype is not created the first time around. This is (probably, I am not sure) because vaex looks at the earlier elements of the expression and infers the common type. It is more of a safeguard than something you should rely on; a drawback of using apply in this case.

I vaguely remember that in the past we had an idea to provide an option for users to specify the expected dtype of the output, but for whatever reason it was never implemented.

Does this make sense?

Other random bits: for a minimal example like this, you do not need to go through pandas; you can construct the dataframe directly with vaex.from_dict.

vladmihaisima commented 1 year ago

Thanks for the quick answer. Some observations:

I think such behaviour is not ideal, especially because it depends on the order of the input data, which is somewhat unpredictable. I first saw this in a single data set when running the code on hundreds of input files, and it never happened in testing (because none of the test data sets had a None in the first row).

Regarding the random bits: I picked from_pandas only for the example; we do not load from any in-memory format (that would mostly defeat the purpose of lazy processing), but I hope that next time I need to write a unit test I will remember to use from_dict.

JovanVeljanoski commented 1 year ago

It is a bit of a weird case. In practice you get dtype object in both cases, do you agree? In both cases one is mixing types. In the first case vaex can infer / guess that the dtype should likely be string, so it goes there. In the other case it does not (which I see is a sort of edge case). I do think it should be handled better in any case.

But this is the danger of using apply: one can execute arbitrary code that can produce anything. So one of the (sort of strict, I would say) guidelines is that the output should always be of the same type. At the moment, that responsibility is put on the user.
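
For example, one way to keep the output type consistent is to always return a string, e.g. an empty string instead of None (an illustrative sketch; the choice of sentinel is up to you):

# Sketch: always return a string from apply so the resulting column has a single dtype.
df['REF'] = df.apply(lambda c, p: "c" if p == 4 else "", arguments=[df.col1, df.col2])
df.export('consistent.parquet')  # the schema is now plain string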

Yeah, we looked into it, but it was never implemented. The astype(str) crashing is unfortunate but not that surprising, since objects are not officially supported and anything you do with them is considered out of scope. Honestly, I would prefer if things always failed in these cases.

But I think we do need some sort of improvement here, so I want to keep this issue open until we can resolve it properly.

Can you maybe describe the intent of the original example code? Do you want to mark certain samples as missing values or nans? Or something completely different?

vladmihaisima commented 1 year ago

For the example code I get different dtypes for REF: string in one case and object in the other (I think this is related to the issue). So, for the code:

import pandas as pd
import vaex as vx
df = vx.from_pandas(pd.DataFrame(data={'col1':['chr1','chr2'],'col2':[3,4]}))
df['REF'] = df.apply(lambda c, p: "c" if p == 3 else None, arguments=[df.col1, df.col2])
print(df.dtypes)
df['REF'] = df.apply(lambda c, p: "c" if p == 4 else None, arguments=[df.col1, df.col2])
print(df.dtypes)

I get:

col1    string
col2     int64
REF     string
dtype: object
col1    string
col2     int64
REF     object
dtype: object

According to https://vaex.readthedocs.io/en/latest/guides/missing_or_invalid_data.html there are 3 types of missing/invalid values. As the desired type of the column is string, NaN does not make any sense, but I would still like to be able to use "missing".

The original case is that additional string columns are computed, but that is not possible for all rows.

So, can the user, when using apply, specify that some values are missing / not available? In Python the most natural way would be to use None. In fact, from_dict interprets None as missing (which I think is natural). Example:

df = vx.from_dict({'col1':[None,'chr2']})
df['REF'] = df.apply(lambda c: c, arguments=[df.col1])
df

I get:

  #  col1    REF
  0  --      None
  1  chr2    chr2

(and the dtype for col1 is also correctly computed as string). In conclusion, I think apply should behave like from_dict for None values.

JovanVeljanoski commented 1 year ago

Do you agree that in both cases you should get dtype object? In both cases you are mixing strings and None (None is its own type, NoneType). So if there is any bug, it is that you get string where it should be object instead. I think in some cases we tried to protect users and convert to string when possible or appropriate, but this does not cover everything.

Ok, so right now it is not possible to create missing values with vaex. If you want to exclude some rows, it is better to create a boolean expression and filter by that.
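
For instance, using the columns from your example, the filtering approach could look roughly like this (an illustrative sketch):

# Sketch: filter out the rows for which REF cannot be computed, then export the rest.
df_ok = df[df.col2 == 3]
df_ok['REF'] = df_ok.apply(lambda c, p: "c", arguments=[df_ok.col1, df_ok.col2])
df_ok.export('filtered.parquet')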

vladmihaisima commented 1 year ago

"You should get a dtype of object" => I understand now that this is the way it works now. Maybe a comment in the apply function return docstring (https://vaex.readthedocs.io/en/latest/api.html#vaex.dataframe.DataFrame.apply) should mention this expectation (the apply function should return only one of the supported types described at https://vaex.readthedocs.io/en/latest/guides/data_types.html#String-support-in-Vaex otherwise Object is used).

Just to be clear, I would prefer that "mixing strings and None" (always) resulted in type string with missing values, like in the example at https://vaex.readthedocs.io/en/latest/guides/data_types.html#String-support-in-Vaex, where the dtype of column y is string.

My use case involves normalising some data sets (some columns can be computed based on others) and dumping them for further processing, so filtering does not fit. When chaining virtual columns with many other operations I stumbled across other bugs, and some of the computations are compute-intensive, so I would prefer to do them only once. I did some workarounds and it works for my case for now.

JovanVeljanoski commented 1 year ago

Ok, let's see if we can improve this, at least the docs if not the overall behavior. Thanks for the report!

Also, maybe this deserves a separate thread so as not to take the discussion off-topic, but chaining of vaex virtual columns is not recommended. You should assign a new virtual column to the dataframe, and then do the next operation on it, and so on. @maartenbreddels can explain the reason for this far better than I can. If issues still persist, we'd like to know about them :)
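
As an illustration of that pattern (a sketch with made-up column names, building on the example above):

# Sketch: assign each intermediate result as its own virtual column,
# then build the next expression on top of it.
df['double'] = df.col2 * 2
df['double_plus_one'] = df.double + 1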

vladmihaisima commented 1 year ago

By chaining I did not mean function chaining (sorry for the confusion), but what you describe (creating more virtual columns, applying other operations on them, and repeating that). I will try to isolate the more complex case as well and file another bug... Thanks for the work on this great framework, it makes life easier!