vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.31k stars, 591 forks

[BUG-REPORT] bug of the export_csv in dataframe.py and possible fix #1063

Closed astromsshin closed 3 years ago

astromsshin commented 4 years ago

Description

I found a bug in the export_csv function implemented in dataframe.py. The following call raises a TypeError:

use_df.export_csv(outfn, header=False, quoting=None)

{
Traceback (most recent call last):
  File "./produce_lc_from_hdf5.py", line 13, in <module>
    use_df.export_csv(outfn, header=False, quoting=None)
  File "/extra_disk/work_dir/Package/miniconda3/envs/astro/lib/python3.7/site-packages/vaex/dataframe.py", line 5855, in export_csv
    chunk_pdf.to_csv(path_or_buf=path, mode=mode, header=header, index=False, **kwargs)
TypeError: to_csv() got multiple values for keyword argument 'header'
}

Software information

Additional information

In dataframe.py,

{
            if i1 == 0:  # Only the 1st chunk should have a header and the rest will be appended
                mode = 'w'
                header = True
            else:
                mode = 'a'
                header = False

            chunk_pdf.to_csv(path_or_buf=path, mode=mode, header=header, index=False, **kwargs)
}

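causes the problem: header and index are passed explicitly and then a second time through **kwargs whenever the caller supplies them. The code should check whether kwargs contains header or index. If it does, the values provided in kwargs should be used instead of the hard-coded ones.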
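The failure can be reproduced without vaex or pandas. The sketch below uses a hypothetical stand-in named `to_csv` (not the real pandas method) and a hypothetical `export_chunk` wrapper that mirrors the call shown above; it only illustrates how Python rejects a keyword that arrives both explicitly and through `**kwargs`.

```python
def to_csv(path_or_buf=None, mode="w", header=True, index=True, **extra):
    """Hypothetical stand-in for pandas.DataFrame.to_csv; it only echoes its arguments."""
    return {"mode": mode, "header": header, "index": index}


def export_chunk(first_chunk, **kwargs):
    """Hypothetical wrapper that mirrors the call in export_csv."""
    header = first_chunk  # header is decided internally, as in dataframe.py
    # If kwargs also contains 'header' or 'index', the same keyword is supplied twice.
    return to_csv(path_or_buf="out.csv", mode="w", header=header, index=False, **kwargs)


# Without user-supplied header/index the call works.
ok_result = export_chunk(True)

# With header passed by the caller, Python raises a TypeError at the call site.
try:
    export_chunk(True, header=False)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The same thing happens with `index=False`, because `index` is also hard-coded in the call.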

astromsshin commented 4 years ago

I suggest the following change in dataframe.py. I have confirmed that it fixes the issue.

{
        dtypes = self[expressions].dtypes
        n_samples = len(self)

        # Record whether the caller supplied header or index through kwargs
        user_header = "header" in kwargs
        user_index = "index" in kwargs

        for i1, i2, chunks in self.evaluate_iterator(expressions, chunk_size=chunk_size, selection=selection):
            progressbar(i1 / n_samples)
            chunk_dict = {col: values for col, values in zip(expressions, chunks)}
            chunk_pdf = pd.DataFrame(chunk_dict)

            if i1 == 0:  # Only the 1st chunk should have a header and the rest will be appended
                mode = 'w'
                header = True
            else:
                mode = 'a'
                header = False

            if user_header and user_index:
                chunk_pdf.to_csv(path_or_buf=path, mode=mode, **kwargs)
            elif user_header:
                chunk_pdf.to_csv(path_or_buf=path, mode=mode, index=False, **kwargs)
            elif user_index:
                chunk_pdf.to_csv(path_or_buf=path, mode=mode, header=header, **kwargs)
            else:
                chunk_pdf.to_csv(path_or_buf=path, mode=mode, header=header, index=False, **kwargs)
}
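One possible simplification of the four-way branch is to merge the per-chunk defaults with the caller's kwargs in a single place. The sketch below is only an illustration: `build_to_csv_options` is a hypothetical helper, not part of vaex. It also assumes that a caller-supplied `header` should apply to the first chunk only, since with the branches above a user-supplied `header=True` would be written again for every appended chunk.

```python
def build_to_csv_options(chunk_start, **kwargs):
    """Hypothetical helper: merge per-chunk defaults with caller-supplied kwargs."""
    options = dict(kwargs)  # copy so the caller's dict is not mutated
    first_chunk = chunk_start == 0
    # Assumption: the caller's header choice applies to the first chunk only;
    # appended chunks never repeat the header.
    options["header"] = options.get("header", True) if first_chunk else False
    # Keep index=False as the default unless the caller asked otherwise.
    options.setdefault("index", False)
    # The first chunk creates the file, later chunks append to it.
    options["mode"] = "w" if first_chunk else "a"
    return options


first = build_to_csv_options(0, header=False, quoting=None)
later = build_to_csv_options(1000, header=False, quoting=None)
default_first = build_to_csv_options(0)
```

Inside the loop, the call would then look like `chunk_pdf.to_csv(path_or_buf=path, **build_to_csv_options(i1, **kwargs))`, so each keyword reaches `to_csv` exactly once.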

maartenbreddels commented 4 years ago

Thanks @astromsshin

@JovanVeljanoski what do you think? Makes sense, right?

dkipping commented 3 years ago

Hi @maartenbreddels and @JovanVeljanoski, have you had a chance to decide whether this will be integrated? I am currently facing the same error when trying to set index=False in the export_csv method.

Thanks!

JovanVeljanoski commented 3 years ago

Coming soon!