vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Converting CSV to HDF5 with vaex #775

Open · Bchi1994 opened this issue 4 years ago

Bchi1994 commented 4 years ago

Hello,

I am working with some large data and am using vaex to convert a csv file to an hdf5 file. Unfortunately, I am having issues with the code and was hoping someone could please help!

Here is the code:

import vaex as vs

df = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', convert=True, chunk_size=5_000_000)

Unfortunately, it is only outputting a 1KB hdf5 file that is empty. It should be closer to 1GB as that is the size of the CSV file. Any help would be greatly appreciated.

Thanks!

JovanVeljanoski commented 4 years ago

Hi @Bchi1994, thanks for reaching out.

I assume you let the process finish, and when it is done you get no error message, right?

Could you please try to read just a single chunk of the csv file, to see if everything is ok?

df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrow=5_000_000)

If that works, can you try and export it:

df_chunk.export('./test_chunk.hdf5')

And try to read that?

df = vaex.open('./test_chunk.hdf5')

Just to try and narrow down the problem.
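Another quick check along the same lines is to open the HDF5 file that convert=True writes next to the CSV and compare its row count and columns with the source. A minimal sketch, assuming the default output name of the CSV path with .hdf5 appended:

import vaex

# convert=True normally writes the HDF5 next to the CSV, with ".hdf5" appended to the name;
# adjust the path if your vaex version names the output differently
hdf5_path = r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv.hdf5'

df = vaex.open(hdf5_path)
print(len(df))                # should be millions of rows, not 0
print(df.get_column_names())  # should list the columns from the CSV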

(Any other ideas, @byaminov?)

Bchi1994 commented 4 years ago

Error:

TypeError: parser_f() got an unexpected keyword argument 'nrow'

JovanVeljanoski commented 4 years ago

try nrows

Bchi1994 commented 4 years ago

Was able to run with nrows. Export gave me this error:

Unable to create file (unable to open file: name = 'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02/test_chunk.hdf5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 302)

JovanVeljanoski commented 4 years ago

Can you please be more explicit? Did any of the above lines work? Which line raises the error?

Bchi1994 commented 4 years ago

df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrows=5_000_000)

Worked!

df_chunk.export('./test_chunk.hdf5') - resulted in error

JovanVeljanoski commented 4 years ago

ok cool!

So if possible, can you please send a screenshot of the data, and/or the output of df_chunk.dtypes?

Bchi1994 commented 4 years ago

See below. Can I email you a link to google drive? The data is quite large.

Bchi1994 commented 4 years ago

df_chunk.dtypes
Out[3]:
underlying_symbol           <class 'str'>
quote_datetime              <class 'str'>
root                        <class 'str'>
expiration                  <class 'str'>
strike                      float64
option_type                 <class 'str'>
open                        float64
high                        float64
low                         float64
close                       float64
trade_volume                int64
bid_size                    int64
bid                         float64
ask_size                    int64
ask                         float64
underlying_bid              float64
underlying_ask              float64
implied_underlying_price    float64
active_underlying_price     float64
implied_volatility          float64
delta                       float64
gamma                       float64
theta                       float64
vega                        float64
rho                         float64
dtype: object

JovanVeljanoski commented 4 years ago

That is strange... nothing looks out of the ordinary. If the data is not very confidential, is it possible for you to send us a small chunk of it?

something like this:

df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrows=5_000_000)

df_chunk.export_csv('testfile.csv')

And maybe attach that testfile.csv here? Or I can give you my email if you don't want to put it public on github.
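A small sample is usually enough; one way to trim it further before sharing (a sketch, using the df_chunk from above and an arbitrary file name) is to slice off just the first few thousand rows:

# keep only the first 10,000 rows of the chunk and write them out as a CSV to attach
df_small = df_chunk[:10_000]
df_small.export_csv('testfile_small.csv')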

Bchi1994 commented 4 years ago

Smaller batch attached.

CBOEtrimmed2.zip

JovanVeljanoski commented 4 years ago

Thanks. I'll look into it and get back to you.

Bchi1994 commented 4 years ago

Thank you, thank you! Sorry, it should be a csv file, not a txt file; I converted it by accident. Please change the extension to .csv.

Here it is in CSV: CBOEtrimmed2 (2).zip

JovanVeljanoski commented 4 years ago

Hi @Bchi1994

I can read / write the sample you sent me without any problems.

Actually, looking at your code and the error message that appears, are you sure that your export path is valid?

The error states "No such file or directory" when you try to write the file, and your output path seems somewhat inconsistent with the path from which you read the data. Can you perhaps try with a much simpler path just in case?
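For example, a sketch along those lines (the output directory name is just an illustration): build a simple output path explicitly and make sure it exists before exporting:

from pathlib import Path

out_dir = Path(r'C:\temp\vaex_test')        # a simple directory, just for testing
out_dir.mkdir(parents=True, exist_ok=True)  # create it if it does not exist yet

df_chunk.export(str(out_dir / 'test_chunk.hdf5'))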

Otherwise, in your console (where you run pip commands), execute pip show vaex-hdf5 and tell us what it says.

Bchi1994 commented 4 years ago

Ok. So I changed the export path to my desktop and it runs with no errors. However, the exported hdf5 file is tiny (attached as test.zip).

import vaex as vs

df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrows=5_000_000)

df_chunk.export(r'C:\Users\Benjamin Heller\Desktop\test.hdf5')
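One way to check whether an exported file actually contains data (a sketch; it just looks at the size on disk and re-opens the file with vaex):

import os
import vaex

path = r'C:\Users\Benjamin Heller\Desktop\test.hdf5'
print(os.path.getsize(path))  # a real export of 5 million rows should be far larger than ~1 KB

df_check = vaex.open(path)    # this will raise if the file is not valid HDF5
print(len(df_check))          # expected: 5_000_000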

Bchi1994 commented 4 years ago

I am using PyCharm, and I'm not sure how to run pip show vaex-hdf5.

JovanVeljanoski commented 4 years ago

Yeah the file you sent is invalid for some reason.

So how did you install vaex? If you are using conda, on the conda prompt/command line you can type: pip show vaex-hdf5

Seems like an installation issue to me..

Bchi1994 commented 4 years ago

I installed it as a PyCharm package and added it to my Project Interpreter. I am using Python 3.8 (no conda).

Same issue on 2 separate computers, both Windows 10.

JovanVeljanoski commented 4 years ago

Ah, in that case I have no idea. Maybe @byaminov or @xdssio might be able to help. I've not used PyCharm and unfortunately I have no access to a Windows machine.

See if you can google how to check the versions of installed packages via PyCharm, and find out which versions of vaex-core and vaex-hdf5 you have.
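If it is easier, the installed versions can also be printed from a small script run inside PyCharm; a sketch (importlib.metadata is standard library on Python 3.8+; on 3.7, pkg_resources gives the same information):

from importlib.metadata import version

# print the installed versions of the relevant vaex packages
for pkg in ('vaex-core', 'vaex-hdf5'):
    print(pkg, version(pkg))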

The data sample you shared seems clean and should convert without any problems.

Bchi1994 commented 4 years ago

Seems to work on my friend's Mac. Not sure why PyCharm on Windows is having issues.

Bchi1994 commented 4 years ago

Resolved! Vaex does not play nice with Python 3.8; you must use vaex on Python 3.7.

JovanVeljanoski commented 4 years ago

Ah indeed, you are right, on Windows Python 3.8 is not fully supported yet.

maartenbreddels commented 4 years ago

It should be, right? CI runs 3.8 as well. Seems like a bug, maybe also present on other OSes?

(from mobile phone)

maartenbreddels commented 4 years ago

Jovan corrected me: indeed, CI is not yet running on Windows with 3.8: https://github.com/vaexio/vaex/blob/d7c32e046dd3da6eaf773221bd74bdeed2127ab2/.github/workflows/pythonpackage.yml#L21

Keeping this open as a bug.

magicbyte-fe commented 3 years ago

Hi, I am trying more or less the same example on macOS, with Python 3 + vaex 3.0.0.


import os
import shutil
import time

import vaex

ads_path_csv = "test-data/BIG_ADS_NEW.csv"
ads_path_hdf5 = "test-data/hdf5/BIG_ADS_NEW_*.hdf5"
hdf5_ads_dir = 'test-data/hdf5'

def read_big_ads_new_csv_and_save_hdf5():
    """
    Takes 20 seconds to save 10 chunks
    :return:
    """
    start_time = time.time()
    vaex_chunks = vaex.from_csv(ads_path_csv, copy_index=False, chunk_size=100_000, delimiter=';', header='infer')

    # clear out any previous output (shutil.rmtree, since os.remove cannot delete a directory)
    if os.path.exists(hdf5_ads_dir):
        shutil.rmtree(hdf5_ads_dir)

    if not os.path.exists(hdf5_ads_dir):
        os.mkdir(hdf5_ads_dir)

    for idx, chunk in enumerate(vaex_chunks):
        export_path = ads_path_hdf5.replace("*", str(idx))
        print(f"export chunk {idx} to {export_path}")
        print(chunk.head(1))
        chunk.export(export_path)

    stop_time = time.time()
    print(f"read_big_ads_new_csv_and_save_hdf5: {stop_time - start_time} seconds")

print(chunk.head(1)) looks ok; the input CSV file contains 1M records, and I can see that 10 hdf5 files were created. Sample output:

export chunk 11 to test-data/hdf5/BIG_ADS_NEW_11.hdf5

Now I try to read it:


def read_big_ads_new_hdf5():
    start_time = time.time()
    print(f"Opening {ads_path_hdf5}")
    df = vaex.open(ads_path_hdf5)
    print(f"Done opening {ads_path_hdf5}")
    print(df.head(3))
    stop_time = time.time()
    print(f"read_big_ads_new_hdf5: {stop_time - start_time} seconds")

Output:

Opening test-data/hdf5/BIG_ADS_NEW_*.hdf5
Done opening test-data/hdf5/BIG_ADS_NEW_*.hdf5

And then:

Traceback (most recent call last):
  File "/Users/my_user/develop/test_vaex/test_vaex/ml-backend-vaex/prepare-model-input.py", line 63, in <module>
    if __name__ == '__main__':
  File "/Users/my_user/develop/test_vaex/test_vaex/ml-backend-vaex/prepare-model-input.py", line 59, in main
    read_big_ads_new_csv_and_save_hdf5()
  File "/Users/my_user/develop/test_vaex/test_vaex/ml-backend-vaex/prepare-model-input.py", line 52, in read_big_ads_new_hdf5
    print(f"Done opening {ads_path_hdf5}")
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3699, in __str__
    return self._head_and_tail_table(format='plain')
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3462, in _head_and_tail_table
    return self._as_table(0, N, format=format)
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3599, in _as_table
    parts = table_part(i1, i2, parts)
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3573, in table_part
    df = self[k1:k2]
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 4626, in __getitem__
    df = self.trim()
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3859, in trim
    df = self if inplace else self.copy()
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 5036, in copy
    df.add_column(name, column, dtype=self._dtypes_override.get(name))
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 6053, in add_column
    super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
  File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 2942, in add_column
    raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
ValueError: array is of length 100000, while the length of the DataFrame is 3

What am I doing wrong?

PS: Exactly the same error is raised when I export to Arrow and read the Arrow files back, with both

vaex-hdf5==0.6.0
vaex-arrow==0.5.1

installed.

The same code, without any changes, works with vaex==4.0.0a6.
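For reference, an alternative that avoids the manual chunk loop is to let from_csv do the chunked conversion and concatenation itself, as in the first post in this thread. A sketch, with an illustrative output path (the intermediate files it writes may differ between vaex versions):

import vaex

ads_path_csv = "test-data/BIG_ADS_NEW.csv"

# convert=<path> reads the CSV in chunks, writes intermediate HDF5 files,
# and returns a DataFrame backed by the combined HDF5 file
df = vaex.from_csv(ads_path_csv, delimiter=';', chunk_size=100_000,
                   convert='test-data/BIG_ADS_NEW.hdf5')
print(len(df))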

ghost commented 3 years ago

How can I download vaex v4?