Open Bchi1994 opened 4 years ago
Hi @Bchi1994, thanks for reaching out.
I assume you let the process finish, and when it is done you get no error message, right?
Could you please try to read just a single chunk of the csv file, to see if everything is ok?
df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrow=5_000_000)
If that works, can you export it:
df_chunk.export('./test_chunk.hdf5')
And try to read that?
df = vaex.open('./test_chunk.hdf5')
Just to try and narrow down the problem.
(Any other ideas, @byaminov?)
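Before involving vaex at all, a stdlib-only sanity check that the CSV itself parses can help rule out a corrupt or truncated file. This is just a sketch; the path below is a placeholder for the real file:

```python
import csv
import itertools

def peek_csv(path, n=5):
    """Read the header and the first n data rows with the stdlib csv module.
    Raises if the file cannot be opened or decoded."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(itertools.islice(reader, n))
    return header, rows

# Example (replace with the real CSV path):
# header, rows = peek_csv(r"D:\...\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv")
```

If this already fails, the problem is the file or its encoding, not vaex.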
Error:
TypeError: parser_f() got an unexpected keyword argument 'nrow'
try nrows
Was able to run with nrows. Export gave me this error:
Unable to create file (unable to open file: name = 'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02/test_chunk.hdf5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 302)
can you please be more explicit? Did any of the above lines work? Which is the line that raises the error?
df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrows=5_000_000)
Worked!
df_chunk.export('./test_chunk.hdf5') - resulted in error
ok cool!
So if possible, can you please send a screenshot of the data, and/or the output of
df_chunk.dtypes
?
See below. Can I email you a link to google drive? The data is quite large.
df_chunk.dtypes
Out[3]:
underlying_symbol           <class 'str'>
quote_datetime              <class 'str'>
root                        <class 'str'>
expiration                  <class 'str'>
strike                      float64
option_type                 <class 'str'>
open                        float64
high                        float64
low                         float64
close                       float64
trade_volume                int64
bid_size                    int64
bid                         float64
ask_size                    int64
ask                         float64
underlying_bid              float64
underlying_ask              float64
implied_underlying_price    float64
active_underlying_price     float64
implied_volatility          float64
delta                       float64
gamma                       float64
theta                       float64
vega                        float64
rho                         float64
dtype: object
That is strange... nothing looks out of the ordinary. If the data is not very confidential, is it possible for you to send us a small chunk of it?
something like this:
df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrows=5_000_000)
df_chunk.export_csv('testfile.csv')
And maybe attach that testfile.csv here? Or I can give you my email if you don't want to put it public on github.
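If vaex's own export also misbehaves, a stdlib-only way to cut a small, shareable sample off the top of the big file is to copy just the first lines. A minimal sketch; the filenames are placeholders:

```python
import itertools

def head_of_file(src, dst, n_lines=10_000):
    """Copy the first n_lines lines of src (header included) to dst."""
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(itertools.islice(fin, n_lines))

# head_of_file("big.csv", "testfile.csv", n_lines=10_000)
```

Since the first line of the CSV is the header, the resulting file is itself a valid CSV sample.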
Smaller batch attached.
Thanks. I'll look into it and get back to you.
Thank you! Sorry, it should be a csv file, not a txt file; I converted it by accident. Please change the extension to csv.
Here is it in CSV. CBOEtrimmed2 (2).zip
Hi @Bchi1994
I can read and write the file you sent me without any problems.
Actually, looking at your code and the error message that appears, are you sure that your export path is valid?
The error states "No such file or directory" when you try to write the file, and your output path seems somewhat inconsistent with the path from which you read the data. Can you perhaps try with a much simpler path just in case?
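For what it's worth, the failing path in the error message looks as though the relative './test_chunk.hdf5' got combined with the CSV path rather than resolved against the working directory. One way to sidestep that ambiguity is to build the output path explicitly with pathlib, e.g. writing the HDF5 next to the CSV with the same stem. A sketch (the path is illustrative, and `PureWindowsPath` is used only so the example runs on any OS):

```python
from pathlib import PureWindowsPath

# Illustrative input path from the thread:
csv_path = PureWindowsPath(
    r"D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV"
    r"\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv"
)

# Same directory, same stem, different suffix:
hdf5_path = csv_path.with_suffix(".hdf5")

# df_chunk.export(str(hdf5_path))  # hypothetical vaex call from above
print(hdf5_path)
```

On a real Windows machine you would use `pathlib.Path` directly and get the same behaviour.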
Otherwise, in your console (where you run pip commands) execute:
pip show vaex-hdf5
and tell us what it says?
Ok. So I changed the export to my desktop and it runs with no errors. However, the exported hdf5 file (test.zip, attached) is tiny.
import vaex as vs
df_chunk = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', nrows=5_000_000)
df_chunk.export(r'C:\Users\Benjamin Heller\Desktop\test.hdf5')
I am using PyCharm, so I'm not sure how to run the pip show vaex-hdf5 command.
Yeah the file you sent is invalid for some reason.
So how did you install vaex? If you are using conda, on the conda prompt/command line you can type: pip show vaex-hdf5
Seems like an installation issue to me..
I installed as a Pycharm package and added to my Project Interpreter. I am using Python 3.8 (no conda).
Same issue on 2 separate computers. Both windows 10
Ah, in that case I have no idea. Maybe @byaminov @xdssio might be able to help. I've not used PyCharm and unfortunately I have no access to a windows machine.
See if you can google and find out how to check versions of installed packages via pycharm, and see what version of vaex-core and vaex-hdf5 you have.
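If finding the right PyCharm menu is a hassle, the installed versions can also be checked from inside the interpreter with the standard library (`importlib.metadata` needs Python 3.8+; on 3.7 the `importlib_metadata` backport offers the same API). A sketch:

```python
from importlib import metadata

def pkg_version(name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

for pkg in ("vaex-core", "vaex-hdf5"):
    print(pkg, "->", pkg_version(pkg))
```

Running this in the PyCharm Python console prints the versions (or `None` if a piece of vaex is not installed in that interpreter), which is exactly what `pip show vaex-hdf5` would tell you.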
The sample of data you shared looks clean and should convert without any problems.
Seems to work on my friend's Mac. Not sure why PyCharm on Windows is having issues.
Resolved! Vaex does not play nice with Python 3.8; it must be used on Python 3.7.
Ah indeed, you are right, on windows python3.8 is not fully supported yet.
It should be, right? CI runs 3.8 as well. Seems like a bug, maybe also present on other OSes?
Jovan corrected me, indeed, CI is not yet running on windows with 3.8 https://github.com/vaexio/vaex/blob/d7c32e046dd3da6eaf773221bd74bdeed2127ab2/.github/workflows/pythonpackage.yml#L21
Keeping this open as a bug.
Hi, I'm trying more or less the same example on macOS, with Python 3 + vaex 3.0.0.
ads_path_csv = "test-data/BIG_ADS_NEW.csv"
ads_path_hdf5 = "test-data/hdf5/BIG_ADS_NEW_*.hdf5"
hdf5_ads_dir = 'test-data/hdf5'
import os
import shutil
import time

import vaex

def read_big_ads_new_csv_and_save_hdf5():
    """
    Takes 20 seconds to save 10 chunks
    :return:
    """
    start_time = time.time()
    vaex_chunks = vaex.from_csv(ads_path_csv, copy_index=False, chunk_size=100_000, delimiter=';', header='infer')
    if os.path.exists(hdf5_ads_dir):
        shutil.rmtree(hdf5_ads_dir)  # os.remove() raises on a directory; rmtree removes it and its contents
    os.mkdir(hdf5_ads_dir)
    for idx, chunk in enumerate(vaex_chunks):
        export_path = ads_path_hdf5.replace("*", str(idx))
        print(f"export chunk {idx} to {export_path}")
        print(chunk.head(1))
        chunk.export(export_path)
    stop_time = time.time()
    print(f"read_big_ads_new_csv_and_save_hdf5: {stop_time - start_time} seconds")
    print(chunk.head(1))
Looks ok: the input CSV file contains 1M records, and I can see that the hdf5 chunk files were created.
Sample output:
export chunk 11 to test-data/hdf5/BIG_ADS_NEW_11.hdf5
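A side note before opening the chunks with a wildcard: plain lexicographic ordering (what a naive glob gives you) puts `_10` before `_2`, so if the order of the chunks matters it is worth sorting them by their numeric index. A small stdlib sketch, using placeholder filenames matching the pattern above:

```python
import re

def sort_chunks(paths):
    """Sort chunk file names by the trailing integer before '.hdf5'."""
    def key(p):
        m = re.search(r"_(\d+)\.hdf5$", p)
        return int(m.group(1)) if m else -1
    return sorted(paths, key=key)

files = ["BIG_ADS_NEW_10.hdf5", "BIG_ADS_NEW_2.hdf5", "BIG_ADS_NEW_0.hdf5"]
print(sort_chunks(files))
# ['BIG_ADS_NEW_0.hdf5', 'BIG_ADS_NEW_2.hdf5', 'BIG_ADS_NEW_10.hdf5']
```

Whether vaex's wildcard handling sorts this way internally is not something this sketch assumes; it only shows how to get a deterministic order when passing an explicit file list.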
Now I try to read it:
def read_big_ads_new_hdf5():
    start_time = time.time()
    print(f"Opening {ads_path_hdf5}")
    df = vaex.open(ads_path_hdf5)
    print(f"Done opening {ads_path_hdf5}")
    print(df.head(3))
    stop_time = time.time()
    print(f"read_big_ads_new_hdf5: {stop_time - start_time} seconds")
Output:
Opening test-data/hdf5/BIG_ADS_NEW_*.hdf5
Done opening test-data/hdf5/BIG_ADS_NEW_*.hdf5
And then:
Traceback (most recent call last):
File "/Users/my_user/develop/test_vaex/test_vaex/ml-backend-vaex/prepare-model-input.py", line 63, in <module>
if __name__ == '__main__':
File "/Users/my_user/develop/test_vaex/test_vaex/ml-backend-vaex/prepare-model-input.py", line 59, in main
read_big_ads_new_csv_and_save_hdf5()
File "/Users/my_user/develop/test_vaex/test_vaex/ml-backend-vaex/prepare-model-input.py", line 52, in read_big_ads_new_hdf5
print(f"Done opening {ads_path_hdf5}")
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3699, in __str__
return self._head_and_tail_table(format='plain')
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3462, in _head_and_tail_table
return self._as_table(0, N, format=format)
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3599, in _as_table
parts = table_part(i1, i2, parts)
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3573, in table_part
df = self[k1:k2]
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 4626, in __getitem__
df = self.trim()
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 3859, in trim
df = self if inplace else self.copy()
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 5036, in copy
df.add_column(name, column, dtype=self._dtypes_override.get(name))
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 6053, in add_column
super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
File "/Users/my_user/.pyenv/versions/3.7.3/lib/python3.7/site-packages/vaex/dataframe.py", line 2942, in add_column
raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
ValueError: array is of length 100000, while the length of the DataFrame is 3
What am I doing wrong?
PS: Exactly the same error is raised when I try to export to arrow and read the arrow files back. Both vaex-hdf5==0.6.0 and vaex-arrow==0.5.1 are installed.
The same code without any changes works with vaex==4.0.0a6.
How can I install vaex v4?
Hello,
I am working with some large data and am using vaex to convert a csv file to an hdf5 file. Unfortunately, I am having issues with the code and was hoping someone could please help!
Here is the code:
import vaex as vs
df = vs.from_csv(r'D:\CBOE - 2020.05.29\item_000017476\Zip Files\Test_CSV\UnderlyingOptionsIntervalsCalcs_60sec_2005-02.csv', convert=True, chunk_size=5_000_000)
Unfortunately, it is only outputting a 1KB hdf5 file that is empty. It should be closer to 1GB as that is the size of the CSV file. Any help would be greatly appreciated.
Thanks!
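As a quick check for the tiny-output symptom described above, comparing the sizes of the input and output files from Python makes the problem obvious right after conversion. A minimal sketch; the paths are placeholders:

```python
import os

def size_ratio(src, dst):
    """Rough sanity check: output file size as a fraction of input size."""
    return os.path.getsize(dst) / os.path.getsize(src)

# r = size_ratio("input.csv", "output.hdf5")
# A healthy HDF5 conversion should be in the same order of magnitude as
# the CSV, not ~1 KB produced from a ~1 GB input.
```

Asserting on such a ratio in a conversion script catches silent truncation before the broken file is used downstream.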