vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Issue vaex open from AWS S3 #704

Closed pedrohesch closed 3 years ago

pedrohesch commented 4 years ago

I am trying to open a file with vaex.open as follows:

df2 = vaex.open('s3://viacao-sampaio/HDF5/master_df.hdf5?profile_name=pedroAI')

but I am receiving the errors below.

Before I copy the error message here, I would like to make three notes:
1. When I vaex.open the same file from my local computer, it works.
2. When I vaex.open a small file with the same line of code, it works.
3. This file, master_df.hdf5, is 10 GB with more than 40 million rows.

ERROR:MainThread:vaex:error evaluating: CODIGO_DESTINO at rows 40800736-40800741
Traceback (most recent call last):
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\dataframe.py", line 3523, in table_part
    values[name] = df.evaluate(name)
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\dataframe.py", line 5120, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, internal=internal, parallel=parallel, chunk_size=chunk_size)
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\dataframe.py", line 5261, in _evaluate_implementation
    result = [finalize_result(k) for k in expressions]
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\dataframe.py", line 5261, in <listcomp>
    result = [finalize_result(k) for k in expressions]
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\dataframe.py", line 5249, in finalize_result
    values = to_numpy(chunks[0])
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\array_types.py", line 9, in to_numpy
    x = x.to_numpy()
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\column.py", line 414, in to_numpy
    return self.string_sequence.to_numpy()
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\column.py", line 370, in string_sequence
    self._string_sequence = string_type(_asnumpy(self.bytes), _asnumpy(self.indices), self.length, self.offset, _asnumpy(self.null_bitmap), self.null_offset)
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\column.py", line 326, in _asnumpy
    return ar.to_numpy()
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\file\column.py", line 73, in to_numpy
    return self[0:self.length]
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\file\column.py", line 159, in __getitem__
    ar = file._as_numpy(offset, byte_length, self.dtype)
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\file\cache.py", line 143, in _as_numpy
    self._ensure_cached(offset, offset+byte_length)
  File "C:\Users\pedro\Anaconda3\lib\site-packages\vaex\file\cache.py", line 155, in _ensure_cached
    self.file.seek(start_blocked)
  File "C:\Users\pedro\Anaconda3\lib\site-packages\s3fs\core.py", line 1293, in seek
    raise ValueError('Seek before start of file')
ValueError: Seek before start of file

(The same error and traceback are logged a second time, unchanged.)

How can I fix this? Thanks in advance.
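For readers following the traceback: the last frame is s3fs rejecting the position that vaex's cache layer passed to seek. The message suggests the requested absolute position was negative. The class below is a hypothetical stand-in written for illustration (it is not the s3fs implementation, and `StrictSeekFile` is an invented name); it only sketches the kind of bounds check that produces this error.

```python
import io


class StrictSeekFile:
    """Hypothetical stand-in for a remote file object; NOT the s3fs class."""

    def __init__(self, size):
        self.size = size  # total file size in bytes
        self.loc = 0      # current absolute position

    def seek(self, loc, whence=io.SEEK_SET):
        # Resolve the requested position to an absolute offset.
        if whence == io.SEEK_SET:
            nloc = loc
        elif whence == io.SEEK_CUR:
            nloc = self.loc + loc
        elif whence == io.SEEK_END:
            nloc = self.size + loc
        else:
            raise ValueError(f"invalid whence ({whence})")
        # A negative absolute offset is rejected with the message from the report.
        if nloc < 0:
            raise ValueError("Seek before start of file")
        self.loc = nloc
        return self.loc
```

Under this reading, the useful thing to inspect is the value of start_blocked in vaex\file\cache.py at the moment of the failure, since that is the argument passed to seek.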

maartenbreddels commented 4 years ago

Hi Pedro,

Thanks for opening the issue. Can you share the output of pip show s3fs?

Also, could you check that you have enough disk space? You could also try removing or moving ~/.vaex/file-cache/s3/ to see if that helps. If you move it rather than delete it, you can restore it later to reproduce the error, which would help us track it down.

Regards,

Maarten
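The two checks suggested above (free disk space, and moving the cache directory aside so it can be restored later) could be scripted roughly as follows. This is a minimal sketch using only the standard library; the helper names `free_gib` and `move_cache_aside` and the backup naming scheme are assumptions made for illustration, not part of vaex.

```python
import shutil
import time
from pathlib import Path


def free_gib(path):
    """Return the free space, in GiB, on the disk that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def move_cache_aside(cache_dir):
    """Rename the cache directory to a timestamped backup and return the new path.

    Returns None if the directory does not exist. Renaming (instead of
    deleting) keeps the old cache available to reproduce the error later.
    """
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = cache_dir.with_name(f"{cache_dir.name}.bak-{stamp}")
    cache_dir.rename(backup)
    return backup


# Example usage (commented out so nothing is moved by accident):
# print(f"free space: {free_gib(Path.home()):.1f} GiB")
# print("cache moved to:", move_cache_aside("~/.vaex/file-cache/s3"))
```

To restore the old cache, rename the backup directory back to its original name.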

pedrohesch commented 4 years ago

pip show s3fs:

Name: s3fs
Version: 0.2.2
Summary: Convenient Filesystem interface over S3
Home-page: http://github.com/dask/s3fs/
Author: None
Author-email: None
License: BSD
Location: c:\users\pedro\anaconda3\lib\site-packages
Requires: boto3, six, botocore
Required-by: vaex-hdf5

pedrohesch commented 4 years ago

I removed everything from ~/.vaex/file-cache/s3/, which left 16 GB of free disk space. The file is 10 GB. But I am still receiving the same error.

maartenbreddels commented 4 years ago

I'm running the same version here. Could you contact me privately at maartenbreddels@gmail.com? Maybe we can find a way to give me access to this file.

JovanVeljanoski commented 3 years ago

Closing as stale. Please re-open if needed.