vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks

[BUG-REPORT] CAN'T READ PARQUET FROM AMAZON S3 ON AN EC2 INSTANCE #1926

Open ivanachillee opened 2 years ago

ivanachillee commented 2 years ago

Description

I can't load data from S3 by doing this:

import vaex
vaex.open("s3://myfile.parquet")

I get the following error:

error opening 's3://data-lake.eu-central-1/v1/reporting_tables/reporting_tables/trackingevents/'    __init__.py:259
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/__init__.py", line 232, in open
    ds = vaex.dataset.open(path, fs_options=fs_options, fs=fs, **kwargs)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/dataset.py", line 73, in open
    return opener.open(path, fs_options=fs_options, fs=fs, *args, **kwargs)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/opener.py", line 44, in open
    return open_parquet(path, *args, **kwargs)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/dataset.py", line 345, in open_parquet
    return DatasetParquet(path, fs_options=fs_options, fs=fs, partitioning=partitioning, kwargs=kwargs)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/dataset.py", line 197, in __init__
    super().__init__(max_rows_read=max_rows_read)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/dataset.py", line 26, in __init__
    self._create_columns()
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/dataset.py", line 227, in _create_columns
    super()._create_columns()
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/dataset.py", line 29, in _create_columns
    self._create_dataset()
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/arrow/dataset.py", line 232, in _create_dataset
    self._arrow_ds = pyarrow.dataset.dataset(source, filesystem=file_system, partitioning=self.partitioning)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/pyarrow/dataset.py", line 667, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/pyarrow/dataset.py", line 420, in _filesystem_dataset
    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
  File "pyarrow/_dataset.pyx", line 1854, in pyarrow._dataset.FileSystemDatasetFactory.__init__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1137, in pyarrow._fs._cb_get_file_info_selector
  File "/home/ubuntu/.pyenv/versions/3.7.5/lib/python3.7/site-packages/vaex/file/cache.py", line 97, in get_file_info_selector
    return self.fs.get_file_info_selector(*args, **kwargs)
AttributeError: 'pyarrow._s3fs.S3FileSystem' object has no attribute 'get_file_info_selector'

Software information

JovanVeljanoski commented 2 years ago

Which pyarrow version do you have?

ivanachillee commented 2 years ago

@JovanVeljanoski I have pyarrow-7.0.0 which was installed automatically when I ran pip install vaex

JovanVeljanoski commented 2 years ago

We do not set an upper limit on packages usually. I see that only recently pyarrow 7.0.0 was released.

Can you please downgrade pyarrow to 6.0.0 or 6.0.1 and try again?

JovanVeljanoski commented 2 years ago

Although I just checked, and pyarrow._s3fs.S3FileSystem is there in the latest version of pyarrow, so this is a bit strange. It is still worth downgrading and trying again, just in case.
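One reading of the traceback above, offered as an assumption rather than a confirmed diagnosis: the failure is a missing method, not a missing class. In pyarrow, `get_file_info_selector` belongs to the Python-side `FileSystemHandler` interface, while native filesystems such as `S3FileSystem` expose `get_file_info`, which accepts either paths or a `FileSelector`. The sketch below uses stand-in classes (no pyarrow required) and a hypothetical helper, `list_with_selector`, to show a delegation that tolerates both shapes. It is not vaex's actual code.

```python
class NativeLikeFS:
    """Stand-in for a native pyarrow filesystem: only get_file_info exists."""

    def get_file_info(self, paths_or_selector):
        return [f"info:{paths_or_selector}"]


class HandlerLikeFS:
    """Stand-in for a FileSystemHandler-style object with a selector method."""

    def get_file_info_selector(self, selector):
        return [f"handler-info:{selector}"]


def list_with_selector(fs, selector):
    """Hypothetical helper: use the handler method if present, else get_file_info."""
    method = getattr(fs, "get_file_info_selector", None)
    if method is not None:
        return method(selector)
    # Native-style filesystems take the selector through get_file_info instead.
    return fs.get_file_info(selector)


print(list_with_selector(NativeLikeFS(), "bucket/prefix/"))
print(list_with_selector(HandlerLikeFS(), "bucket/prefix/"))
```

Calling `fs.get_file_info_selector(...)` unconditionally on the first stand-in would raise the same kind of `AttributeError` as in the report.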

ivanachillee commented 2 years ago

I downgraded to both 6.0.0 and 6.0.1 and got the exact same error

JovanVeljanoski commented 2 years ago

Oh btw, can you share your versions of s3fs and fsspec please?

ivanachillee commented 2 years ago

fsspec==2021.11.0 s3fs==2021.11.0
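Since much of this thread is about comparing package versions, a small helper can collect them in one go. This is a minimal sketch, assuming Python 3.8 or newer for `importlib.metadata` (on the reporter's Python 3.7 the `importlib_metadata` backport or `pip freeze` would be needed instead). The function name `installed_versions` is made up for illustration.

```python
from importlib import metadata


def installed_versions(packages=("vaex-core", "pyarrow", "s3fs", "fsspec")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


for name, version in installed_versions().items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Pasting the printed lines into a bug report gives maintainers the same information as the individual `__version__` checks.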

JovanVeljanoski commented 2 years ago

Great! I think those libraries often make breaking changes, so it is hard to keep up :(

Can you try installing s3fs==0.5.2 and fsspec==0.8.7 (although any 0.8.x should work)?

ivanachillee commented 2 years ago

doesn't work, same error.

ivanachillee commented 2 years ago

I still haven't been able to try out Vaex at all because I can't even load the data in the first place 😕

JovanVeljanoski commented 2 years ago

I can't reproduce your error..

can you install everything in a conda/mamba environment perhaps?

JovanVeljanoski commented 2 years ago

If you want to try vaex, you can download the data the old-fashioned way and use it like that until we sort this out. There is also a built-in example dataset:

import vaex
df = vaex.example()

rcrafaeldelrey commented 2 years ago

I have the same error. I created an environment with just vaex installed and still get it.

rey-eb commented 2 years ago

Was there any luck with this? I get a similar error too, @rcrafaeldelrey.

JovanVeljanoski commented 2 years ago

I was just testing this, and I am afraid I cannot reproduce the issue.

For reference, I have the latest version of vaex installed via conda-forge.

In case this is a version problem, here is my setup:

import vaex
import s3fs
import fsspec

vaex.__version__
s3fs.__version__
fsspec.__version__

# Output
{'vaex-core': '4.9.2',
 'vaex-viz': '0.5.2',
 'vaex-hdf5': '0.12.2',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.1',
 'vaex-jupyter': '0.8.0',
 'vaex-ml': '0.17.0'}

'2022.5.0' # s3fs

'2022.5.0' # fsspec

The test was like this. In a notebook, I have:

import os
os.environ['AWS_PROFILE'] = 'my-profile-name'  # This will be your profile name

import vaex

df1 = vaex.open('s3://jovans-bucket/jovan/titanic.parquet')
df2 = vaex.open('s3://jovans-bucket/jovan/titanic.hdf5')

And everything works as expected.

Any errors you might be experiencing are likely due to authentication issues, and I am not an expert there. Using s3fs directly is perhaps one way to check whether you are authenticated correctly.
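The suggestion to test credentials with s3fs directly could look like the sketch below. It assumes `s3fs` is installed and that `S3FileSystem()` picks up credentials from the default chain (environment variables, a profile, or an instance role). The helper name `check_s3_access` and the optional `fs` argument are illustrative additions so the logic can be exercised with a stub.

```python
def check_s3_access(path, fs=None):
    """Try to list `path` and return (ok, detail) instead of raising."""
    if fs is None:
        import s3fs  # imported lazily so a stub filesystem can be passed in

        fs = s3fs.S3FileSystem()  # relies on the default credential chain
    try:
        entries = fs.ls(path)
    except Exception as exc:  # broad on purpose: we only report the failure
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"{len(entries)} entries visible"
```

For example, `check_s3_access("s3://jovans-bucket/jovan/")` would return `(True, ...)` if listing succeeds. A success here while `vaex.open()` still fails would point away from authentication as the cause.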

ivanachillee commented 2 years ago

@JovanVeljanoski authentication is embedded by default on an EC2 instance when communicating to S3 so that can't possibly be the issue. Especially given that other libraries such as pandas work perfectly fine in attempting to read files from S3 under the exact same circumstances.

Can you try installing Vaex through pip directly instead of in a Conda environment? You're attempting to reproduce the issue in quite a different way from how the error was originally produced, which might be why you are not arriving at the same result.

Process: brand new EC2 instance -> pip install vaex -> import s3fs/fsspec -> attempt to vaex.open() a file from an S3 bucket

rey-eb commented 2 years ago

@JovanVeljanoski have you tried this? It works for me on files that fail with vaex.open():

import pyarrow.parquet as pq
import vaex

table1 = pq.read_table('s3://relative_path_to_folder')
df1 = vaex.from_arrow_table(table1)

JovanVeljanoski commented 2 years ago

@ivanachillee I just tried with venv, installing everything with pip, and it works as before. I don't have access to an EC2 instance myself, and even the S3 bucket is not really mine, but I use one for testing. I am more on the GCP side of things; perhaps @maartenbreddels can give more insight into what is happening with the S3 stuff.

@rey-eb Vaex is using pyarrow under the hood for the cloud I/O, so I don't expect it to fail where pyarrow would not. Do you notice that vaex.open() works for some files but not for others?

rey-eb commented 2 years ago

@JovanVeljanoski

@rey-eb Vaex is using pyarrow under the hood for the cloud I/O, so I don't expect it to fail where pyarrow would not. Do you notice that vaex.open() works for some files but not for others?

yes exactly. I have raised the details of this issue on the slack here.

JovanVeljanoski commented 2 years ago

Did you solve the problem @ivanachillee, or did you give up on it?

I'd like to figure it out so others don't get stuck as well, but unfortunately I can't reproduce the problem. If you or @rey-eb can provide some more details on how to reproduce this, we can definitely take a look.

If I understood correctly @rey-eb mentioned that she was able to access certain files but not others.. (via slack). Is that correct?

rey-eb commented 2 years ago

@JovanVeljanoski I can give you the output I get (I generated these just now). The error is raised when I do df.head(2), not df.shape. Also, the second way is much slower. What other info can I give you that would be helpful?

df = vaex.open('s3://PATH/LOAD00000001.parquet')
print('shape:', df.shape)
print(df.head(2))

Screenshot 2022-07-04 at 10 29 01

import pyarrow.parquet as pq
table = pq.read_table('s3://PATH/LOAD00000001.parquet')
df = vaex.from_arrow_table(table)
print('shape:', df.shape)
print(df.head(2))

Screenshot 2022-07-04 at 10 48 36

rcrafaeldelrey commented 2 years ago

I realized that vaex works if you provide a list of the parquet files you are reading instead of their parent folder. I'm using awswrangler to get a list of the parquet files in the S3 bucket I want to read, and then calling vaex.open(list_of_files).
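The workaround described above can be sketched as follows. It assumes `awswrangler` and `vaex` are installed and that `wr.s3.list_objects` returns full `s3://` paths. The helper name `open_parquet_prefix` and its injectable `list_objects` and `open_fn` arguments are illustrative, added so the logic can be tried without AWS access.

```python
def open_parquet_prefix(prefix, list_objects=None, open_fn=None):
    """Open every *.parquet object under `prefix` by passing an explicit file list."""
    if list_objects is None:
        import awswrangler as wr  # assumed installed; lists objects under an S3 prefix

        list_objects = wr.s3.list_objects
    if open_fn is None:
        import vaex  # assumed installed; vaex.open is given a list of paths

        open_fn = vaex.open
    # Keep only parquet files, skipping markers such as _SUCCESS.
    files = sorted(p for p in list_objects(prefix) if p.endswith(".parquet"))
    if not files:
        raise FileNotFoundError(f"no .parquet objects found under {prefix!r}")
    return open_fn(files)
```

With the defaults, `open_parquet_prefix("s3://PATH/")` would list the prefix and hand the individual file paths to `vaex.open`, which avoids opening the folder itself.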