Open ivanachillee opened 2 years ago
Which pyarrow version do you have?
@JovanVeljanoski I have pyarrow-7.0.0, which was installed automatically when I ran pip install vaex
We usually do not set an upper version limit on our dependencies. I see that pyarrow 7.0.0 was released only recently.
Can you please downgrade pyarrow to 6.0.0 or 6.0.1 and try again?
Although I just checked, and pyarrow._s3fs.S3FileSystem is there in the latest version of pyarrow, so this is a bit strange.
It's still worth downgrading and trying again, just in case.
I downgraded to both 6.0.0 and 6.0.1 and got the exact same error
Oh btw, can you share your versions of s3fs and fsspec please?
fsspec==2021.11.0 s3fs==2021.11.0
Great! I think those libraries often make breaking changes, so it is hard to keep up :(
Can you try installing s3fs==0.5.2 and fsspec==0.8.7 (although any 0.8.x should work)?
doesn't work, same error.
I still haven't been able to try out Vaex at all, because I can't even load the data in the first place 😕
I can't reproduce your error..
can you install everything in a conda/mamba environment perhaps?
If you want to try vaex, you can download the data the old-fashioned way and use it like that until we sort this out. There is also a built-in example dataset:
import vaex
df = vaex.example()
I have the same error. I created an environment with just vaex installed and got the same error.
Was there any luck with this? I get a similar error too @rcrafaeldelrey
I was just testing this, and I am afraid I can not reproduce the issue.
For reference, I have the latest version of vaex installed via conda-forge.
In case of version problems here is my setup:
import vaex
import s3fs
import fsspec
vaex.__version__
s3fs.__version__
fsspec.__version__
# Output
{'vaex-core': '4.9.2',
'vaex-viz': '0.5.2',
'vaex-hdf5': '0.12.2',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.1',
'vaex-jupyter': '0.8.0',
'vaex-ml': '0.17.0'}
'2022.5.0' # s3fs
'2022.5.0' # fsspec
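For anyone comparing setups, here is a minimal sketch that collects the same version information in one call, using only the standard library. The helper name installed_versions and the package list are assumptions for illustration, not part of vaex; pyarrow is included because it came up earlier in the thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list: the libraries discussed in this thread.
    report = installed_versions(["vaex-core", "pyarrow", "s3fs", "fsspec"])
    for name, version in report.items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Because it reads the installed distribution metadata rather than importing each package, it also works when one of the imports itself is what fails.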
The test was like this: in a notebook, I have
import os
os.environ['AWS_PROFILE'] = 'my-profile-name' # This will be your profile name
import vaex
df1 = vaex.open('s3://jovans-bucket/jovan/titanic.parquet')
df2 = vaex.open('s3://jovans-bucket/jovan/titanic.hdf5')
And I get everything to work as expected..
Any errors you might be experiencing are likely due to authentication issues, and I am not an expert there.
Perhaps using s3fs directly is one way to check if you are authenticated correctly.
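A minimal sketch of such a check, assuming s3fs is installed and that listing a prefix is a fair proxy for "authenticated correctly". The helper names and the example path are hypothetical; only s3fs.S3FileSystem() and its ls() method are real library calls.

```python
def check_s3_access(fs, path):
    """Try to list `path` on the filesystem `fs`; return (ok, detail)."""
    try:
        entries = fs.ls(path)
    except Exception as exc:  # broad on purpose: we only report the failure
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"{len(entries)} entries visible"


def check_with_default_credentials(path):
    """Run the check with whatever credentials s3fs resolves by default."""
    import s3fs  # imported lazily so the helper above has no hard dependency

    # With no arguments, s3fs relies on the default credential chain
    # (environment variables, profiles, or an instance role on EC2).
    return check_s3_access(s3fs.S3FileSystem(), path)
```

Calling check_with_default_credentials('my-bucket/some/prefix') (a placeholder path) returns a tuple such as (False, 'PermissionError: ...') when the listing is refused, which helps separate credential problems from vaex problems.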
@JovanVeljanoski authentication is built in by default on an EC2 instance when communicating with S3, so that can't possibly be the issue, especially given that other libraries such as pandas work perfectly fine reading files from S3 under the exact same circumstances.
Can you try installing Vaex through pip directly instead of in a Conda environment? You're trying to reproduce the issue in quite a different way from how the error was originally described, which might be why you are not arriving at the same result.
Process: brand new EC2 instance -> pip install vaex -> import s3fs/fsspec -> attempt to vaex.open() a file from the S3 bucket
@JovanVeljanoski have you tried this? It works for me on files that fail with vaex.open():
import pyarrow.parquet as pq
import vaex
table1 = pq.read_table('s3://relative_path_to_folder')
df1 = vaex.from_arrow_table(table1)
@ivanachillee I just tried with venv, installing everything with pip, and it works as before. I don't have access to an EC2 instance myself, and even the S3 bucket is not really mine; I just use it for testing. I am more on the GCP side of things, so perhaps @maartenbreddels can give more insight into what is happening with the S3 stuff.
@rey-eb Vaex is using pyarrow under the hood for the cloud I/O, so I don't expect it to fail where pyarrow would not.
Do you notice that vaex.open() works for some files but not for others?
@JovanVeljanoski
"@rey-eb Vaex is using pyarrow under the hood for the cloud I/O, so I don't expect it to fail where pyarrow would not. Do you notice that vaex.open() works for some files but not for others?"
yes exactly. I have raised the details of this issue on the slack here.
Did you solve the problem @ivanachillee or did you give up on it?
I'd like to figure it out.. so others don't get stuck as well.. but unfortunately I can't reproduce the problem.. If you or @rey-eb can provide some more details on how to reproduce this.. we can definitely take a look.
If I understood correctly @rey-eb mentioned that she was able to access certain files but not others.. (via slack). Is that correct?
@JovanVeljanoski I can give you the output of the readings I get (I generated these just now). The error is raised when I do df.head(2) and not df.shape. Also, the second way is much slower. What other info can I give you that would be helpful?
# First way: open directly with vaex
import vaex
df = vaex.open('s3://PATH/LOAD00000001.parquet')
print('shape:', df.shape)
print(df.head(2))
# Second way: read with pyarrow, then convert to a vaex DataFrame
import pyarrow.parquet as pq
import vaex
table = pq.read_table('s3://PATH/LOAD00000001.parquet')
df = vaex.from_arrow_table(table)
print('shape:', df.shape)
print(df.head(2))
I realized that vaex works if you provide a list of the parquet files you are reading, instead of their parent folder. I'm using awswrangler to get a list of the parquet files from the S3 bucket I want to read, and then calling vaex.open(list_of_files).
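A minimal sketch of that workaround. It swaps in s3fs for awswrangler to build the file list, which is an assumption made here to keep the dependencies to libraries already discussed in this thread; the helper names and the prefix are hypothetical.

```python
def parquet_uris(fs, prefix):
    """Return sorted s3:// URIs of every .parquet object under `prefix`."""
    # fs.find() walks the prefix recursively. s3fs returns keys as
    # "bucket/key" without the scheme, so the scheme is added back here.
    return sorted(
        f"s3://{key}" for key in fs.find(prefix) if key.endswith(".parquet")
    )


def open_parquet_folder(prefix):
    """Open all parquet files under an S3 prefix via an explicit file list."""
    import s3fs  # lazy imports: only needed when actually talking to S3
    import vaex

    files = parquet_uris(s3fs.S3FileSystem(), prefix)
    if not files:
        raise FileNotFoundError(f"no .parquet files found under {prefix!r}")
    # Pass the list of files, not the parent folder, as described above.
    return vaex.open(files)
```

Filtering on the .parquet suffix also skips marker objects that some writers leave next to the data files, which may or may not be related to the failure when opening the folder itself.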
Description
I can't load data from S3 by doing this:
import vaex
vaex.open("s3://myfile.parquet")
I get the following error
Software information
OS: Ubuntu
Additional information
I'm running on an EC2 instance, so all the credentials for accessing S3 are already in place