Open alejandro-ponder opened 2 years ago
@alejandro-ponder thank you for reporting this issue! Modin is actually defaulting to pandas for a different reason: it thinks that the file doesn't exist. pandas can start reading the file from the HTTPS URL (I wasn't able to get it to finish after several minutes), but Modin doesn't allow that.
We are tracking support for reading data from an HTTPS URL in #3170.
But even when I change your path to the s3:// format, I get a different error ending in ConnectTimeoutError: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token" (shown below). The error originates here in FileDispatcher.file_exists. We need to investigate that. When I catch the ConnectTimeoutError there, I get another error when we actually open the file (second stack trace below), also ending in ConnectTimeoutError: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token".
@alejandro-ponder never mind, that seems to be some kind of network error on my machine. Even the pandas read import pandas as pd; pd.read_csv("s3://nyc-tlc/trip data/yellow_tripdata_2009-01.csv", nrows=10) works on another machine but not on mine. Could you please try the s3:// path instead of the https:// one?
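For anyone wanting to try the s3:// form of a URL like the one in the reproducer, here is a minimal sketch of the rewrite. The helper name https_to_s3 is hypothetical (it is not part of Modin or pandas), and it assumes only the two global amazonaws.com URL styles; regional endpoints are not handled.

```python
from urllib.parse import unquote, urlparse

S3_HOST = "s3.amazonaws.com"


def https_to_s3(url: str) -> str:
    """Rewrite an S3 HTTPS URL as s3://bucket/key (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Decode percent-escapes such as %20 and drop the leading slash.
    path = unquote(parsed.path).lstrip("/")

    if host == S3_HOST:
        # Path-style URL: https://s3.amazonaws.com/<bucket>/<key>
        bucket, _, key = path.partition("/")
    elif host.endswith("." + S3_HOST):
        # Virtual-hosted-style URL: https://<bucket>.s3.amazonaws.com/<key>
        bucket = host[: -len("." + S3_HOST)]
        key = path
    else:
        raise ValueError(f"not a recognized S3 HTTPS URL: {url}")

    if not bucket or not key:
        raise ValueError(f"could not extract bucket and key from: {url}")
    return f"s3://{bucket}/{key}"
```

The returned string can then be passed to read_csv in place of the HTTPS URL.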
Tried changing the path to s3://, and the read takes about 7 min 13 s, while pandas takes about 3 min 35 s for the same workload.
Note: I'm also using the error_bad_lines=False argument. Not sure if that could be affecting anything?
@alejandro-ponder is this in a fresh environment? Your machine may not have enough memory to do both Modin and pandas in the same notebook/interpreter environment.
It should be. I restarted the kernel between the two runs.
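To make the two timings comparable, a small harness along these lines may help. This is only a sketch: time_call and format_elapsed are hypothetical helper names, and it assumes each library is timed in its own fresh kernel or process, as described above.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def format_elapsed(seconds):
    """Format a duration in the 'Xmin Ys' style used in this thread."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}min {secs}s"
```

For example, after import modin.pandas as pd, calling time_call(pd.read_csv, path, sep="\t") in one fresh kernel, and the same with import pandas as pd in another, gives two numbers measured the same way.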
@prutskov this issue should be handled now, right?
Yes, Modin now handles https-like addresses. But I would prefer to run the reproducer before closing the issue.
@alejandro-ponder could you please re-check if this is still happening?
If I try to read compressed data (in my case gzip) from S3, Modin doesn't read it in parallel.
Can reproduce with the following:
import modin.pandas as pd; pd.read_csv("https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz", compression="gzip", header=0, sep="\t")