Reading source.coop directly from the AWS endpoint (with DuckDB) is fast and reliable:
```python
import ibis
from ibis import _

con = ibis.duckdb.connect()
query = """
CREATE OR REPLACE SECRET secret1 (
    TYPE S3,
    ENDPOINT 's3.us-west-2.amazonaws.com',
    URL_STYLE 'path'
);
"""
con.raw_sql(query)
gbif = con.read_parquet("s3://us-west-2.opendata.source.coop/cboettig/gbif/2024-10-01/**")
gbif.count().execute()
```
succeeds and shows:
```
Wall time: 19.2 s
2891218873
```
Reading the same data from the same bucket via the data.source.coop endpoint is slower and usually fails with a 502 error (occasionally a 404, neither of which occurs when reading the identical data through the AWS endpoint above):
```python
con = ibis.duckdb.connect()
query = """
CREATE OR REPLACE SECRET secret1 (
    TYPE S3,
    ENDPOINT 'data.source.coop',
    URL_STYLE 'path'
);
"""
con.raw_sql(query)
gbif = con.read_parquet("s3://cboettig/gbif/2024-10-01/**")
gbif.count().execute()
```
throws:
```
HTTPException: HTTP Error: HTTP GET error on 'https://data.source.coop/cboettig/gbif/2024-10-01/000303.parquet' (HTTP 502)
```
after taking significantly longer to run. (The 502 does not always occur on the same parquet file.)
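For reference, with `URL_STYLE 'path'` DuckDB addresses objects as `https://<endpoint>/<bucket>/<key>` rather than the virtual-hosted `https://<bucket>.<endpoint>/<key>`, which is why the failing request above hits `https://data.source.coop/cboettig/...`. A quick illustrative sketch of that mapping (the `path_style_url` helper is hypothetical, not DuckDB's actual code):

```python
def path_style_url(s3_url: str, endpoint: str) -> str:
    # With URL_STYLE 'path', the bucket becomes the first path segment
    # under the endpoint instead of a subdomain (virtual-hosted style).
    bucket_and_key = s3_url.removeprefix("s3://")
    return f"https://{endpoint}/{bucket_and_key}"

path_style_url("s3://cboettig/gbif/2024-10-01/000303.parquet", "data.source.coop")
# -> 'https://data.source.coop/cboettig/gbif/2024-10-01/000303.parquet'
```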
(Note that mirroring this bucket to another S3-compatible system (e.g. Ceph, MinIO) on an alternative endpoint and querying it with the identical code as above works fine.)
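Not a fix, but since the 502s are intermittent and move between parquet files, a client-side retry works around them in practice. A minimal sketch, assuming the error is surfaced as an exception whose message contains "502" (the `with_retries` helper is hypothetical, not part of ibis or DuckDB):

```python
import time

def with_retries(fn, attempts=4, delay=1.0, backoff=2.0):
    """Call fn(); retry with exponential backoff when the error text
    mentions a 502, since those appear to be transient here."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as err:
            # Re-raise immediately for non-502 errors or on the last attempt.
            if "502" not in str(err) or attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= backoff

# e.g. with_retries(lambda: gbif.count().execute())
```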