source-cooperative / data.source.coop

Source Cooperative Data Proxy
https://data.source.coop
MIT License

[Bug] 502 errors streaming from data.source.coop #22

Open cboettig opened 6 days ago

Reading the source.coop data directly from the AWS endpoint (with DuckDB) is fast and reliable:

import ibis

con = ibis.duckdb.connect()

# Point DuckDB's S3 client at the AWS regional endpoint, using path-style URLs.
query = '''
CREATE OR REPLACE SECRET secret1 (
    TYPE S3,
    ENDPOINT 's3.us-west-2.amazonaws.com',
    URL_STYLE 'path'
);
'''

con.raw_sql(query)

gbif = con.read_parquet("s3://us-west-2.opendata.source.coop/cboettig/gbif/2024-10-01/**")
gbif.count().execute()

succeeds and shows:

Wall time: 19.2 s
2891218873

Reading the same data from the same bucket through the data.source.coop endpoint is slower and usually fails with a 502 error (occasionally a 404 error, which never occurs when reading the identical data directly from the AWS endpoint as above):


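# Fresh connection; `ibis` is already imported above.
con = ibis.duckdb.connect()

# Same secret as before, but pointing at the data.source.coop proxy.
query = '''
CREATE OR REPLACE SECRET secret1 (
    TYPE S3,
    ENDPOINT 'data.source.coop',
    URL_STYLE 'path'
);
'''

con.raw_sql(query)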
gbif = con.read_parquet("s3://cboettig/gbif/2024-10-01/**")
gbif.count().execute()

throws:

HTTPException: HTTP Error: HTTP GET error on 'https://data.source.coop/cboettig/gbif/2024-10-01/000303.parquet' (HTTP 502)

after taking significantly longer to run. The 502 error does not always occur on the same Parquet file.
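
Since the failing file varies between runs, it may help to probe individual files through the proxy outside of DuckDB and tally the status codes. The following is only a rough sketch using the Python standard library. It assumes the files follow the six-digit zero-padded naming seen in the error message above (`000303.parquet`) and that a one-byte range request is a fair stand-in for the range reads DuckDB issues; `parquet_url`, `probe_status` and `tally_statuses` are hypothetical helper names, not part of any library.

```python
from collections import Counter
import urllib.error
import urllib.request

# Assumption: base path taken from the URL in the error message above.
BASE = "https://data.source.coop/cboettig/gbif/2024-10-01"


def parquet_url(index, base=BASE):
    """Build a file URL, assuming six-digit zero-padded names like 000303.parquet."""
    return f"{base}/{index:06d}.parquet"


def probe_status(url, timeout=30):
    """Request the first byte of `url` and return the HTTP status code.

    Returns None if the request fails without any HTTP response
    (for example a connection error or timeout).
    """
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 404 or 502 are raised as HTTPError.
        return err.code
    except urllib.error.URLError:
        return None


def tally_statuses(urls, probe_fn=probe_status):
    """Count how often each status code is returned across `urls`."""
    return Counter(probe_fn(url) for url in urls)


# Example (performs real network requests, so left commented out):
# print(tally_statuses(parquet_url(i) for i in range(300, 310)))
```

Running the tally several times over the same range of files would show whether the 502 responses cluster on particular objects or are spread randomly, which is what the DuckDB runs suggest.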

Note that mirroring this bucket to another S3-compatible system (e.g. Ceph, MinIO) on an alternative endpoint, and querying it with code identical to the above, works fine.