Open bkachroo opened 1 year ago
I set CPL_CURL_VERBOSE and the logs look like this:
...
* Couldn't find host sentinel-cogs.s3.us-west-2.amazonaws.com in the (nil) file; using defaults
* getaddrinfo() thread failed to start
* Could not resolve host: sentinel-cogs.s3.us-west-2.amazonaws.com
* Closing connection 0
Error opening 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/43/T/EL/2017/8/S2A_43TEL_20170814_0_L2A/B04.tif': RasterioIOError('CURL error: getaddrinfo() thread failed to start')
...
* Trying 52.218.251.1:443...
* Connected to sentinel-cogs.s3.us-west-2.amazonaws.com (52.218.251.1) port 443 (#0)
...
ALPN, offering h2
* * ALPN, offering h2
* ALPN, offering http/1.1
...
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=*.s3-us-west-2.amazonaws.com
* start date: Sep 21 00:00:00 2022 GMT
* expire date: Aug 24 23:59:59 2023 GMT
* subjectAltName: host "sentinel-cogs.s3.us-west-2.amazonaws.com" matched cert's "*.s3.us-west-2.amazonaws.com"
* issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
* SSL certificate verify ok.
> HEAD /sentinel-s2-l2a-cogs/43/T/EK/2017/8/S2A_43TEK_20170814_0_L2A/B04.tif HTTP/1.1
Host: sentinel-cogs.s3.us-west-2.amazonaws.com
Accept: */*
...
> HEAD /sentinel-s2-l2a-cogs/43/T/EK/2017/5/S2A_43TEK_20170526_0_L2A/B04.tif HTTP/1.1
Host: sentinel-cogs.s3.us-west-2.amazonaws.com
Accept: */*
...
* Server certificate:
* subject: CN=*.s3-us-west-2.amazonaws.com
* start date: Sep 21 00:00:00 2022 GMT
* expire date: Aug 24 23:59:59 2023 GMT
* subjectAltName: host "sentinel-cogs.s3.us-west-2.amazonaws.com" matched cert's "*.s3.us-west-2.amazonaws.com"
* issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
* SSL certificate verify ok.
> HEAD /sentinel-s2-l2a-cogs/43/T/EK/2017/6/S2A_43TEK_20170615_0_L2A/B04.tif HTTP/1.1
Host: sentinel-cogs.s3.us-west-2.amazonaws.com
Accept: */*
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< x-amz-id-2: gQ75z9Q88+GZU00jMDrdk/R1YC7bX0+Cpsmj1nP7FY7af8RwD38GwX3KePG4qR77a6NzcpjEnXI=
< x-amz-request-id: NBB8ZVDWNW3A70JR
< Date: Tue, 28 Feb 2023 22:01:06 GMT
< Last-Modified: Tue, 29 Sep 2020 00:07:12 GMT
< ETag: "2385f248d6f29bf7450b66442c23ea34-26"
< Cache-Control: public, max-age=31536000, immutable
< Accept-Ranges: bytes
< Content-Type: image/tiff; application=geotiff; profile=cloud-optimized
< Server: AmazonS3
< Content-Length: 213606320
<
* Connection #0 to host sentinel-cogs.s3.us-west-2.amazonaws.com left intact
* Couldn't find host sentinel-cogs.s3.us-west-2.amazonaws.com in the (nil) file; using defaults
* Found bundle for host sentinel-cogs.s3.us-west-2.amazonaws.com: 0x7fbdfb6ad1c0 [serially]
* Can not multiplex, even if we wanted to!
* Re-using existing connection! (#0) with host sentinel-cogs.s3.us-west-2.amazonaws.com
* Connected to sentinel-cogs.s3.us-west-2.amazonaws.com (52.218.251.1) port 443 (#0)
> GET /sentinel-s2-l2a-cogs/43/T/EK/2017/8/S2A_43TEK_20170814_0_L2A/B04.tif HTTP/1.1
Host: sentinel-cogs.s3.us-west-2.amazonaws.com
Accept: */*
Range: bytes=0-32767
...
< HTTP/1.1 206 Partial Content
< x-amz-id-2: aYiWDbOY15Cj/9CNwSoG9mdM7npRwnDf13LgJYmZtkzTkJ+KPRr2Txd9Ha/UePLnNSsOO1gW4j8=
< x-amz-request-id: EYK81C9VC290SNWS
< Date: Tue, 28 Feb 2023 22:01:07 GMT
< Last-Modified: Fri, 18 Sep 2020 11:43:44 GMT
< ETag: "929b09a1ad65ad1c7bc81b3eff02ff0b-28"
< Cache-Control: public, max-age=31536000, immutable
< Accept-Ranges: bytes
< Content-Range: bytes 223723520-224821247/233121230
< Content-Type: image/tiff; application=geotiff; profile=cloud-optimized
< Server: AmazonS3
< Content-Length: 1097728
...
* Couldn't find host sentinel-cogs.s3.us-west-2.amazonaws.com in the (nil) file; using defaults
* Found bundle for host sentinel-cogs.s3.us-west-2.amazonaws.com: 0x7fbdfbfd0c50 [serially]
* Can not multiplex, even if we wanted to!
* Re-using existing connection! (#0) with host sentinel-cogs.s3.us-west-2.amazonaws.com
* Connected to sentinel-cogs.s3.us-west-2.amazonaws.com (52.218.251.1) port 443 (#0)
> GET /sentinel-s2-l2a-cogs/43/T/EL/2017/1/S2A_43TEL_20170119_0_L2A/B04.tif HTTP/1.1
Host: sentinel-cogs.s3.us-west-2.amazonaws.com
Accept: */*
Range: bytes=218972160-220151807
...
Except for the section at the very beginning, getaddrinfo is mentioned nowhere else. The error is raised in Python at that spot, but the curl logs continue with further requests while the Python program does some cleanup and logging before shutting down.
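For anyone trying to reproduce these logs, here is a minimal sketch of one way to turn on the verbose curl output. It assumes the option is supplied as an environment variable before the first HTTP request is made, which GDAL generally honors for its configuration options. How it was actually set for the logs above is not stated.

```python
import os

# Assumption: GDAL reads configuration options from the process environment,
# so this needs to be set before the first remote dataset is opened.
os.environ["CPL_CURL_VERBOSE"] = "YES"

# Any rasterio/GDAL HTTP read performed after this point should then
# print curl's connection and header details to stderr.
```

With rasterio, passing the same option as a keyword to a rasterio.Env context should be an equivalent way to scope it to a block of code.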
I compared this against the verbose curl output from the main repo (before multithreading). These four lines are present in the multithreaded output (which errors) and absent from the serial version; everything else is the same.
* getaddrinfo() thread failed to start
* Could not resolve host: sentinel-cogs.s3.us-west-2.amazonaws.com
* Closing connection 0
Error opening 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/43/T/EL/2017/8/S2A_43TEL_20170814_0_L2A/B04.tif': RasterioIOError('CURL error: getaddrinfo() thread failed to start')
Instead, the serial version does:
* Couldn't find host sentinel-cogs.s3.us-west-2.amazonaws.com in the (nil) file; using defaults
* Trying 52.218.243.121:443...
* Connected to sentinel-cogs.s3.us-west-2.amazonaws.com (52.218.243.121) port 443 (#0)
It simply skips that section.
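The comparison between the two logs can also be done mechanically. The following is only a sketch: the helper names are hypothetical, the sample lines are abbreviated from the logs quoted in this post, and it assumes that IP addresses and pointer values are the only details that differ harmlessly between runs.

```python
import re

def normalize(line):
    """Mask details that legitimately differ between runs (IPs, pointers)."""
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "<ip>", line)
    line = re.sub(r"0x[0-9a-f]+", "<ptr>", line)
    return line.strip()

def only_in_first(first_log, second_log):
    """Return normalized lines of first_log that never occur in second_log."""
    seen = {normalize(line) for line in second_log}
    extra = []
    for line in first_log:
        norm = normalize(line)
        if norm not in seen and norm not in extra:
            extra.append(norm)
    return extra

# Abbreviated sample lines taken from the logs in this post.
threaded = [
    "* getaddrinfo() thread failed to start",
    "* Could not resolve host: sentinel-cogs.s3.us-west-2.amazonaws.com",
    "* Closing connection 0",
    "* Trying 52.218.251.1:443...",
]
serial = [
    "* Trying 52.218.243.121:443...",
]

extra = only_in_first(threaded, serial)
for line in extra:
    print(line)
```

On real logs, request IDs, dates and ETags would need masking in the same way before the difference becomes readable.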
I tried running the threaded reads outside of stackstac, and they work.
import rasterio as rio
from rasterio.windows import Window
from concurrent.futures import ThreadPoolExecutor, as_completed

setlist = [
    'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/18/T/VR/2017/10/S2A_18TVR_20171006_0_L2A/SCL.tif',
    'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/18/T/VR/2017/10/S2B_18TVR_20171008_0_L2A/SCL.tif',
    'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/18/T/VR/2017/10/S2A_18TVR_20171013_0_L2A/SCL.tif',
]

def readdata(url):
    # Open each dataset with its own handle (no sharing between threads)
    with rio.open(url, sharing=False) as src:
        # Read a 50x50 pixel window from band 1
        data = src.read(1, window=Window(0, 0, 50, 50))
    return data

# One worker thread per URL
thread_pool = ThreadPoolExecutor(len(setlist))
futures = []
for url in setlist:
    # Pass url as an argument; a bare lambda would late-bind the loop variable
    futures.append(thread_pool.submit(readdata, url))
datas = []
for future in as_completed(futures):
    datas.append(future.result())
    print(datas[-1])
I also tried this with 50 images and it works.
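One detail worth checking in a standalone test like this is how the URL reaches the worker. Python closures look up loop variables when they are called, not when they are defined, so a lambda created inside the loop can end up reading a different URL than intended. A small sketch with a hypothetical stand-in function (fake_read) and made-up file names, no network involved:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_read(url):
    """Hypothetical stand-in for a network read; just echoes its argument."""
    return url

urls = ["a.tif", "b.tif", "c.tif"]

# Late binding: each lambda looks up `url` when it is called,
# by which time the loop has already finished.
callbacks = []
for url in urls:
    callbacks.append(lambda: fake_read(url))
late_bound = [cb() for cb in callbacks]

# Passing the value as an argument fixes it at submit time.
with ThreadPoolExecutor(len(urls)) as pool:
    futures = [pool.submit(fake_read, u) for u in urls]
    bound = [f.result() for f in futures]

print(late_bound)
print(bound)
```

With a thread pool, the lambda form may or may not misbehave depending on whether a worker runs before the loop advances, so a test can appear to pass while reading the same file several times.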
Continuing from coiled/feedback#229.
Previous Discussion
Attempted Solution
I attempted to multithread the reads in fetch_raster_window:
Problem
Unfortunately, this produces an error:
What I Tried:
thread_pool
gdal_env
rasterio.session
synchronous scheduler
Do you have any suggestions for how to deal with this problem? Or is there a better approach for me to achieve parallelism here?