locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.33k stars 360 forks source link

RangeReader should not require knowing file length #3239

Open echeipesh opened 4 years ago

echeipesh commented 4 years ago

Currently RangeReader implementations will clip read requests to known maximum file size:

https://github.com/locationtech/geotrellis/blob/514e6fa0aeacd883080fde4df2b53d812013d255/util/src/main/scala/geotrellis/util/RangeReader.scala#L41-L42

In order to do this typically a HEAD request needs to be made. This is sometime problamatic on S3 because public requester-pays buckets often only allow GET and not HEAD requests.

In this PR we tried to work around this restriction: https://github.com/locationtech/geotrellis/pull/3025

https://github.com/locationtech/geotrellis/blob/514e6fa0aeacd883080fde4df2b53d812013d255/s3/src/main/scala/geotrellis/store/s3/util/S3RangeReader.scala#L55-L60

by opening a get object stream and reading the length metadata. Followed by abort/close calls to avoid actually pulling down any data.

However, some combination of AWS SDK and Apache http lib can produce the following exception

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 2,196,257,163; received: 0)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
        at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
        at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
        at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
        at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.close(ChecksumValidatingInputStream.java:162)
        at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
        at software.amazon.awssdk.core.io.SdkFilterInputStream.close(SdkFilterInputStream.java:83)
        at geotrellis.store.s3.util.S3RangeReader.totalLength$lzycompute(S3RangeReader.scala:59)
        at geotrellis.store.s3.util.S3RangeReader.totalLength(S3RangeReader.scala:44)
        at geotrellis.util.StreamingByteReader.ensureChunk(StreamingByteReader.scala:109)
        at geotrellis.util.StreamingByteReader.get(StreamingByteReader.scala:130)
        at geotrellis.raster.io.geotiff.reader.GeoTiffInfo$.read(GeoTiffInfo.scala:127)

Issue

This issue is to remove all usage of RangeRader.totalLength method and deprecate it for removal in next major version.

Ultimately we do not need to know the total length of the file. RangeReader interface is used exclusively to read GeoTiffs (and possibly similar files). First few KB of such files are a header that will contain or point to table of offsets. Consequently when we're reading GeoTiff segments we're always basing the read ranges from the offset table and never have a chance to walk off the end of the file. One exception to this is performing the initial header read. In theory the file should never be smaller than that requests, but we should double check the behavior of GeoTiff reader if it is.

As a side benefit we can remove one extra HEAD/GET request per file and avoid paying the AWS costs on that.

pomadchin commented 4 years ago

I'm wondering, would it throw on the line 75? In other words is it a combination of close and abort or it will still throw even without abort?

https://github.com/locationtech/geotrellis/blob/514e6fa0aeacd883080fde4df2b53d812013d255/s3/src/main/scala/geotrellis/store/s3/util/S3RangeReader.scala#L64-L75

Earlier we had no problems with it, since everyone had a dependnecy on the same SDK and the same org.apache.http library.