charalamm opened this issue 9 months ago
Thanks for raising this @charalamm, better error handling and tracking is certainly needed, see #101. It can be a little tricky to support consistently across Dask and direct loads though. Right now a major refactor of the loading code is taking place to support hyperspectral data sources. As part of that work we are adding an IO driver abstraction that allows users to bring their own loader, mostly to enable efficient access to data sources that rasterio/GDAL struggle with. Once that is completed, we should be in a much better position to experiment with various error handling approaches and to give library users more control over that aspect of things when they need it.
Initially that would be implemented with various forms of callbacks into user code, either to make a decision or to keep track of failures; as we develop a better understanding we will provide non-code mechanisms, like your suggested regex-based matching (see the sketch below for the general callback idea). My concern is with the rasterio/GDAL boundary: at least in the past it was not always possible to bubble up GDAL errors into Python code without losing some fidelity in error reporting (just because you see an error printed to stderr doesn't mean Python has access to that same information in the exception object).
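Purely as an illustration of that callback idea (nothing here is an actual odc-stac API; `on_error` and its return contract are hypothetical):

```python
# Hypothetical sketch only -- odc-stac does not expose this API today.
def on_error(url: str, attempt: int, error: Exception) -> str:
    """Decide what the loader should do with a failed read.

    Returns one of "retry", "skip" (fill with nodata) or "raise".
    """
    if attempt < 3 and "CURL" in str(error):
        return "retry"
    return "skip"

# A future load call might then accept something like:
#   odc.stac.load(items, bands=..., on_error=on_error)
```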
In the meantime, have you experimented with the settings available within GDAL, things like `GDAL_HTTP_MAX_RETRY` and others in the `GDAL_HTTP_*` and `CPL_VSIL_CURL_*` families? The fact that you suspect bit-errors in the HTTP responses you receive is worrying; there have been cache corruption issues in GDAL in the past, but it could also be in your infrastructure, given that you also observe DNS errors (is this inside k8s?).
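A minimal sketch of applying those options from Python, assuming reads go through rasterio (the `/vsiaz/my-container` prefix is a placeholder; note that GDAL's built-in retry only triggers for specific HTTP codes such as 429, 502, 503 and 504):

```python
import rasterio

# Sketch: per-process GDAL HTTP settings via rasterio's environment.
gdal_opts = dict(
    GDAL_HTTP_MAX_RETRY=5,    # retries on retryable HTTP errors (429/502/503/504)
    GDAL_HTTP_RETRY_DELAY=1,  # seconds between retry attempts
    CPL_VSIL_CURL_NON_CACHED="/vsiaz/my-container",  # placeholder: bypass the VSI curl cache for this prefix
)

with rasterio.Env(**gdal_opts):
    ...  # e.g. odc.stac.load(...) or direct rasterio reads
```

With Dask these options need to reach the workers as well, e.g. via `odc.stac.configure_rio`.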
Hello @Kirill888, thanks for your quick response.
Yes, unfortunately my network is not great.
I have experimented with the GDAL environment variables but I did not notice any difference. I think that is because the responses come back with status code 500, or GDAL cannot connect at all, so `GDAL_HTTP_MAX_RETRY` never kicks in. The only thing that worked for me (not with odc-stac but with stackstac) was to disable response caching, catch the errors, and retry from Python, though I am not sure whether that added any performance overhead.
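For reference, that catch-and-retry workaround amounts to something like this sketch (simplified; the actual read call in my pipeline is more involved):

```python
import time

import rasterio
from rasterio.errors import RasterioIOError


def read_with_retries(url: str, max_retries: int = 3, delay: float = 1.0):
    """Retry a read on I/O errors; a sketch of the Python-level workaround."""
    for attempt in range(max_retries + 1):
        try:
            with rasterio.open(url) as src:
                return src.read()
        except RasterioIOError:
            if attempt == max_retries:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```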
Hello,
We are planning to use odc-stac for some analysis. We have the data on Azure and we access it with the `az://` prefix. In every analysis, when trying to read the files there are always some network errors, which result in data missing from the final data structure. So far I have caught the following errors:
Do you think it would be useful to add a mechanism to retry reading on some errors? I think I can work on a PR if you are interested in this feature. Feel free to close this if you are not interested.
A possible approach?
Since some of these errors can be legitimate, it should be up to the user to decide whether they want to retry and on which errors. One option would be to let the user define a list of regexes or strings, and odc-stac could check whether it should retry based on that. One problem is that GDAL caches these errors, so it might be necessary to use `CPL_VSIL_CURL_NON_CACHED`.
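A rough sketch of that idea (all names here are hypothetical; nothing like this exists in odc-stac yet):

```python
import re

# Hypothetical: user-supplied patterns for errors that are safe to retry.
RETRYABLE_PATTERNS = [
    re.compile(r"CURL error"),
    re.compile(r"HTTP response code: 5\d\d"),
]


def should_retry(error: Exception) -> bool:
    """Return True when the error message matches a user-defined pattern."""
    msg = str(error)
    return any(p.search(msg) for p in RETRYABLE_PATTERNS)
```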