mmomtchev / node-gdal-async

Node.js bindings for GDAL (Geospatial Data Abstraction Library) with full async support
https://mmomtchev.github.io/node-gdal-async/
Apache License 2.0

Uncatchable error is thrown if TIFFReadEncodedStrip() fails #56

Closed: potion-cellar closed this issue 1 year ago

potion-cellar commented 1 year ago

Hello, big fan of this library. Have run into an issue in a massive weather model processing pipeline. Every now and then I get an uncatchable error:

try {
    const outDs = await inDs.driver.createCopyAsync(
        outPath,
        inDs
    );
} catch {
    // ...never reached
}
node:events:491
      throw er; // Unhandled 'error' event
      ^

Error: TIFFReadEncodedStrip() failed.
Emitted 'error' event on RasterReadStream instance at:
    at emitErrorNT (node:internal/streams/destroy:157:8)
    at emitErrorCloseNT (node:internal/streams/destroy:122:3)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)

inDs is a GeoTIFF stored in /vsimem/, and I have verified that the input is a valid gdal.Dataset etc. This works 99% of the time, but every now and then the error occurs. That is weird, because presumably the GeoTIFF being copied into /vsimem/ is valid (and rerunning the pipeline against the same data usually works).

Anyways, that's a secondary concern. The main problem is that the error completely crashes the node process, which then has to be restarted, and I can't seem to catch it.

Let me know if more reproduction steps are necessary...a bit tricky as the error occurs when trying to copy a raster from memory :)

This has happened on both 3.6.1 and 3.5.3

mmomtchev commented 1 year ago

The error cannot be caught with try/catch because it is emitted as an error event on a stream. Normally, to catch this error, one should listen to the error event on the stream. Where is this stream coming from? Driver.createCopyAsync is a C++ function; it does not use the streams API. Do you also use calcAsync or a RasterReadStream?
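To illustrate the point about listening to the error event, here is a minimal sketch using only Node's built-in stream module. It does not use node-gdal-async at all: makeFailingStream is a hypothetical stand-in for a RasterReadStream that fails mid-read, and drain is an illustrative helper, not part of any library API. The idea is that once an error listener is attached and wired to a promise rejection, the failure becomes something an ordinary try/catch around await can see.

```javascript
const { Readable } = require('node:stream');

// Hypothetical stand-in for a stream that fails while reading,
// reusing the error message from the report above.
function makeFailingStream() {
  return new Readable({
    read() {
      // destroy(err) makes the stream emit an 'error' event
      this.destroy(new Error('TIFFReadEncodedStrip() failed.'));
    }
  });
}

// Illustrative helper: turn the stream's 'error' event into a promise rejection.
function drain(stream) {
  return new Promise((resolve, reject) => {
    stream.on('error', reject); // without a listener, Node throws "Unhandled 'error' event"
    stream.on('end', resolve);
    stream.resume(); // start flowing so that read() is actually called
  });
}

async function main() {
  try {
    await drain(makeFailingStream());
    return 'completed';
  } catch (err) {
    return `caught: ${err.message}`;
  }
}

main().then(console.log); // prints "caught: TIFFReadEncodedStrip() failed."
```

If the stream.on('error', reject) line is removed, the same failure is thrown from the event loop instead, outside any try/catch, which matches the crash shown in the report.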

potion-cellar commented 1 year ago

Yes, there is a calcAsync involved. Sorry, this is in a pipeline with a lot of unresolved promises doing work. After investigating some more, I think the error is coming from the calcAsync, which sits very close to the code snippet above. I had to add a bunch of logging around what was happening right when the error occurs, as the error itself carries no useful stack trace.

BTW, I did fix a rare race condition that was triggering the error in the first place (the dataset was evidently being written to and read from by multiple sources simultaneously), but the error still crops up very infrequently whenever I have a calcAsync.

mmomtchev commented 1 year ago

Can you systematically reproduce the error by using a corrupted GTiff? Take a normal GTiff and truncate it, then verify that trying to transcode it with gdal_translate returns the same error.
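One way to produce such a truncated file from Node is sketched below. It uses only the built-in fs module; writeTruncatedCopy and the fraction parameter are hypothetical names chosen for illustration, and cutting the file to half its size is an arbitrary choice that may or may not trigger the same GDAL error for a given file.

```javascript
const fs = require('node:fs');

// Hypothetical helper: copy `src` to `dst`, then cut the copy down to
// `fraction` of its original size so the tail of the raster data is missing.
function writeTruncatedCopy(src, dst, fraction = 0.5) {
  fs.copyFileSync(src, dst);
  const size = fs.statSync(dst).size;
  const newSize = Math.floor(size * fraction);
  fs.truncateSync(dst, newSize); // the original file is left untouched
  return newSize;
}
```

For example, after writeTruncatedCopy('input.tif', 'truncated.tif') (both file names are placeholders), running gdal_translate truncated.tif out.tif should show whether GDAL itself reports the read failure for that file.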

mmomtchev commented 1 year ago

Also, try this version to see if it fixes your problem: https://github.com/mmomtchev/node-gdal-async/tree/catch-exceptions

potion-cellar commented 1 year ago

@mmomtchev I haven't had the time to take a look, but I did take note of your merge, so I assume you were able to replicate the issue. Looking forward to seeing this upstream. Thank you for addressing this so quickly.