openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally jpylyzer is able to extract technical characteristics.
http://jpylyzer.openpreservation.org/

JPEG 2000 files downloaded from USGS are marked as invalid #192

Closed orencohendev closed 1 year ago

orencohendev commented 1 year ago

Here's what I did:

from jpylyzer import jpylyzer
result = jpylyzer.checkOneFile("USGS_FILE.jp2")
print(result.find("isValid").text)

The result is False. The file is clearly a JP2 with geographical data and can be viewed in QGIS.

Here's an example file to reproduce this with: https://prd-tnm.s3.amazonaws.com/StagedProducts/NAIP/ca_2016/37122/m_3712213_se_10_h_20160625_20161004.jp2

bitsgalore commented 1 year ago

I can confirm I'm able to reproduce this result. Here are the tests that failed validation:

<tests>
    <contiguousCodestreamBox>
        <foundExpectedNumberOfTiles>False</foundExpectedNumberOfTiles>
        <foundExpectedNumberOfTileParts>False</foundExpectedNumberOfTileParts>
        <tileParts>
            <tilePart>
                <sot>
                    <psotIsValid>False</psotIsValid>
                </sot>
                <foundNextTilePartOrEOC>False</foundNextTilePartOrEOC>
            </tilePart>
        </tileParts>
    </contiguousCodestreamBox>
</tests>

This is not a Jpylyzer bug; it indicates a problem with this JP2. In particular, the actual number of tiles in this file does not match the expected number of tiles (as defined by the file's SIZ marker segment).

orencohendev commented 1 year ago

This is reproducing for me on other GIS-related JP2s from USGS. Could it be different behavior for files that are meant for GIS usage?

bitsgalore commented 1 year ago

Hi Oren,

I just gave this a closer look, and what's happening is basically this.

Based on the overall geometry of the image which is defined in the SIZ marker, Jpylyzer calculates the expected number of tiles. Details are given here: https://jpylyzer.openpreservation.org/doc/latest/userManual.html#siz-marker.

In this case the calculation yields 12 expected tiles. Jpylyzer reports this as the numberOfTiles property (I uploaded the full Jpylyzer output for this image here).
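The tile count calculation can be sketched as follows. This is a minimal illustration of the formula from the JPEG 2000 Part 1 specification (number of tiles per direction is the ceiling of the reference grid extent, minus the tile offset, divided by the tile size), not Jpylyzer's actual code. The function name `expected_tiles` and the sample dimensions are made up for illustration; they are not the SIZ values of the USGS file.

```python
def expected_tiles(xsiz, ysiz, xtsiz, ytsiz, xtosiz=0, ytosiz=0):
    """Return (tiles_x, tiles_y, total) for the given SIZ geometry.

    xsiz, ysiz     : width and height of the reference grid
    xtsiz, ytsiz   : width and height of one reference tile
    xtosiz, ytosiz : offset of the first tile from the grid origin
    """
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b
    tiles_x = -(-(xsiz - xtosiz) // xtsiz)
    tiles_y = -(-(ysiz - ytosiz) // ytsiz)
    return tiles_x, tiles_y, tiles_x * tiles_y


# Hypothetical geometry, chosen only so that the result is 12 tiles
print(expected_tiles(4000, 3000, 1024, 1024))
```

With these illustrative values, a 4000 x 3000 grid cut into 1024 x 1024 tiles needs 4 columns and 3 rows.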

Each tile is made up of one or more tile-parts. Each tile part starts with a start-of-tilepart marker (SOT), which defines a set of properties described here https://jpylyzer.openpreservation.org/doc/latest/userManual.html#sot-marker.

One of these properties is the tile index (reported as property isot), which defines the tile to which a tile-part belongs. Here's an example of Jpylyzer's output for one tile part:

<sot>
    <lsot>10</lsot>
    <isot>0</isot>
    <psot>1134</psot>
    <tpsot>0</tpsot>
    <tnsot>255</tnsot>
</sot>

For a JP2 with 12 different tiles, you would expect values in the range of 0 to 11. But if you look at Jpylyzer's full output, you'll see only 3 values for isot: 0, 1 and 2. So the tile parts only cover 3 out of the 12 tiles that are part of this image!
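If you want to run this check yourself, something along these lines should work on Jpylyzer's XML output. It is only a sketch: it walks all elements and matches on the element names numberOfTiles and isot rather than on fixed paths, and it strips any XML namespace prefix so it does not depend on how the output was serialised. The helper names and the XML fragment are hand-written for illustration; the fragment is not the real output for the USGS file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def tile_coverage(root):
    """Return (expected, seen, missing) tile indices for a Jpylyzer result."""
    expected = None
    seen = set()
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments / processing instructions
        name = local_name(elem.tag)
        if name == "numberOfTiles" and elem.text:
            expected = int(elem.text)
        elif name == "isot" and elem.text:
            seen.add(int(elem.text))
    missing = sorted(set(range(expected)) - seen) if expected is not None else []
    return expected, sorted(seen), missing


# Hand-written, heavily abbreviated fragment (illustrative only)
SAMPLE = """
<properties>
  <contiguousCodestreamBox>
    <siz><numberOfTiles>12</numberOfTiles></siz>
    <tileParts>
      <tilePart><sot><isot>0</isot></sot></tilePart>
      <tilePart><sot><isot>1</isot></sot></tilePart>
      <tilePart><sot><isot>2</isot></sot></tilePart>
      <tilePart><sot><isot>2</isot></sot></tilePart>
    </tileParts>
  </contiguousCodestreamBox>
</properties>
"""

expected, seen, missing = tile_coverage(ET.fromstring(SAMPLE))
print(expected, seen, missing)
```

The same function can be passed the element returned by `jpylyzer.checkOneFile`, as used at the top of this issue.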

Another red flag is the following error:

 <foundNextTilePartOrEOC>False</foundNextTilePartOrEOC>

This error happens while Jpylyzer is iterating over the tile parts. For each new iteration in this loop, for a structurally valid JP2 only two outcomes are possible:

  1. The byte position at the start of the iteration points to a new tile-part, as indicated by Start-Of-Tilepart marker code (SOT, 0xFF90)
  2. Or, following the final tile-part, the byte position at the start of the iteration points to the end of the codestream (as indicated by the End-Of-Codestream marker, EOC, 0xFFD9).

Anything different from this indicates a structurally malformed or damaged file.
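To make the loop described above concrete, here is a simplified sketch of that kind of iteration. It is not Jpylyzer's implementation. It assumes `data` holds the raw codestream bytes and `offset` points at the first SOT marker, and it relies on the SOT layout from the JPEG 2000 specification: a 2-byte marker followed by Lsot (2 bytes), Isot (2 bytes), Psot (4 bytes), TPsot (1 byte) and TNsot (1 byte), all big-endian, where Psot is the tile-part length counted from the first byte of the SOT marker. The function name is made up.

```python
import struct

SOT = 0xFF90  # Start-Of-Tilepart marker
EOC = 0xFFD9  # End-Of-Codestream marker


def walk_tile_parts(data, offset=0):
    """Follow Psot from tile-part to tile-part.

    Returns (tile_parts, ok). ok is True only if every iteration started
    on an SOT marker and the walk ended on an EOC marker.
    """
    tile_parts = []
    while offset + 2 <= len(data):
        (marker,) = struct.unpack_from(">H", data, offset)
        if marker == EOC:
            return tile_parts, True       # outcome 2: end of codestream
        if marker != SOT or offset + 12 > len(data):
            return tile_parts, False      # neither SOT nor EOC: malformed
        # outcome 1: a new tile-part; read the 10 bytes after the marker
        lsot, isot, psot, tpsot, tnsot = struct.unpack_from(">HHIBB", data, offset + 2)
        tile_parts.append({"isot": isot, "psot": psot, "tpsot": tpsot})
        if psot == 0:
            # Psot == 0 means the tile-part runs up to the EOC marker
            return tile_parts, data[-2:] == b"\xff\xd9"
        if psot < 12:
            return tile_parts, False      # shorter than the SOT segment itself
        offset += psot                    # jump to the next tile-part
    return tile_parts, False              # ran off the end without EOC
```

A return value of `ok == False` corresponds to the situation Jpylyzer flags with foundNextTilePartOrEOC.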

Out of curiosity I opened your JP2 in a hex editor. Towards the end of the file I saw this:

[Image: usgs-sot]

Basically this looks like a sequence of Start-Of-Tilepart markers (marker code highlighted in red), each followed by a Start-Of-Data marker (SOD, 0xFF93). The SOD is supposed to be followed by the tile part's actual bit stream data, but instead there's just a new SOT, and the bit stream data are missing altogether!

As a further test I tried to decode the image with OpenJPEG's opj_decompress tool. This worked, but produced a seemingly endless stream of this warning:

[WARNING] Empty SOT marker detected: Psot=12.
[WARNING] Empty SOT marker detected: Psot=12.
[WARNING] Empty SOT marker detected: Psot=12.
[WARNING] Empty SOT marker detected: Psot=12.
[WARNING] Empty SOT marker detected: Psot=12.
::

This also indicates the presence of empty tile parts.
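For checking a batch of files, the warnings can be counted instead of read by eye. The sketch below is one possible way to do that, assuming opj_decompress is on the PATH and accepts its usual `-i` (input) and `-o` (output) options. Both function names are made up, and stdout and stderr are captured together so the count does not depend on which stream the warnings are written to.

```python
import subprocess

EMPTY_SOT = "Empty SOT marker detected"


def count_empty_sot_warnings(log_text):
    """Count warning lines that report an empty SOT marker."""
    return sum(
        1
        for line in log_text.splitlines()
        if "[WARNING]" in line and EMPTY_SOT in line
    )


def decode_and_count(jp2_path, out_path):
    """Run opj_decompress on one file and count its empty-SOT warnings."""
    result = subprocess.run(
        ["opj_decompress", "-i", jp2_path, "-o", out_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams into one log
        text=True,
    )
    return count_empty_sot_warnings(result.stdout)
```

A non-zero count for a file suggests it has the same empty tile-part problem as the example above.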

Since you mention that other USGS JP2s are also affected, my best guess is that the production workflow they're using has some serious flaws, resulting in missing data and, ultimately, a malformed overall file structure. The fact that this is a 4-channel image that is meant for GIS usage has nothing to do with this, because this doesn't affect the overall file structure.

bitsgalore commented 1 year ago

Small addition - I found this old (2017) thread in an xnview forum, where someone reports the exact same problem:

https://newsgroup.xnview.com/viewtopic.php?t=35877

I also found this:

https://www.sciencebase.gov/catalog/item/58282427e4b01fad870f9744

The image that is linked to on that page results in the same validation errors. So this could mean a lot of affected images!

bitsgalore commented 1 year ago

Closing this issue as this looks like a fault of the USGS images, not Jpylyzer.