planetarypy / pvl

Python implementation of PVL (Parameter Value Language)
BSD 3-Clause "New" or "Revised" License

pvl.load() Hangs Indefinitely on Some PDS3 Files with Attached Labels #104

Closed dpmayerUSGS closed 2 years ago

dpmayerUSGS commented 2 years ago

Describe the bug I routinely use pvl to load and parse PDS labels from LROC NAC EDRs. In a tiny minority of cases, I've found that pvl.load() will hang indefinitely when trying to read particular files.

To Reproduce

  1. Download an affected LROC NAC EDR: https://pdsimage2.wr.usgs.gov/Missions/Lunar_Reconnaissance_Orbiter/LROC/EDR/LROLRC_0029/DATA/ESM3/2016300/NAC/M1232125546LE.IMG
  2. Attempt to load the label with pvl:
    import pvl
    pvl.load('M1232125546LE.IMG')
  3. Wait a little while and observe that pvl never seems to load the label (I waited up to 4 hours in testing, but ~1 minute would be enough)
  4. Send a keyboard interrupt (Ctrl + c)

Here's the traceback:

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/__init__.py", line 70, in load
    return loads(
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/__init__.py", line 213, in loads
    return parser.parse(s)
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/parser.py", line 845, in parse
    return super().parse(nodash)
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/parser.py", line 207, in parse
    module = self.parse_module(tokens)
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/parser.py", line 256, in parse_module
    parsed = p(tokens)
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/parser.py", line 500, in parse_end_statement
    t = next(tokens)
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/lexer.py", line 406, in lexer
    tok = Token(lexeme, grammar=g, decoder=d, pos=firstpos(lexeme, i))
  File "/home/pds2010mgr/anaconda3/envs/PDS-Services/lib/python3.9/site-packages/pvl/token.py", line 31, in __new__
    return str.__new__(cls, content)
KeyboardInterrupt
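As a stopgap while the hang exists, a script can wrap the call in a signal-based timeout so a pathological file raises an error instead of blocking forever. This is not part of pvl; the `load_with_timeout` wrapper and the 30-second default are my own illustrative names, and SIGALRM is Unix-only.

```python
# Stopgap sketch (not part of pvl): wrap a possibly-hanging call in a
# Unix signal-based timeout so a bad file raises instead of hanging.
import signal


def load_with_timeout(func, *args, seconds=30, **kwargs):
    """Run func(*args, **kwargs), raising TimeoutError after `seconds`."""
    def _alarm(signum, frame):
        raise TimeoutError(f"call did not finish within {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)                       # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)


# Hypothetical usage, assuming pvl is installed:
#   label = load_with_timeout(pvl.load, "M1232125546LE.IMG", seconds=60)
```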

By contrast, if you run pvl.load() on say, https://pdsimage2.wr.usgs.gov/Missions/Lunar_Reconnaissance_Orbiter/LROC/EDR/LROLRC_0026/DATA/ESM2/2015365/NAC/M1206236758LE.IMG , the label will load normally.

Expected behavior If there's something pathologically bad with the labels in a select minority of LROC NAC EDRs, I would expect pvl to recognize the badness and emit an error rather than hanging. Alternatively, if the example product is in fact a valid PDS3 PVL label, I would expect pvl to load it just as it loads other LROC NAC EDR labels.

Additional context (Other Things I've Tried) I also ran the command-line utility pvl_validate on the example files mentioned above. The program hangs when run on M1232125546LE.IMG, but if I mash Ctrl + c a few times it will print

^C^C^C^CPDS3 | does NOT load |
ODL  | does NOT load |
PVL  |     Loads     |     Encodes
ISIS | does NOT load |
Omni | does NOT load |

Contrast with pvl_validate M1206236758LE.IMG, which runs normally and prints

PDS3 |     Loads     | does NOT encode
ODL  |     Loads     | does NOT encode
PVL  |     Loads     |     Encodes
ISIS |     Loads     |     Encodes
Omni |     Loads     |     Encodes

Processing in ISIS The ISIS program lronac2isis loads both example files normally and converts them to ISIS cubes correctly. The program isn't pedantic about PDS3 PVL, but the fact that it runs normally where pvl.load() gets stuck implies that the labels are minimally viable.

Detaching Labels and Parsing with pvl I used less to peek at the labels of the example products and noted the reported sizes of the label records. I took this information and dumped only the label portion of the products to new files. Interestingly, the label for the bad file says it contains 2 label records of 2532 bytes each, whereas the other example file says it contains a single label record of 5064 bytes. Thus, the reported sizes of both labels are the same. I don't have enough example data to know whether this difference has anything to do with the issue I'm reporting, but thought it was worth mentioning.

# Bad file
$ dd bs=2532 count=2 if=M1232125546LE.IMG of=M1232125546LE.LBL
$ pvl_validate M1232125546LE.LBL

> PDS3 |     Loads     | does NOT encode
> ODL  |     Loads     | does NOT encode
> PVL  |     Loads     |     Encodes
> ISIS |     Loads     |     Encodes
> Omni |     Loads     |     Encodes

# Normal file
$ dd bs=5064 count=1 if=M1206236758LE.IMG of=M1206236758LE.LBL
$ pvl_validate M1206236758LE.LBL

> PDS3 |     Loads     | does NOT encode
> ODL  |     Loads     | does NOT encode
> PVL  |     Loads     |     Encodes
> ISIS |     Loads     |     Encodes
> Omni |     Loads     |     Encodes

pvl_validate reports the same thing for both detached labels. Similarly, pvl.load() loads the labels normally when run on the detached labels. So there doesn't seem to be an issue when the labels are detached.
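The dd commands above can also be expressed in Python, which makes the record arithmetic explicit. `detach_label` is a hypothetical helper written for this issue, not part of pvl; the byte counts come from the two example products.

```python
# Python equivalent of the dd commands above: copy the first
# record_bytes * label_records bytes of the product to a new file.
def detach_label(img_path, lbl_path, record_bytes, label_records):
    """Write only the label portion of a PDS3 product to lbl_path."""
    with open(img_path, "rb") as src, open(lbl_path, "wb") as dst:
        dst.write(src.read(record_bytes * label_records))


# Bad file: 2 label records of 2532 bytes; normal file: 1 record of 5064.
# detach_label("M1232125546LE.IMG", "M1232125546LE.LBL", 2532, 2)
# detach_label("M1206236758LE.IMG", "M1206236758LE.LBL", 5064, 1)
```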

How does pvl.load() know where to stop reading when the input file has a label attached to an image? Perhaps there's a weird character at the transition between the label portion and the image portion of M1232125546LE.IMG that's causing pvl to hang?

rbeyer commented 2 years ago

David, thank you for this wonderfully detailed Issue statement. I'm sure this will help me figure out the problem quickly.

So pvl.load() first tries to read the whole file as text (an optional encoding can be provided), and if that succeeds, it starts parsing. In general, cube files fail this attempt, so pvl.load() makes a second pass and evaluates the file byte by byte: as long as it can decode the bytes as UTF characters, it stores that text, and when it hits a byte that doesn't decode as a UTF character (which is typically where the label ends and the "data" begins in attached-label files) it takes that as a stop sign and then tries to parse the text it has collected.
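That byte-by-byte fallback can be sketched roughly as follows. This is illustrative only, not pvl's actual implementation, and `text_prefix` is a made-up name:

```python
# Rough sketch of the fallback described above: decode leading bytes as
# text, stopping at the first byte sequence that isn't valid, and return
# whatever decoded cleanly as the label candidate.
def text_prefix(raw: bytes, encoding: str = "utf-8") -> str:
    """Return the longest leading run of raw that decodes as text."""
    chars = []
    i = 0
    while i < len(raw):
        # Try 1- to 4-byte sequences so multi-byte UTF-8 still decodes.
        for width in range(1, 5):
            try:
                chars.append(raw[i:i + width].decode(encoding))
                i += width
                break
            except UnicodeDecodeError:
                continue
        else:
            break  # nothing decoded: treat this as the start of the data
    return "".join(chars)
```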

As you say, there are very likely some bytes that I haven't accounted for that are gumming up the works somewhere.

rbeyer commented 2 years ago

Aha! Well, it wasn't a weird character; quite the opposite, in fact.

So as I mentioned above, in most "attached" labels, we can't just read all the bytes as UTF characters, so we have to go byte-by-byte through the file until we run out of bytes that we can convert to UTF, and then send the "good" ones to be parsed. However, you won the lottery: you found a .IMG file for which all of the bytes in that file actually do convert to UTF characters.

So how is that relevant? Well first, that happens. Then all of this "valid" text gets fed through the parser and lexer. It actually happily reads all of the tokens and gets to the "END" token and cleanly parses it. However, in the PVL spec, there is the capability for a final, trailing comment to be placed after the END token. The pvl library doesn't currently have the ability to really do anything with comments other than safely skip them, but I left some hooks in the code for some glorious future when we could do something with comments.

So after the END token is safely parsed, the pvl library asks the lexer for the "next" token. It turns out that while there are "whitespace" UTF characters after the END token, once the "data" starts--which somehow all converts to UTF characters--there are no whitespace characters. So when the parser asks for the "next" token to see if it is a comment (which it currently won't do anything about), the lexer begins analyzing all of the remaining 132 million valid characters to see if it can find some whitespace (it's more complicated than that, because it also tests for comments and reserved characters, so it isn't as simple as just calling split()) so it can return a token to be parsed. I didn't run your test, but either it was still fastidiously lexing characters after four hours or it ran out of memory or something. Anyway, that request for the "next token" after the END does not return in a reasonable amount of time, and that's what you experienced.

The solution is to just remove that request for the "next token" after an END delimiter. If in the glorious future the pvl library does do something fancy with comments, then we'll have to figure out how to deal safely with telling the difference between a valid post-END comment and this wild situation.
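The shape of that fix can be sketched as follows. The function name mirrors `parse_end_statement` from the traceback, but this is a simplified illustration, not the actual patch:

```python
# Simplified sketch of the fix: consume the END token and stop, rather
# than probing the lexer for an optional trailing comment.
def parse_end_statement(tokens):
    """Consume an END token and return; don't look past it."""
    t = next(tokens)
    if t.upper() != "END":
        raise ValueError(f"expected an END statement, got {t!r}")
    # The pre-1.3.2 logic went on to call next(tokens) here, looking for
    # a possible post-END comment.  On this file that forces the lexer to
    # scan ~132 million decodable characters for a token boundary, so the
    # fix is simply to return once END is seen.
```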

I'll work on a patch and get pvl 1.3.2 baked in the next week or so.

In the meantime, you can load the PVL out of this IMG file by doing this:

m = pvl.load("M1232125546LE.IMG", grammar=pvl.grammar.PVLGrammar())

This is because the PVLGrammar() and even the ISISGrammar() more narrowly interpret the bytes as ASCII characters instead of the wider UTF characters that the default OmniGrammar() accepts, so they don't read 132 million extra characters; they just find nothing after the END token and return quickly.
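The difference is easy to demonstrate: an ASCII-limited character set rejects any byte >= 0x80 immediately, so even image data whose bytes all happen to decode as UTF-8 still hands the lexer an early stop sign. `ascii_prefix_len` is an illustrative helper, not a pvl API:

```python
# Illustration of why an ASCII-only grammar stops early: find the index
# of the first byte outside 7-bit ASCII, which is where such a grammar
# would stop collecting label text.
def ascii_prefix_len(raw: bytes) -> int:
    """Index of the first byte outside 7-bit ASCII (or len(raw))."""
    for i, b in enumerate(raw):
        if b > 0x7F:
            return i
    return len(raw)


# b"\xc3\xa9" decodes as the UTF-8 character "é", so a UTF-accepting
# grammar keeps reading; an ASCII grammar stops right after "END\r\n".
```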

Thanks for finding this, David!

rbeyer commented 2 years ago

Fixed in Release 1.3.2. On PyPI now, conda soon.