Closed dpmayerUSGS closed 2 years ago
David, thank you for this wonderfully detailed Issue statement. I'm sure this will help me figure out the problem quickly.
So pvl.load() first tries to read the whole file as text (an optional encoding can be provided), and if that succeeds, it starts parsing the text. In general, cube files fail this attempt, so pvl.load() makes a second try and evaluates the file byte by byte: as long as it can decode the bytes as UTF characters, it stores that text, and when it hits a byte that doesn't decode as UTF (which is typically where the label ends and the "data" begins in attached-label files), it takes that as a stop sign and tries to parse the text that it has collected.
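To make that fallback concrete, here is a minimal sketch of the byte-by-byte idea. It is not the actual pvl source: the function name collect_decodable_text is made up for illustration, and it assumes UTF-8 as the encoding.

```python
import codecs


def collect_decodable_text(data: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: keep text until the first undecodable byte."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    chars = []
    for i in range(len(data)):
        try:
            # Feed one byte at a time; multi-byte characters are buffered
            # by the decoder until they are complete.
            chars.append(decoder.decode(data[i:i + 1]))
        except UnicodeDecodeError:
            # The "stop sign": this byte cannot be part of valid text.
            break
    return "".join(chars)


label = b"A = 1\r\nEND\r\n"
image_data = b"\xff\x00\x01"  # 0xFF can never start a UTF-8 character
text = collect_decodable_text(label + image_data)
```

Here text ends up holding only the label portion, which is what would then be handed to the parser.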
As you say, there are very likely some bytes causing some shenanigans that I haven't accounted for, and they are gumming up the works somewhere.
Aha! Well, it wasn't a weird character, the opposite, in fact.
So as I mentioned above, in most "attached" labels, we can't just read all the bytes as UTF characters, so we have to go byte-by-byte through the file until we run out of bytes that we can convert to UTF, and then send the "good" ones to be parsed. However, you won the lottery: you found a .IMG file for which all of the bytes in that file actually do convert to UTF characters.
So how is that relevant? Well first, that happens. Then all of this "valid" text gets fed through the parser and lexer. It actually happily reads all of the tokens and gets to the "END" token and cleanly parses it. However, in the PVL spec, there is the capability for a final, trailing comment to be placed after the END token. The pvl library doesn't currently have the ability to really do anything with comments other than safely skip them, but I left some hooks in the code for some glorious future when we could do something with comments.
So after the END token is safely parsed, the pvl library asks the lexer for the "next" token. It turns out that while there are "whitespace" UTF characters after the END token, once the "data" starts--which somehow all converts to UTF characters--there are no whitespace characters. So when the parser asks for the "next" token to see if it is a comment (which it currently won't do anything about), the lexer begins analyzing all of the remaining 132 million valid characters, looking for some whitespace so it can return a token to be parsed (it's more complicated than that, because it also tests for comments and reserved characters, so it isn't as simple as just calling split()). I didn't repeat your test, but either it was still fastidiously lexing characters after four hours or it ran out of memory or something. Anyway, that request for the "next token" after the END does not return in a reasonable amount of time, and that's what you experienced.
The solution is to just remove that request for the "next token" after an END delimiter. If in the glorious future the pvl library does do something fancy with comments, then we'll have to figure out how to deal safely with telling the difference between a valid post-END comment and this wild situation.
I'll work on a patch and get pvl 1.3.2 baked in the next week or so.
In the meantime, you can load the PVL out of this IMG file by doing this:
m = pvl.load("M1232125546LE.IMG", grammar=pvl.grammar.PVLGrammar())
This is because the PVLGrammar() and even the ISISGrammar() interpret the bytes more narrowly, as ASCII characters, instead of the wider UTF characters that the default OmniGrammar() accepts. As a result, they don't read 132 million extra characters: they find nothing after the END token and return quickly.
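The ASCII-versus-UTF difference can be illustrated with plain Python decoding. This is only a sketch of the principle, not how the grammars are implemented in pvl; first_undecodable_offset is a hypothetical helper, and the "pixel" bytes are invented so that they happen to form valid UTF-8.

```python
def first_undecodable_offset(data: bytes, encoding: str):
    """Return the offset of the first byte that fails to decode, else None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start
    return None


label = b"END\r\n"
# Invented image bytes (0xC3 0xA9 0xC3 0xA8 ...) that are valid UTF-8.
pixels = "\u00e9\u00e8".encode("utf-8") * 4
data = label + pixels

ascii_stop = first_undecodable_offset(data, "ascii")  # stops right after the label
utf8_stop = first_undecodable_offset(data, "utf-8")   # never stops
```

With the narrower ASCII reading, the text ends at the label boundary (offset 5 here); with UTF-8, every byte is accepted as text, so the lexer is handed the whole file.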
Thanks for finding this, David!
Fixed in Release 1.3.2. On PyPI now, conda soon.
Describe the bug
I routinely use pvl to load and parse PDS labels from LROC NAC EDRs. In a tiny minority of cases, I've found that pvl.load() will hang indefinitely when trying to read particular files.
To Reproduce
pvl:
pvl never seems to load the label (I waited up to 4 hours in testing, but ~1 minute would be enough)
Here's the traceback:
By contrast, if you run pvl.load() on, say, https://pdsimage2.wr.usgs.gov/Missions/Lunar_Reconnaissance_Orbiter/LROC/EDR/LROLRC_0026/DATA/ESM2/2015365/NAC/M1206236758LE.IMG , the label will load normally.
Expected behavior
If there's something pathologically bad with the labels in a select minority of LROC NAC EDRs, I would expect pvl to recognize the badness and emit an error rather than hanging. Alternatively, if the example product is in fact a valid PDS3 PVL label, I would expect pvl to load it just as it loads other LROC NAC EDR labels.
Your Environment (please complete the following information):
pvl Version 1.3.1
Additional context (Other Things I've Tried)
pvl_validate
I also ran the command-line utility pvl_validate on the example files mentioned above. The program hangs when run on M1232125546LE.IMG, but if I mash Ctrl + c a few times it will print
Contrast with pvl_validate M1206236758LE.IMG, which runs normally and prints
Processing in ISIS
The ISIS program lronac2isis loads both example files normally and converts them to ISIS cubes correctly. The program isn't pedantic about PDS3 PVL, but the fact that it runs normally where pvl.load() gets stuck implies that the labels are minimally viable.
Detaching Labels and Parsing with pvl
I used less to peek at the labels of the example products and noted the reported sizes of the label records. I took this information and dumped only the label portion of the products to new files. Interestingly, the label for the bad file says it contains 2 label records of 2532 bytes each, whereas the other example file says it contains a single label record of 5064 bytes. Thus, the reported total size of both labels is the same. I don't have enough example data to know if this difference has anything to do with the issue I'm reporting, but thought it was worth mentioning.
pvl_validate reports the same thing for both detached labels. Similarly, pvl.load() loads the labels normally when run on the detached labels. So there doesn't seem to be an issue when the labels are detached.
How does pvl.load() know where to stop reading when the input file has a label attached to an image? Perhaps there's a weird character at the transition between the label portion and the image portion of M1232125546LE.IMG that's causing pvl to hang?
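For anyone who wants to repeat the label-detaching step, a rough sketch follows. It assumes the label occupies the first (record size × number of label records) bytes of the file, which matches the 2 × 2532 = 5064 figures above; read_label_bytes is a hypothetical helper, and the in-memory stream stands in for a real .IMG file opened in binary mode.

```python
import io


def read_label_bytes(stream, record_bytes: int, label_records: int) -> bytes:
    """Hypothetical helper: read only the attached label from a binary stream."""
    return stream.read(record_bytes * label_records)


# Stand-in for a product with 2 label records of 2532 bytes each,
# followed by some image data.
fake_label = b"L" * (2532 * 2)
fake_image = b"\x00\x01\x02\x03" * 16
stream = io.BytesIO(fake_label + fake_image)

label_bytes = read_label_bytes(stream, record_bytes=2532, label_records=2)
```

On a real product this would be called with open("M1232125546LE.IMG", "rb") as the stream. The resulting bytes can be written to a new file, or decoded and passed to pvl.loads(), which is consistent with the detached labels loading normally.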