planetarypy / pvl

Python implementation of PVL (Parameter Value Language)
BSD 3-Clause "New" or "Revised" License
19 stars 19 forks source link

ISISEncoder should change the default delimiter to `\n` #106

Closed acpaquette closed 1 year ago

acpaquette commented 1 year ago

Describe the bug When PVL is written out using the ISIS encoder, the end_delimiter defaults to false. This can create problems when your writing out ISIS compliant headers followed by other data. See #98

To Reproduce Again see #98

Expected behavior When you output pvl data using the ISIS encoder, the end_delimiter should be enabled by default and the delimiter should be set to \n to be consistent with ISIS.

Your Environment (please complete the following information):

Additional context This issue is related to:

98 and This ISIS issue

rbeyer commented 1 year ago

I think maybe there is a misunderstanding about the nature of #98, and the meaning of the end_delimiter boolean parameter in the context of the pvl library and the PVL specification.

The end_delimiter boolean parameter to a pvl encoder object signals whether every PVL statement in the output PVL-text should end with the semi-colon character or not. So the "end" in end_delimiter is about every PVL statement within a module, and not just about the end of the PVL-text string.

The PVL specification does not allow for any character to replace the semi-colon in this role at the end of every PVL statement. Either the semi-colon is present at the end of every statement (end_delimiter=True) or not (end_delimiter=False). And that's why end_delimiter is implemented as a boolean and doesn't just take any old string.

The way that the pvl module has been written does technically allow one to change that semi-colon to anything by forcibly changing the first element of the grammar's delimiters list. However, no grammars in the pvl library override that semi-colon, and changing it to something else may have unintended consequences for encoding by the pvl library and consumption by anything else, but you can certainly test that.

I fully appreciate that the PVL-text syntax that ISIS programs consumes does not particularly conform to the PVL, ODL, or PDS3 standards. However, I would be very hesitant to change the default delimiter for the ISISGrammar, as I suspect it would result in additional blank lines within the PVL-text emitted, and may have other consequences beyond this use-case.

If you really only need an additional line feed character at the very end of the PVL-text, you could instead forcibly change the first element of the ISISGrammar's end_statements list from "END" to "END\n". However, that seems pretty heavy-handed, when you can just take the string already returned by pvl.dumps(), and just add the line feed character to that for the use case when you need a line feed character there.

acpaquette commented 1 year ago

This is fair, ISIS just adds a newline character as the end_delimiter in all cases which is partly why the ISIS cube labels have new line after every new object or group in the PVL header. This is also why I original posted the above, I do agree that we should not make the above proposed change/bug. However, it seems like we are missing part of the PVL spec regardless of the end_delimiter. As it states in the spec:

"The character immediately following the END statement must be either an optional spacing character or format effector, such as a space, line feed, carriage return, etc."

This seems to be absent from the PVLEncoder and subsequently the ISISEncoder. Taking a look, it seems like this line that is joining the lines of PVL text doesn't add a newline character to the final line. Experimenting in a jupyter notebook confirms the behavior. Is this valid to the spec that the PVLEncoder is following?

Screen Shot 2023-02-23 at 12 47 05 PM

I guess I am confused as to whose responsibility this should be. The PVL header produced using the ISISEncoder is producing PVL that cannot be read back through the pvl package without the change to add the \0 character to the list of reserved characters in the ISISEncoder.

Consequently, it seems like any PVL produced using the PVLEncoder with the end_delimiter set to false would also produce the same behavior. Where the load would fail if the PVL header was followed by other data. Is enforcing that the PVLEncoder add a newline character after the join seem like a reasonable fix and within the spec?

rbeyer commented 1 year ago

There are many reasons why PVL never caught on. It's ... not good. Unfortunately, we have to live with it in planetary, and with its many mutations over the last two plus decades. You say, "it seems like we are missing part of the PVL spec regardless of the end_delimiter." Yeah, it is confusing, so forgive me for being extra-pendantic here. When I say "the PVL specification" I am referring to the CCSDS Blue Book, which I have linked from the Standards & Specifications page of the pvl module's documentation. And I'll quote its sections below.

The PVL "END STATEMENT" is defined under section 2.5, and says the following: "The End Statement is delimited by one of the following: a semicolon; the first White Space Character; the end of a Comment; or the end of the provided octet space" (which is just a 90s way to say byte stream, I guess). I couldn't find the text you quoted in the CCSDS book. This quote from the spec says something different than your statement. Firstly, if there is a character after the END statement, it must be white space (defined as your statement indicates), but can also be a valid PVL comment (delimited by /* and */ C-style comments) or a semi-colon. However, importantly, the spec also says that there doesn't need to be a character after END, the "octet space" can just end, and that is also a valid "END STATEMENT." What does that mean for the pvl library? Well, this Python pvl library does not yet handle writing out comments, so we can ignore that option. This leaves us with three completely valid ways to end a stream of PVL-text: (1) with the "D" in "END" being "the last thing", (2) with any kind of whitespace after an "END", (3) or with "END;".

So to answer your very specific question that you tested, when you call pvl.dumps(some_module) which is truly just a user-facing simplification of running pvl.encoder.encode(some_module), it returns a Python str. The end of that string is the end of the "octet space", and so having that string end with just "END" absolutely conforms to the PVL spec. If, instead, you call pvl.dump(some_module) the file that gets written also has the "D" in "END" as its last character, which is again the end of the "octet space" and is also valid. If you instantiate a pvl encoder and pass end_delimiter=True then the pvl module will place semicolons at the end of all of the statements in the PVL-text, and also after the "D" in "END" in the string or file that it returns.

So to your question of whose responsibility this should be, I don't think there is any way for the pvl module to know whether a user is going to add additional information to some data construct after the valid PVL-text that it emits. The library provides the ability for a user to request the semi-colon endings on the returned PVL-text. It seems to me that if a user of pvl is going to add additional data in a file or stream of data after the PVL-text, then it is the user's responsibility to either emit strictly conformant PVL with semi-colon terminated statements or manually add the white space that they want after the PVL-text before additional data is added to the file.

That's my general position on the PVL "dialects" that have a standards document (PVL, ODL, and PDS3), but the entire existence of the "ISIS" dialect in the Python pvl library is meant to handle parsing of potentially wonky ISIS-emitted PVL-text that doesn't conform to the other dialects, and to produce PVL-text that can be consumed by ISIS programs. The problem is that there is no consistent specification, so making choices about the "defaults" for the ISIS dialect is fraught with danger, because you might solve one problem, but generate others.

Issue #98 was about the pvl library's consumption of ISIS-emitted PVL text, and the solution was to make the parser also accept the ASCII NULL character as a valid termination character if encountered after an "END" but it does not address emitting PVL-text for consumption by ISIS, which is what this Issue is about.

So, we could certainly change the pvl ISISEncoder object such that PVL-text returned by encode() (and thus also dump() and dumps()) always ends with a space or a line feed (\n). You're the ISIS developer: if that will always be superior to not doing it, then we should absolutely make that change. However, if you're not sure about whether that might break some ISIS-consumption, then maybe it is better to leave it like it is, and let users add the whitespace they want/need to the returned PVL-text when they are going to add additional, non-PVL data to it. I'm honestly not sure.

acpaquette commented 1 year ago

I pretty much agree with this. The one thing I would like to dispel is that the PVL coming out of ISIS is not the issue here. It is the PVL coming out of the pvl packages dump function using the ISIS encoder. See below:

encoder = pvl.encoder.ISISEncoder(end_delimiter=False)

# Create the PVL to write

pvl.dumps(pvl_to_write, encoder=encoder)

If the output pvl is used as a header in a file with other data like we were seeing it will produce the error encountered in #98. I think the reasoning behind manually adding a \n character makes sense to me and is a good solution to add to plio.

rbeyer commented 1 year ago

I'm not sure you need to dispel anything. You say:

The one thing I would like to dispel is that the PVL coming out of ISIS is not the issue here. It is the PVL coming out of the pvl packages dump function using the ISIS encoder.

And earlier I said:

... it [referring to Issue #98] does not address emitting PVL-text for consumption by ISIS, which is what this Issue is about.

So we agree that this issue is about the PVL-text produced by the pvl library.