Add plaintext description field to Package.

nuclearsandwich commented 2 years ago

This patch adds a _get_node_text function to the package module's xml processing toolset: a basic plaintext renderer for xhtml descriptions which also extracts text from formatting tags such as <em> or <a> although it does not render or preserve the href attribute on a tags in any way. All literal whitespace is compressed into a single ` so the only way to insert a newline into the plaintext description is by using a
` tag.

I used this to add a new plaintext_description field to a parsed package where I take the extra step of trimming whitespace around newlines even though they are technically valid it doesn't match the expectations established by first writing an xml description and then hand-rendering it as plaintext.

Motivation and background

There are several cases where a package.xml description is meant to be "rendered" as plain text. According to REP-149 the description "may contain XHTML" but depending on where the description is used "XML tags and multiple whitespaces might be stripped".

The existing description field does not do any sort of whitespace manipulation, nor does it actually strip XHTML tags. This results in extremely odd formatting when used in plaintext contexts, such as a PKG-INFO file within a python sdist.

Consider the description for actionlib

  <description>
    The actionlib stack provides a standardized interface for
    interfacing with preemptable tasks. Examples of this include moving
    the base to a target location, performing a laser scan and returning
    the resulting point cloud, detecting the handle of a door, etc.
  </description>

This is currently parsed as

'The actionlib stack provides a standardized interface for\n    interfacing with preemptable tasks. Examples of this include moving\n    the base to a target location, performing a laser scan and returning\n    the resulting point cloud, detecting the handle of a door, etc.'

All interior whitespace is retained which results in an oddly formatted description since the line breaks do not have any inherent semantic value but appear to be formatted just to adhere to 72-character line widths.

When discussing #316 and #326 both of which are proposals to manipulate the description in order to conform to changes in setuptools each proposal assumed that lines would be "sensibly formatted" to start with and that taking either the first "line" or replacing line breaks with single spaces would retain a sensibly formatted, if truncated description. However if we took only the first line of the above description (as in #326) it would yield:

The actionlib stack provides a standardized interface for

If we we compressed newlines into spaces and then took the first 200 characters (as in #316) it would yield:

The actionlib stack provides a standardized interface for     interfacing with preemptable tasks. Examples of this include moving     the base to a target location, performing a laser scan and retu...

While trying to decide which of these is the "lesser of two unfortunates" I got it into my head that we should really be treating these like an HTML renderer does and treating any combination of whitespace as a single space. I turned to the XML specification to back me up on this. From section 2.10 on white space handling

An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.

Which prompts me to conclude that catkin_pkg is our application and that we should be deciding how to treat whitespace. The description in REP-149 specifically states that whitespace may be stripped in certain situations and I think that a plaintext "rendering" of an xhtml description field like actionlib's should result in the following string:

The actionlib stack provides a standardized interface for interfacing with preemptable tasks. Examples of this include moving the base to a target location, performing a laser scan and returning the resulting point cloud, detecting the handle of a door, etc.

nuclearsandwich commented 2 years ago

Thanks for the initial review @cottsay I ought to've clarified that I planned to do some iterating in this PR rather than just merge an as-yet-unused test fixture. I should have opened this PR in draft and have now flipped it to one.

nuclearsandwich commented 2 years ago

The latest push gets us closer but still not home. There are some small whitespace discrepancies between what I've got and our intended rendering of the text which I need to figure out if I can handle in a reasonable way or if we have to make concessions in our expectations. Incidentally, I moved the processing of the plaintext_description into the xml processing phase since it's much easier to do this on an extant DOM rather than trying to do something at runtime. Since I'm storing this in a plaintext_description attribute I think we can probably cut the getter function and rewrite the test to just check the attribute.

    """
E   assert ('A package with an XHTML description. \n'\n '\n'\n ' This package contains several xhtml tags which are, according to REP-149, '\n 'meant to be properly handled but "XML tags and multiple whitespaces" may be '\n 'stripped in some situations. Another sentence in this quasi-paragraph will '\n 'continue to appear on the same plaintext line because there was no <br/> tag '\n 'to indicate a newline should appear in the output text. \n'\n ' This text should appear on a subsequent line.') == ('A package with an XHTML description.\n'\n '\n'\n 'This package contains several xhtml tags which are, according to REP-149, '\n 'meant to be properly handled but "XML tags and multiple whitespaces" may be '\n 'stripped in some situations. Another sentence in this quasi-paragraph will '\n 'continue to appear on the same plaintext line because there was no <br/> tag '\n 'to indicate a newline should appear in the output text.\n'\n 'This text should appear on a subsequent line.')
E     - A package with an XHTML description.
E     + A package with an XHTML description.
E     ?                                     +
E
E     - This package contains several xhtml tags which are, according to REP-149, meant to be properly handled but "XML tags and multiple whitespaces" may be stripped in some situations. Another sentence in this quasi-paragraph will continue to appear on the same plaintext line because there was no <br/> tag to indicate a newline should appear in the output text.
E     +  This package contains several xhtml tags which are, according to REP-149, meant to be properly handled but "XML tags and multiple whitespaces" may be stripped in some situations. Another sentence in this quasi-paragraph will continue to appear on the same plaintext line because there was no <br/> tag to indicate a newline should appear in the output text.
E     ? +                                                                                                                                                                                                                                                                                                                                                                     +
E     - This text should appear on a subsequent line.
E     +  This text should appear on a subsequent line.
E     ? +

test/test_package.py:388: AssertionError

nuclearsandwich commented 2 years ago

As much as it bugs me, the result of the current processing is actually, more or less, what an html plaintext rendering would produce. I think I do want to try and strip spaces around newlines but I may not keep it unless I can find a way to prove it doesn't misbehave.

nuclearsandwich commented 2 years ago

There's a level that I don't think that we should do a character by character test for all the parsed outputs. But maybe just a few checks for specific things that are properly escaped etc would make sense. And we can trust that the specific behaviors are well defined by the parser that we're relying upon and we can just spec that we'll follow the parser.

There's a level that I don't think that we should do a character by character test for all the parsed outputs. But maybe just a few checks for specific things that are properly escaped etc would make sense.

The description I put in the test is meant to be a sort of "catchall canary" for all the stuff I think we should care about such as "flattening" text values from tags, compressing whitespace according to html rendering convention, and honoring <br> tags as newlines in the result.

I expect this description to grow into an arcane horror as time goes on. Not the easiest thing to test individual facets but perhaps easier than separate fixtures for each element (and testing them in combination has its own benefit.)

And we can trust that the specific behaviors are well defined by the parser that we're relying upon and we can just spec that we'll follow the parser.

XML itself makes no assertions about whitespace except that it should be handed over intact to the application which may decide what to do with it and in general should not be expected to be preserved.

From the XML specification section 2.10 on white space handling:

An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.

In this case, catkin_pkg is the application and should decide what to do about whitespace. The existing description is being left as-is, I'm not entirely certain if the way the description is stored whether or not it is usable as an xml fragment so we may be back later to add an xml_description field. index.ros.org doesn't use catkin_pkg at all and is using a generic XML parser to extract the description and render it as html so that isn't a reliable indicator that the current description is correct in that case although the years without complaint means at least no one is stressing out about it.

This also means that catkin_pkg should determine how best to represent the description in plaintext contexts and I think this is a good first try. It'd be interesting to do other things like render emphasis or links using markdown style but that rapidly moves into territory that pushes the boundaries of "plain text" and leaves room for difference of opinion vs just extracting text content recursively from child nodes and doing something "sensible" with whitespace.

ros-infrastructure / catkin_pkg

Add plaintext description field to Package. #329

Motivation and background