rfc-format / draft-iab-xml2rfc-v3-bis

-bis document for the RFC format xml2rfc v3 draft

clarify use of <artwork> vs <sourcecode> #195

Open reschke opened 3 years ago

reschke commented 3 years ago

It would be good to have more guidance about the distinction between these, and what categories they apply to.

(Currently their formatting is roughly the same, and I believe we may want to relax the try-to-keep-on-one-page rule for <sourcecode> as opposed to <artwork>.)

In particular, what element should be used for...:

Note that both elements have a "type" attribute. Is the space of values for these really separate? What would it mean for "x" to be defined both as an artwork and as a sourcecode type?

cabo commented 3 years ago

The namespaces for the types need to be the same, and we should promote content-type (media-type-name plus possibly parameters) for formats that have them.

More on artwork vs. source-code in https://mailarchive.ietf.org/arch/msg/rfc-markdown/-TMLuKxSopzSbB3z90ipkPE4AKU

reschke commented 3 years ago

... and we should promote content-type (media-type-name plus possibly parameters) for formats that have them...

:-) ... restoring what RFC 7749 said about it (https://greenbytes.de/tech/webdav/rfc7749.html#element.artwork.attribute.type)

levkowetz commented 3 years ago

The namespaces for the types need to be the same, and we should promote content-type (media-type-name plus possibly parameters) for formats that have them.

The type attributes of the two elements have completely different meanings, and the permitted values for <artwork> are very constrained (because of the meaning of the attribute, and how it affects the code to deal with the artwork), while the acceptable type values for <sourcecode> aren't constrained at all by tooling, and should not be. I don't see how it makes sense to force them into the same namespace.

reschke commented 3 years ago

because of the meaning of the attribute, and how it affects the code to deal with the artwork

Could you please elaborate a bit on that? (this needs to be understood and discussed...)

FWIW, as long as there's a gray area whether to use <artwork> or <sourcecode>, treating the space of types as separate is really asking for trouble.

jrlevine commented 3 years ago

When you say same namespace, does that mean that the types for the two are different but the names can't overlap, or that you can use any defined type for either? The former seems obvious, the second unworkable.

reschke commented 3 years ago

Mainly the former. Given the fact that the distinction between the uses is not totally clear (see above), overlapping names for different things would be bad. That does not imply that any given type needs to be allowed in both elements.

(this is similar to the names of HTTP transfer codings and content codings, see https://greenbytes.de/tech/webdav/rfc7231.html#rfc.section.8.4.1.p.2)
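
One way to picture the "shared namespace, but not every type on every element" reading is a single table in which each type name is defined once and lists the elements it may appear on. The sketch below is only an illustration under that assumption: `SHARED_TYPES` and `type_allowed` are hypothetical names, and the entries are examples rather than a proposed registry.

```python
# Hypothetical shared registry: each type name has exactly one definition
# and records which elements it may be used on. Entries are illustrative
# only and are not taken from any normative list.
SHARED_TYPES = {
    "abnf": {"sourcecode"},
    "svg": {"artwork"},
    "ascii-art": {"artwork"},
    "hex-dump": {"artwork", "sourcecode"},
}


def type_allowed(element: str, type_name: str) -> bool:
    """Return True if type_name is registered and permitted on element."""
    return element in SHARED_TYPES.get(type_name, set())
```

Because a name can only be a key once, "x" cannot mean two different things, while a given type may still be limited to one of the two elements.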

reschke commented 3 years ago

It's interesting to compare the initial list of types (https://greenbytes.de/tech/webdav/rfc7991.html#element.sourcecode.attribute.type) with https://www.rfc-editor.org/materials/sourcecode-types.txt.

Observations:

reschke commented 3 years ago

Survey of types used by the RFC Production Center: https://gist.github.com/reschke/28318b8499746d211d9cfcfed4149af1

Some of the uses, notably with empty type attribute, are really scary. For instance: https://www.rfc-editor.org/rfc/rfc8783.html#section-3.4, where the sourcecode element essentially carries a definition list.

cabo commented 3 years ago

On 2021-02-19, at 17:12, Julian Reschke notifications@github.com wrote:

• there is zero documentation of what a type name is referring to
• there seem to be entries where the use of <sourcecode> appears rather far-fetched ("http-message"? "test-vectors"?)

Again, the sole difference between the elements is whether the content displayed is for human consumption only or for machine consumption. This is intent, somewhat orthogonal to the type of the content.

Clearly, there is an exception for “svg”, which is text (or element content!?) for machine consumption by the tooling, not by the machines set up by the user of the RFC.

test-vectors is used in RFC 8696, RFC 8734, RFC 8891. This appears to be a weird mixture of hexdumps, JSON-like code, and PEM. It would need type information to enable machine processing.

hex-dump (which would be type information) was only ever used for base64 content (in RFC 8688) before we started using it (for annotated hex dumps) in RFC 8949. These are machine-processable, but mainly intended for human consumption, so at the time we opted for artwork.

I don’t think we can claim there is a system to this yet.

Grüße, Carsten

reschke commented 3 years ago

Again, the sole difference between the elements is whether the content displayed is for human consumption only or for machine consumption.

That's your take, not backed by the spec.

Another problem is that this implies that it's always clear what is for "machine consumption". You said yourself that HTTP message examples are not, yet there is scripting that validates things labeled "http-message".

cabo commented 3 years ago

On 2021-02-19, at 18:01, Julian Reschke notifications@github.com wrote:

Again, the sole difference between the elements is whether the content displayed is for human consumption only or for machine consumption.

That's your take, not backed by the spec.

Well, I’m trying to apply common meanings (e.g., [1]) to the otherwise undefined words used in the spec.

Another problem is that this implies that it's always clear what is for "machine consumption". You said yourself that HTTP message examples are not, yet there is scripting that validates things labeled "http-message".

I actually have had scripts that validate (or produce!) the English text in the sections. Machine usage in the production process is not what I meant (the whole XML file is source code!).

If there is an intention that the user of the spec (as opposed to its author or people reviewing it in the adoption process) be able to perform a copy-paste (or a more fancy xpath extraction) and use the result as machine-readable input, it’s source code.

Grüße, Carsten

[1]: https://en.wikipedia.org/wiki/Source_code: In computing, source code is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source code. The source code is often transformed by an assembler or compiler into binary machine code that can be executed by the computer. The machine code might then be stored for execution at a later time. Alternatively, source code may be interpreted and thus immediately executed.

reschke commented 3 years ago

FWIW, RFC 8949 uses hex-dump, but on artwork, not sourcecode.

reschke commented 3 years ago

If there is an intention that the user of the spec (as opposed to its author or people reviewing it in the adoption process) be able to perform a copy-paste (or a more fancy xpath extraction) and use the result as machine-readable input, it’s source code.

And "pseudocode" falls into that category...?

cabo commented 3 years ago

On 2021-02-19, at 18:12, Julian Reschke notifications@github.com wrote:

If there is an intention that the user of the spec (as opposed to its author or people reviewing it in the adoption process) be able to perform a copy-paste (or a more fancy xpath extraction) and use the result as machine-readable input, it’s source code.

And "pseudocode" falls into that category...?

Well, anything with “pseudo” in its name has problems neatly falling into categories :-)

(I have written pseudocode before that is close enough to common programming languages that it is a small matter of massaging to make it machine processable. The intent may very well be to solve the hard problems of coding and leave the distracting programming language ceremony to the users. The code at the top of page 4 of RFC 7396 is almost, but not exactly, Python.)

Grüße, Carsten

reschke commented 3 years ago

And then there's the problem of mislabeled <sourcecode>, maybe because it doesn't have any documentation. Example: https://www.rfc-editor.org/rfc/rfc8982.html#section-a.1

reschke commented 3 years ago

Or cases where name and type have been mixed up:

          <sourcecode name="http-message" type=""><![CDATA[
  HTTP/1.1 400 Bad Request
  Content-Language: en-US
  Content-Type: application/json

  {
    "err": "invalid_key",
    "description": "Key ID 12345 has been revoked."
  }
]]></sourcecode>

mnot commented 3 years ago

@reschke what's that latter example from?

reschke commented 3 years ago

https://www.rfc-editor.org/rfc/rfc8935.html#section-2.3

Looking at that, another issue comes to mind: people indent sourcecode with leading whitespace, but for some "languages", that makes the "code" actually incorrect (like here).

mnot commented 3 years ago

...and they could have just used RFC7807...

cabo commented 3 years ago

On 24. Feb 2021, at 06:10, Julian Reschke notifications@github.com wrote:

https://www.rfc-editor.org/rfc/rfc8935.html#section-2.3 Looking at that, another issue comes to mind: people indent sourcecode with leading whitespace, but for some "languages", that makes the "code" actually incorrect (like here).

Care to elucidate? (A.k.a., I don’t get it.)

Of course, there is the RFC 7386/7396 disaster to remind us that indentation must be preserved properly. (There never should be an ASCII HT, “TAB”, in an RFC.)

The bap tool has some interesting mandates on leading whitespace where I’m not sure how they are rooted in RFC 5234. But wholesale indentation of an entire ABNF spec is not a problem with bap.

Grüße, Carsten

reschke commented 3 years ago

Leading whitespace is forbidden in HTTP/1.1 messages (request line, status line).

In ABNF, consistent leading whitespace is tolerated by BAP, but (AFAIR) not allowed by RFC 5234.

cabo commented 3 years ago

On 24. Feb 2021, at 08:47, Julian Reschke notifications@github.com wrote:

Leading whitespace is forbidden in HTTP/1.1 messages (request line, status line)

I would have expected readers to be able to abstract that out. (The wall to the left of it is not part of the painting in my living room either.)

Grüße, Carsten

reschke commented 3 years ago

Readers yes, tools not necessarily (without tinkering).

jrlevine commented 3 years ago

This brings us back to the question of whether <sourcecode> is for actual machine-parsable source code, or for anything that looks like code rather than a picture.

cabo commented 3 years ago

On 2021-02-24, at 09:58, Julian Reschke notifications@github.com wrote:

Readers yes, tools not necessarily (without tinkering).

The tools should now work with the XML and get the real sourcecode, not the rendered one. Solved…

(At least partially, @markers is probably not all processing advice that we’ll need.)

Grüße, Carsten

reschke commented 3 years ago

Hm?

Even if you extract the sourcecode from the XML, if it has leading whitespace, and the language does not allow it, processing will fail.
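
A minimal sketch of the "tinkering" a tool would need, assuming the indentation is uniform across the block: strip the whitespace prefix shared by all non-blank lines before handing the text to a parser. `strip_common_indent` is a hypothetical helper name and the sample text is abbreviated; line-ending conventions (for example CRLF in HTTP/1.1) are not handled here.

```python
import textwrap


def strip_common_indent(code: str) -> str:
    """Remove the indentation shared by all non-blank lines.

    Relative indentation inside the block is preserved; leading and
    trailing blank lines are dropped and one final newline is kept.
    """
    return textwrap.dedent(code).strip("\n") + "\n"


# Abbreviated sample, indented by two spaces as extracted from the XML.
example = (
    "\n"
    "  HTTP/1.1 400 Bad Request\n"
    "  Content-Language: en-US\n"
    "\n"
    "  {\n"
    '    "err": "invalid_key"\n'
    "  }\n"
)
cleaned = strip_common_indent(example)
```

After this step the status line starts in column 0, while the nested JSON member keeps its two-space relative indent.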

mnot commented 3 years ago

It would be great if we could settle the question of what the actual difference between artwork and sourcecode is; I have specs to ship :)

7991 is actually pretty clear:

[sourcecode] is thus useful for source code and formal languages (such as ABNF [RFC5234] or the RNC notation used in this document). (When <sourcecode> is a child of other elements, it flows with the text that surrounds it.) Tab characters (U+0009) inside of this element are prohibited.

For artwork such as character-based art, diagrams of message layouts, and so on, use the <artwork> element instead.

That seems to support the RPC's seeming preference for sourcecode over artwork for not only computer languages and ABNF, but anything with a formal, structured syntax (including HTTP messages; we chose http-message IIRC because it was inappropriate to use message/http to denote a partial message, such as a single header field). In this view, artwork is only suitable for things that are free-form and unstructured, like drawings and diagrams.

I'd be more comfortable if I knew how they were practically different -- e.g., are they displayed differently? Does some other software treat them differently? Still, I think this issue could be closed without action, or at most with some editorial work adding more context to sourcecode and artwork about appropriate use.

reschke commented 3 years ago

Well, right now the RPC seems to prefer <sourcecode> even in other cases, see for instance https://github.com/rfc-format/draft-iab-xml2rfc-v3-bis/issues/195#issuecomment-782193551 - so clarification is needed in any case.

As for differences one might argue that the "keep on single page" requirement probably should be stronger for artwork?

(I agree that this is mostly editorial except maybe for the type attribute issue)

cabo commented 3 years ago

On 2021-03-04, at 07:00, Mark Nottingham notifications@github.com wrote:

That seems to support the RPC's seeming preference for sourcecode over artwork for not only computer languages and ABNF, but anything with a formal, structured syntax (including HTTP messages; we chose http-message IIRC because it was inappropriate to use message/http to denote a partial message, such as a single header field). In this view, artwork is only suitable for things that are free-form and unstructured, like drawings and diagrams.

I actually prefer to use <artwork> for bad source code (i.e., source code that I do not expect the reader to feed into a compiler or other machine processing).

I have cooked up types such as cddl;bad for this purpose; would be good to have a convention.

Grüße, Carsten

stpeter commented 3 years ago

Well, right now the RPC seems to prefer <sourcecode> even in other cases, see for instance #195 (comment) - so clarification is needed in any case.

I talked with RPC folks about this, and their understanding had been that <artwork> is for something that requires visual presentation, such as a diagram or old-fashioned ASCII art. Everything else is <sourcecode> even if it is not machine-readable or even actual code.

And yes this needs to be clearly documented, communicated, and "enforced", although the distinction is already drawn in RFC 7991 (when <sourcecode> was introduced).

stpeter commented 3 years ago

In addition, AIUI the type attribute on <sourcecode> is nice but not necessary, and really no more than a hint. The RPC has added more values for the type attribute over time, as documented at https://www.rfc-editor.org/materials/sourcecode-types.txt - but this list is not intended to be exhaustive.

jrlevine commented 3 years ago

The point of the sourcecode tag list is mostly to be sure that the tag names are used consistently.

We'd need a lot more than tagging for it to be generally useful to extract sourcecode from the XML. It'd need to say what order to paste pieces together if it mattered, what version of the language (python2 vs python3) and a lot more. I don't think that is worth the effort. A few kinds of automatic extraction are useful, notably ABNF where the order doesn't matter, but that doesn't generalize.

reschke commented 3 years ago

I talked with RPC folks about this, and their understanding had been that <artwork> is for something that requires visual presentation, such as a diagram or old-fashioned ASCII art. Everything else is <sourcecode> even if it is not machine-readable or even actual code.

Good to know - but that is not backed by the spec, right?

jrlevine commented 3 years ago

Well, 7991 says what it says. Sourcecode is "This element allows the inclusion of source code into the document" and artwork is "For artwork such as character-based art, diagrams of message layouts, and so on, use the <artwork> element instead." I suppose we could add more detail but the current usage appears consistent with the existing text.

stpeter commented 3 years ago

Because there might be a need for further clarification in the spec, I'll add a will-document label to this issue.

cabo commented 3 years ago

FYI, what we did in RFC8990:

Note that the <CODE BEGINS> for the latter is visible in the renderings, but not in the result of

xpath rfc8990.xml "//sourcecode[@name='grasp.cddl']/text()"

So, for extraction, we have essentially replaced the use of type= (which can still be used for a source code highlighter, except for the artwork cases where that would need to do further inference) with name=. @type='cddl' is still useful for extracting all CDDL and doing a consistency check between the exposition material and the normative CDDL specification.

I'm not sure I can derive more general rules from this result, but it points in a direction that is both compatible with the letter of certain current XML specification documents and still provides some function.
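
For readers without an xpath command-line tool, the same two kinds of selection (by name= for extracting one file, by type= for collecting all fragments of a language) can be sketched with Python's standard library. This is an illustration only: the helper names and the inline `SAMPLE` document are made up for the example and are not taken from RFC 8990.

```python
import xml.etree.ElementTree as ET


def sourcecode_by_name(root, name):
    """Return the text of every <sourcecode> whose name attribute matches."""
    return ["".join(el.itertext())
            for el in root.iter("sourcecode") if el.get("name") == name]


def sourcecode_by_type(root, type_name):
    """Return the text of every <sourcecode> whose type attribute matches."""
    return ["".join(el.itertext())
            for el in root.iter("sourcecode") if el.get("type") == type_name]


# Hypothetical stand-in for an RFC XML file.
SAMPLE = """<rfc><middle><section>
<sourcecode name="example.cddl" type="cddl"><![CDATA[
message = [1, text]
]]></sourcecode>
<sourcecode type="cddl">fragment = int</sourcecode>
<artwork type="ascii-art">+--+</artwork>
</section></middle></rfc>"""

root = ET.fromstring(SAMPLE)
named = sourcecode_by_name(root, "example.cddl")   # the extractable file
all_cddl = sourcecode_by_type(root, "cddl")        # every CDDL fragment
```

Selecting by name yields only the block meant to be extracted as a file, while selecting by type also picks up unnamed expository fragments, which is what a consistency check across a document would want.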