Possible Error in PML Docs

slanden commented 1 year ago

In https://pml-lang.dev/docs/user_manual, the Anatomy > Attributes section says

Attribute assignments are separated by a space

The Text Processing > Lenient Parsing section says

Quotes around attribute values can be omitted if the value does not contain:

whitespace (' ', '\t', '\r\n', '\n')

any of the following characters: [ ] ( ) " '
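The quoted rule can be expressed as a small check. This is a minimal Python sketch, not PML's actual implementation: the helper name `can_omit_quotes` is hypothetical, and treating an empty value as "needs quotes" is an assumption the quoted text does not state.

```python
# Characters that, per the quoted rule, force an attribute value to be quoted:
# whitespace plus [ ] ( ) " '
FORBIDDEN_UNQUOTED = set(' \t\r\n[]()"\'')


def can_omit_quotes(value: str) -> bool:
    """Return True if `value` contains none of the characters listed above.

    Hypothetical helper. Rejecting the empty string is an assumption.
    """
    return bool(value) and not any(ch in FORBIDDEN_UNQUOTED for ch in value)


print(can_omit_quotes("something"))          # True
print(can_omit_quotes("two words"))          # False
print(can_omit_quotes(r"C:\temp\test.txt"))  # True
```

Note that a backslash is not in the list, so a value made of a backslash followed by a letter passes the check, while a value containing an actual tab character does not.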

So far, so good.

However, the Text Processing > Whitespace Handling > Attributes section, which is not a part of the Lenient Parsing section, shows an example of how you can separate attributes with new lines.

[tag (
    attribute1 = something
    attribute2 = 400
)]

That's the same as:

[tag (\nattribute1 = something\nattribute2 = 400\n)]

That looks like the mentioned Lenient Parsing, but I'm not 100% certain. More importantly though, the Text Processing > Escape Characters > Attributes section then says

Escape sequences are not supported in unquoted attribute values.

And the example:

... assign the value C:\temp\test.txt to attribute path, i.e. path = C:\temp\test.txt

So, given the whitespace rules explained in earlier sections (or maybe just the Lenient Parsing section), this would be interpreted as

path = C:    emp    est.txt

producing three attributes: path, emp, and est.txt, because \t is whitespace.

Is that example only valid in non-lenient parsing, and would need to be quoted in lenient parsing?

pml-lang commented 1 year ago

The following statement in the PML user manual is not correct:

"Attribute assignments are separated by a space."

Instead, it should state:

"Attribute assignments are separated by whitespace (a sequence of one or more spaces, tabs, and new lines)."

Thanks for reporting this bug in the docs.

I've fixed this in my local branch, and the fix will be included in the next PML version 4.0.0 (planned to be published this or next month).

Hence, the following code is ok:

[tag (
    attribute1 = something
    attribute2 = 400
)]
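The corrected rule (assignments separated by any run of spaces, tabs, and new lines) can be illustrated with a rough Python sketch. This is not the PML parser: the regular expression, the helper name `parse_unquoted_attributes`, and the assumed shape of attribute names are all hypothetical, and quoted values are deliberately left out.

```python
import re

# Hypothetical pattern: a name, '=', then an unquoted value that stops at
# whitespace or at one of [ ] ( ) " '. Quoted values are not handled here.
ASSIGNMENT = re.compile(r"""([A-Za-z_]\w*)\s*=\s*([^\s\[\]()"']+)""")


def parse_unquoted_attributes(text: str) -> list[tuple[str, str]]:
    """Return (name, value) pairs found between the parentheses of a node."""
    return ASSIGNMENT.findall(text)


multi_line = "\n    attribute1 = something\n    attribute2 = 400\n"
single_line = "attribute1 = something attribute2 = 400"

print(parse_unquoted_attributes(multi_line))
# [('attribute1', 'something'), ('attribute2', '400')]
print(parse_unquoted_attributes(multi_line) == parse_unquoted_attributes(single_line))
# True
```

Under these assumptions, separating assignments with line breaks, tabs, or single spaces gives the same result.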

"That's the same as: [tag (\nattribute1 = something\nattribute2 = 400\n)]"

It's not the same. \nattribute1 would be an invalid attribute name, because \ is not allowed in attribute names (and escaping is not supported in names). The parser generates the following error:

Expecting a valid name. A name cannot start with '\'.

Moreover, something\nattribute2 would be parsed as a single attribute value (containing a backslash followed by the letter n), because (1) the value is unquoted, and (2) escape sequences are not supported in unquoted values.

And because of (2), the following assignment:

path = C:\temp\test.txt

... is equivalent to:

path = "C:\\temp\\test.txt"

... which means that the value C:\temp\test.txt is assigned to attribute path.

The above example is also shown in the PML User Manual, at the end of chapter Text Processing / Escape Characters / Attributes.
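The difference between unquoted and quoted values described above can be sketched in Python. This is only an illustration under stated assumptions: `decode_value` is a hypothetical helper, and the escape table is a guess (the thread only confirms `\\` in a quoted value), so the set PML really supports may differ.

```python
# Hypothetical escape table; the real set supported by PML may differ.
ESCAPES = {"\\": "\\", "n": "\n", "t": "\t", "r": "\r", '"': '"'}


def decode_value(raw: str) -> str:
    """Return the value denoted by the attribute text `raw` (hypothetical)."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Quoted: escape sequences are resolved.
        out = []
        chars = iter(raw[1:-1])
        for ch in chars:
            if ch == "\\":
                nxt = next(chars, None)
                if nxt not in ESCAPES:
                    raise ValueError(f"invalid escape sequence: \\{nxt}")
                out.append(ESCAPES[nxt])
            else:
                out.append(ch)
        return "".join(out)
    # Unquoted: no escape processing, every character is taken literally.
    return raw


unquoted = decode_value(r'C:\temp\test.txt')
quoted = decode_value(r'"C:\\temp\\test.txt"')
print(unquoted == quoted)  # True
print(unquoted)            # C:\temp\test.txt
```

Both spellings yield the same value, which matches the equivalence stated above.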

I hope this clears it up.

slanden commented 1 year ago

If it's not the same, how are you differentiating between them in your parser?

If I pass a chunk of PDML source as raw bytes and come across a new line, whether it was in the source implicitly by pressing "enter" or explicitly by typing '\n', in both cases the byte value is the same. You're obviously differentiating somehow if you have the error handling in place for it; I just don't see how.

For example, this byte string

b"
\n

"

gives the following bytes: [10, 10, 10, 10]

pdml-lang commented 1 year ago

The bytes for pressing <Enter> or typing "\n" are different.

When you press <Enter> then:

On Linux you create a single Unicode character "New Line" (U+000A).
On Windows you create two Unicode characters: "Carriage Return" (U+000D), followed by "New Line" (U+000A).

On the other hand, when you type "\n", then you create the following two Unicode characters:

Unicode Character “\” (U+005C)
Unicode Character “n” (U+006E)

The character escape mechanism of the parser converts these two characters into a single "New Line" character (but only if escaping is supported in the given context).

The actual bytes stored in the file depend on the encoding used. PDML and PML both use UTF-8 encoding.
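The byte values are easy to inspect. The following Python sketch (Python is used here only for illustration; the variable names are made up) prints the UTF-8 bytes for each of the three cases described above:

```python
# What ends up in the file when <Enter> is pressed:
enter_linux = "\n"          # U+000A
enter_windows = "\r\n"      # U+000D, U+000A

# What ends up in the file when a backslash and the letter n are typed:
typed_backslash_n = "\\n"   # U+005C, U+006E

for label, text in [
    ("Enter (Linux)", enter_linux),
    ("Enter (Windows)", enter_windows),
    ("typed \\n", typed_backslash_n),
]:
    print(label, list(text.encode("utf-8")))
# Enter (Linux) [10]
# Enter (Windows) [13, 10]
# typed \n [92, 110]
```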

slanden commented 1 year ago

Well, you taught me something new. I'm writing a parser, and in all my tests I was simulating input strings as raw byte strings, so when I typed '\n' it automatically became a real newline. But if I read the text from an IO stream, a '\' is automatically placed before any literally typed '\' + 'n', which becomes, as you said, '\' + '\' + 'n'.

It's all cleared up now, thanks!
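This testing pitfall can be reproduced in a few lines. The sketch below uses Python purely as an illustration (the thread does not say which language the parser is written in), and `io.BytesIO` stands in for a real file or stream. One caveat: the doubled backslash shows up only in the escaped display form of the string (`repr` in Python); the data itself still holds just two characters, a backslash and the letter n.

```python
import io

# In an ordinary string literal, \n is turned into one newline character.
literal = "a = b\nc = d"

# A raw literal keeps what was typed: a backslash followed by the letter n.
raw_literal = r"a = b\nc = d"

# Stand-in for a file in which a user typed backslash + n.
stream = io.BytesIO(raw_literal.encode("utf-8"))
from_stream = stream.read().decode("utf-8")

print("\n" in literal)              # True
print("\n" in from_stream)          # False
print(from_stream == raw_literal)   # True
print(repr(from_stream))            # 'a = b\\nc = d'
```

For parser tests, a raw literal (or an explicitly doubled backslash) is therefore the closer simulation of text read from a file.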

slanden commented 1 year ago

I think there's another error, in the PDML Extensions User Manual > Syntax Extensions > Attributes > Lenient Parsing.

Under the "Warning", I think the '#' should be '@':

[foo [# a1 = "v1"]]

pml-lang commented 1 year ago

"the '#' should be '@'"

Good catch. It's now fixed. Thanks for reporting this.

pdml-lang / pdml-lang.github.io

Possible Error in PML Docs #13