[Spec] Expression grammar formalization

hexabits commented 6 years ago

The repo needs files which include grammars for the attributes that require evaluation of their strings.

These attributes are:

For <add>:
- arr1
- arr2
- cond
- vercond
- arg (?)

For each of these, the grammar would answer:

What is a valid identifier?
What is a valid number?
What is a valid operation?
What is a valid order of operations?

There are additional things to formalize outside of the grammar itself, such as having multiple grammars specialized for the various attributes. For example, logical operations are not valid in arr1 and arr2, or are they? Since we haven't formalized this, there is no actual answer. Right now the only operations that makes arr1 and arr2 expressions are arithmetic though extremely limited (only 1-2 uses iirc).

I will elaborate on this in the comments, since discussing it at the top of the ticket will be likely overwhelming. :)

Identifier

The rules for what a valid identifier is are fairly simple. It's whatever makes a valid name for any of the XML tags, plus a few reserved keywords like the ARG and TEMPLATE tokens. However the latter does not apply to all of the attributes equally.

Identifier Regex

The current regex matches any sequence of alphanumeric (plus :) words separated by a single space.

Like C and other languages, the identifier cannot start with 0-9. Unlike other languages, underscore is NOT valid for a starting character.
Colon has to be an exception because several FO4 RTTI include colons, e.g. BSConnectPoint::Parents. Colon is also used in some name attributes already.
Dashes/hyphens which were present in name attributes had to be removed, as - is an arithmetic operator.
Question marks which were present in name attributes also had to be removed. There is no reason to make a name of a field a question.

Rigorous

identifierregexrigorous

\b[A-Za-z]+(?:[\:]{0,2}?\s{0,1}?\w)+

Edit: I have added \b to the start as it is more explicit about requiring alpha at the start. Without it it would incorrectly match 0x10000.

Disallows:

Any character except A-Z, a-z, 0-9, _ and :.
Identifiers not starting with an alphabetical character ([A-Za-z])
More than one space between words (2+ spaces is considered separate identifiers)
More than two colons in a row
Trailing or leading whitespace

Simplified

identifierregexsimplified

\b[A-Za-z](?:[\w\:]+\s*)+

Almost exactly half the steps in this case, but the output would have to be trimmed.

Allows:

Trailing colons (BAD)
Trailing space (BAD)-ish
3+ sequential colons (BAD)-ish

Number

Version Regex

Gamebryo versions are considered a number because they are merely a way of representing each component bit shifted into an analogous number, best seen in hex, e.g. 20.2.0.7 => 0x14020007.

It gets complicated because to support versions 2.3, 3.0, 3.03, and 3.1 would require a regex that can also match floats. However, none of these versions are currently used in vercond. They are only referred to in ver1 and ver2 which are not expressions.

Simplest 4-Component or 2+ Component

versionregexsimple2

[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,3}

[0-9]{1,2}\.[0-9]{1,2}[\.]?[0-9]{0,2}[\.]?[0-9]{0,3}

Allows:

Trailing periods (BAD)
3-component versions which do not exist e.g. 3.0.1 (BAD)
Versions that start with a 0
Matches well outside the range of currently valid for each field, e.g. 99.99.99.999
Matches substrings of much larger groups of numbers and periods, e.g. 99.99.99.999.99.99.99.999

Somewhat Rigorous

versionregexkindarigorous

[0-9]{1,2}\.[0-9]{1,2}(?:[\.][0-9]{0,2}[\.][0-9]{0,3})?

Disallows:

Trailing periods
3-component versions e.g. 3.0.1
Versions that start with a 0

Allows:

Matches well outside the range of currently valid for each field, e.g. 99.99.99.999
Matches substrings of much larger groups of numbers and periods, e.g. 99.99.99.999.99.99.99.999

Rigorous

versionregexrigorous

(?<![\w\.])[1-9]{1}[0-9]?\.[0-9]{1,2}(?:\.[0-9]{0,2}\.[0-9]{0,3})?(?![\w\.])

The number of steps (in Python at least) blows up with both a negative lookbehind and lookahead. In PCRE (php) it is only ~500 steps. Uncached, both take about 300ms, but even cached, Python takes ~5ms which is quite slow.

Disallows:

Matching of an entire version if preceded or succeeded immediately by a word character or a dot.

Pedantic

I'm not patient enough to write this one, but it would include minimizing the value ranges of each component to what exists in reality, kind of like limiting IPs to 255.255.255.255 among other limitations.

Numeric Regexes

Decimal

\d+(?!\.)

Any series of 1 or more digits that are not directly succeeded by a dot (in order to not match floats).

Hexadecimal

0[xX][0-9a-fA-F]+

Binary

0b[01]+

Float

Simple

[-+]?[0-9]+\.[0-9]+

floatregexsimple

This version of the regex could erroneously match 4-component Version identifiers.

Rigorous

(?<![\w\.])[-+]?[0-9]*[\.][0-9]+(?:[eE][-+]?[0-9]+)?(?:[fF])?(?![\w\.])

floatregexrigorous2

This still doesn't apply to the rarer formats 0.-0. 0.f -0.f 1e10 1e-5, but no one really writes them that way anyway.

Other Identifiers

TEMPLATE and ARG

These are currently special key words and should formally be considered a token. Their use outside of the intended areas (which also aren't currently defined) should error. These two tokens are also a good example of why vercond at least needs its own separate grammar (aside from arithmetic making no sense there).

TEMPLATE is not actually used in any attributes with expressions, only type and template, but it technically could be. Should it?
- Example: cond="TEMPLATE == NiProperty" i.e. template specialization.
TEMPLATE and ARG should probably take on the token delimiter syntax proposed in #70 . It would make things consistent and all strings could be split on tokens via one single regex

Operators

For ease of reading I will make tables of what is in use and what operators do and do not need to be implemented that aren't already being used.

Arithmetic

Addition, Multiplication, and Division are all used. Subtraction used to be but was actually unnecessary for that field.

In Use	✔️	❌
`+` `*` `/`	`-` `%`	`++` `--` `=`

Logical and Bitwise

In Use	✔️	❌
`&&` `\\|\\|` `!`
`&` `\\|`	`<<` `>>`	`^` `~`

Member Access

With the introduction of BSVertexDesc, a compound that is attempting to mimic a bitfield that spans a 64-bit integer, a child of BSVertexDesc needed to be passed into another compound via the shared parent.

<add name="Vertex Desc" type="BSVertexDesc" />
<add name="Vertex Data" type="BSVertexData" arg="Vertex Desc\Vertex Attributes" />

Most programming languages use . for this, but with our identifiers having spaces, it just looks awkward. (Also, in this particular case, the real answer is to actually introduce a real bitfield type, which is relevant to #3 and the Flags type.)

However, a member access operator is still potentially useful and should be formalized. It could potentially be used nearly everywhere because User Version 2 is actually incorrect. The header should look something like this:

<compound name="BSStreamHeader" versions="#BETHESDA#">
    <add name="Version" type="ulittle32" />
    <add name="Export Info" type="ExportInfo" />
</compound>

<compound name="Header">
    <!-- -->
    <add name="Num Blocks" type="ulittle32" ver1="3.1.0.1" />
    <add name="BSStream" type="BSStreamHeader" cond="#BSVERSIONS#" />
    <!-- -->
</compound>

As this is exactly what happens in their engine (minus the conditions on the header version of course, since it's assumed).

Then anywhere there is User Version 2 it would be replaced with BSStream\Version.

Global Variables

This brings up the next issue, which is globally available variables. Right now, all vercond assume access to the following identifiers in some way or another:

Version
User Version
User Version 2 (i.e. BSStream\Version)

Since they have special meaning, they should probably be tokenized (see #70), and any parser should see the token and replace it with the correct value. This is of course only for vercond--furthering the need for two grammars.

These variables are referenced in cond but only in the Header compound since they are actually local variables in that case. Which brings up the next thing in need of specification: Do we make the version values explicitly unavailable to cond? I could contrive uses for having them in cond (as I mentioned I will illustrate in a comment below), but these mostly involve using ternary expressions, something that we also haven't specified. The important thing is that cond is meant for dynamically changing its size and read/write status based on the values of its antecedent siblings.

Therefore the assumption will be that these global values are only available in the grammar specific to vercond, as special keywords.

Grammar Format/Notation

I have already nearly finished writing the grammars for both cond and vercond using a variant of PEG (Packrat) notation. Specifically, this variant can be parsed by Arpeggio for use by nifxml.py.

We do not have to decide on only one format however. A grammars folder has been added to the repo root and any formats we would like to support can be added to it for each grammar. If the files are not actually parsed directly by a library, etc. then at least the grammar is there for reference.

Grammar Requirements for Each Attribute

Operators

	`cond`	`vercond`	`arr*`	`arg`
Logical	✔️	✔️	❌	❌
Relational	✔️	✔️	❌	❌
Arithmetic	✔️	❌	✔️	✔️
Bitwise	✔️	❌	✔️	✔️꙳
Unary Not (`!`)	✔️	✔️	❌	❌
Unary Minus (`-`)	✔️	❌	❌	❌
Grouping `(` `)`	✔️	✔️	✔️	✔️
Member access	✔️	❌	✔️	✔️

꙳: In the BSVertexDesc case, some masking and bit shifting on a plain int64 inside of arg would have been enough to pass the required information to the compound. So, bitwise could be nice to introduce for arg.

Operands

	`cond`	`vercond`	`arr*`	`arg`
Local Ident	✔️	❌	✔️	✔️
Global Ident꙳	❌	✔️	❌	❌
Version-as-int	❌	✔️	❌	❌
Integer	✔️	✔️	✔️	✔️
Float	✔️	❌	❌	✔️
`#ARG#` ꙳	✔️	❌	✔️	✔️
`#TEMPLATE#`꙳	❌	❌	❌	❌

꙳: Any columns with an ❌ for this identifier/token means that the grammar would implement these as reserved keywords, but do nothing with them (or error, etc.).

As can be seen by comparing the columns, each attribute needs its own specific grammar ~~even excluding the possible support of bitwise ops in arg,~~ array and args still differ in allowed operands.

Edit: It was overlooked that arr1 already uses bitwise operators for UV Sets.

Example Grammars (WIP)

Still largely WIP, and I haven't implemented some operators yet, like modulus.

Cond (Gist)

Currently has the entire superset of all the grammars, i.e. nothing has been implemented that is not allowed in it that is allowed in other grammars.

[ ] Normally global keywords like Version and User Version appear in cond only in the Header compound. These keywords should be reserved and unavailable for use in cond, but that requires morphing them into tokens such as #VER# and #USER# for use in vercond.
[ ] Decide if literal floats be used at all in expressions. Will elaborate below.

There's a very odd case of floating point comparison required for BSLightingShaderProperty for FO4:

<add name="Backlight Power" type="float" cond="Rimlight Power == 0x7F7FFFFF" ...  />

To assure an exact comparison, the RHS is written as essentially hex bytes, and it is assumed that the expression parser convert the float to hex bytes as well for the LHS.

Should comparing a float with a non-float receive its own special non-terminal in the grammar? This way it is clear that something special is being done.
Should byte-for-byte comparisons receive an entirely separate operator?
Should the use of float literals actually be removed from the grammar as they have no use in arithmetic, etc. (for the purposes of nif.xml)?

Vercond (Gist)

As discussed, only has relational and logical operators. No member access operator, bitwise, or arithmetic operators.

Current Issues:

[ ] Must not allow the entire string to only be a version keyword or a version number, must have both LHS/RHS
[ ] Disallow use of Version, User Version, etc. by reserving as keywords.
[x] Replace them with tokens such as #VER# and #USER# and that way they are unambiguous.
[ ] Alternate version identifier using ID values formalized in #69 in the format VXX_X_X_X. This allows any language, interpreted or generated, to skip converting the XX.X.X.X format to an integer and use the enumeration value directly.

Arr (Gist)

No relational or logical operators. No floating point numbers. Has member access operators, bitwise, and arithmetic.

Arg

Haven't approached this with any grammar yet. Still not totally sure if it will be necessary.

Note

This ticket is WIP and I will be updating it more this evening, and adding the examples of the PEG grammar once I finalize some things. I do have Arpeggio working fully with the grammars I've defined, and it will be replacing the expression parsing in nifxml.py.

I have yet to elaborate on:

What is a valid order of operations? (I currently have *an* order, but order of operations has some subtleties to discuss)
A summary of what is actually changing vs what is just being formalized, i.e. what repercussions this has on existing projects.
The new <version> ID attribute (Issue #69), and it becoming its own identifier format in addition to--or maybe replacing--the XX.X.X.X format used now. Essentially, the version has a unique identifier and can be referenced directly now, even for version ranges, which is simply all the <version> between two specified IDs. See checklists above.
Version vs #VERSION# (or #VER#, etc.) in expressions. That is, we need to make Version, User Version, and BSStream Version reserved keywords or tokens, and also disallow them in cond because it can't have globals. But in the Header compound, they are local and need to be used as locals, and so can't be reserved. Thus all the Version etc. in vercond probably need to be tokenized and I will elaborate on that later. See checklists above.

hexabits commented 6 years ago

Regarding array expressions mentioned in the ticket:

The only valid logical operation I could think of for arrays is the ternary operator. There are limited instances where rows are doubled so that behavior for the array length can change, usually based on version. This isn't actually getting addressed directly in this ticket, but it illustrates my point:

<add name="Triangles" ... arr1="Num Triangles" ver2="10.0.1.2" />
<add name="Triangles" ... arr1="Num Triangles" ver1="10.0.1.3" cond="Has Triangles"  />

Without using equality or comparison operators, arr1 could actually be expressed like this:

Num Triangles * !(Has Triangles)

It could also be (Has Triangles || Version < 10.0.1.3) ? Num Triangles : 0. There is really no reason to do this though, and there are many reasons not to. For one it assumes that "Has Triangles" is still present and False even though it technically doesn't exist before 10.0.1.3. The ternary version assumes you can use non-local variables such as Version. You can't, or shouldn't be able to. It's also not specified anywhere.

hexabits commented 6 years ago

I missed that arr1 actually already uses bitwise operators for UV Sets length and have corrected the issue text and tables.

hexabits commented 6 years ago

I have added example WIP grammars to the main ticket. They were collectively too long so I put them on Gist.

hexabits commented 6 years ago

I have added information to the main ticket regarding floating point literals and their use in expressions.

Mostly, this floating point comparison required for BSLightingShaderProperty for FO4:

<add name="Backlight Power" type="float" cond="Rimlight Power == 0x7F7FFFFF" ...  />

I believe floating point literals should possibly be removed or explicitly disallowed in the grammar. As for the above condition, it makes for a very problematic edge case. The only thing I can possibly think of is this:

<add name="Backlight Power" type="float" cond="(Rimlight Power ^ 0x7F7FFFFF) == 0" ...  />

Except in C++, etc. bitwise operations are simply not defined on floating point values. This does at least carry more connotation that you are looking at the bits and not the actual floating point value.

We could also special-case the value

3.402823466e+38

And allow that as the only floating point comparison. As far as precision goes, numerically the string seems to be able to vary from 3.4028234 to 3.4028235 in the mantissa and still result in a bytes value of 0x7F7FFFFF.

hexabits commented 6 years ago

ident-grammar-bug

Found a slight issue with my identifier regex involving hex literals. It should only be matching the text, not the integers. The solution was to make sure that hex literal parsing comes before identifier parsing, but I am also using the regex directly to extract identifiers from the expression string for some purposes.

I will explore whether I should explicitly exclude matching of hex integers from the identifier regex.

hexabits commented 6 years ago

ident-grammar-fixed

I just went ahead and put a \b in front of it which is more rigorous anyway. Technically, identifiers in most languages can start with _ but it was already intended to be not so in nif.xml grammar.

Note: Chain\Test\Value 1 is there because identifier chain matches are constructed with a sequence of the identifier regex itself. Identifier chains are already matched before single identifiers in the grammar.

Candoran2 commented 2 years ago

For the exact comparison with floats, I don't think it makes sense to write the hex bytes, since there is no indication that it is a float (aside from the data type of the field, but I don't think there should be special names for any types). The most logical choice to me to exactly represent floats would be the hexadecimal literals as explained here:

https://en.wikipedia.org/wiki/IEEE_754#Character_representation

In order to make a distinction between the various float precisions, that would have to be implied by the precision of the hexadecimal number.

niftools / nifxml