niftools / nifxml

A repository for the nif.xml file, which contains the nif file format description.
http://www.niftools.org
GNU General Public License v3.0
37 stars 43 forks source link

[Spec] Expression grammar formalization #73

Open hexabits opened 6 years ago

hexabits commented 6 years ago

The repo needs files which include grammars for the attributes that require evaluation of their strings.

These attributes are:

For each of these, the grammar would answer:

  1. What is a valid identifier?
  2. What is a valid number?
  3. What is a valid operation?
  4. What is a valid order of operations?

There are additional things to formalize outside of the grammar itself, such as having multiple grammars specialized for the various attributes. For example, logical operations are not valid in arr1 and arr2, or are they? Since we haven't formalized this, there is no actual answer. Right now the only operations that makes arr1 and arr2 expressions are arithmetic though extremely limited (only 1-2 uses iirc).

I will elaborate on this in the comments, since discussing it at the top of the ticket will be likely overwhelming. :)

Identifier

The rules for what a valid identifier is are fairly simple. It's whatever makes a valid name for any of the XML tags, plus a few reserved keywords like the ARG and TEMPLATE tokens. However the latter does not apply to all of the attributes equally.

Identifier Regex

The current regex matches any sequence of alphanumeric (plus :) words separated by a single space.

Rigorous

identifierregexrigorous

\b[A-Za-z]+(?:[\:]{0,2}?\s{0,1}?\w)+

Edit: I have added \b to the start as it is more explicit about requiring alpha at the start. Without it it would incorrectly match 0x10000.

Disallows:

Simplified

identifierregexsimplified

\b[A-Za-z](?:[\w\:]+\s*)+

Almost exactly half the steps in this case, but the output would have to be trimmed.

Allows:

Number

Version Regex

Gamebryo versions are considered a number because they are merely a way of representing each component bit shifted into an analogous number, best seen in hex, e.g. 20.2.0.7 => 0x14020007.

It gets complicated because to support versions 2.3, 3.0, 3.03, and 3.1 would require a regex that can also match floats. However, none of these versions are currently used in vercond. They are only referred to in ver1 and ver2 which are not expressions.

Simplest 4-Component or 2+ Component

versionregexsimple2

[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,3}

[0-9]{1,2}\.[0-9]{1,2}[\.]?[0-9]{0,2}[\.]?[0-9]{0,3}

Allows:

Somewhat Rigorous

versionregexkindarigorous

[0-9]{1,2}\.[0-9]{1,2}(?:[\.][0-9]{0,2}[\.][0-9]{0,3})?

Disallows:

Allows:

Rigorous

versionregexrigorous

(?<![\w\.])[1-9]{1}[0-9]?\.[0-9]{1,2}(?:\.[0-9]{0,2}\.[0-9]{0,3})?(?![\w\.])

The number of steps (in Python at least) blows up with both a negative lookbehind and lookahead. In PCRE (php) it is only ~500 steps. Uncached, both take about 300ms, but even cached, Python takes ~5ms which is quite slow.

Disallows:

Pedantic

I'm not patient enough to write this one, but it would include minimizing the value ranges of each component to what exists in reality, kind of like limiting IPs to 255.255.255.255 among other limitations.

Numeric Regexes

Decimal

\d+(?!\.)

Any series of 1 or more digits that are not directly succeeded by a dot (in order to not match floats).

Hexadecimal

0[xX][0-9a-fA-F]+

Binary

0b[01]+

Float

Simple

[-+]?[0-9]+\.[0-9]+

floatregexsimple

This version of the regex could erroneously match 4-component Version identifiers.

Rigorous

(?<![\w\.])[-+]?[0-9]*[\.][0-9]+(?:[eE][-+]?[0-9]+)?(?:[fF])?(?![\w\.])

floatregexrigorous2

This still doesn't apply to the rarer formats 0.-0. 0.f -0.f 1e10 1e-5, but no one really writes them that way anyway.

Other Identifiers

TEMPLATE and ARG

These are currently special key words and should formally be considered a token. Their use outside of the intended areas (which also aren't currently defined) should error. These two tokens are also a good example of why vercond at least needs its own separate grammar (aside from arithmetic making no sense there).

  1. TEMPLATE is not actually used in any attributes with expressions, only type and template, but it technically could be. Should it?

    • Example: cond="TEMPLATE == NiProperty" i.e. template specialization.
  2. TEMPLATE and ARG should probably take on the token delimiter syntax proposed in #70 . It would make things consistent and all strings could be split on tokens via one single regex

Operators

For ease of reading I will make tables of what is in use and what operators do and do not need to be implemented that aren't already being used.

Arithmetic

Addition, Multiplication, and Division are all used. Subtraction used to be but was actually unnecessary for that field.

In Use ✔️
+ * / - % ++ -- =

Logical and Bitwise

In Use ✔️
&& \|\| !
& \| << >> ^ ~

Member Access

With the introduction of BSVertexDesc, a compound that is attempting to mimic a bitfield that spans a 64-bit integer, a child of BSVertexDesc needed to be passed into another compound via the shared parent.

<add name="Vertex Desc" type="BSVertexDesc" />
<add name="Vertex Data" type="BSVertexData" arg="Vertex Desc\Vertex Attributes" />

Most programming languages use . for this, but with our identifiers having spaces, it just looks awkward. (Also, in this particular case, the real answer is to actually introduce a real bitfield type, which is relevant to #3 and the Flags type.)

However, a member access operator is still potentially useful and should be formalized. It could potentially be used nearly everywhere because User Version 2 is actually incorrect. The header should look something like this:

<compound name="BSStreamHeader" versions="#BETHESDA#">
    <add name="Version" type="ulittle32" />
    <add name="Export Info" type="ExportInfo" />
</compound>

<compound name="Header">
    <!-- -->
    <add name="Num Blocks" type="ulittle32" ver1="3.1.0.1" />
    <add name="BSStream" type="BSStreamHeader" cond="#BSVERSIONS#" />
    <!-- -->
</compound>

As this is exactly what happens in their engine (minus the conditions on the header version of course, since it's assumed).

Then anywhere there is User Version 2 it would be replaced with BSStream\Version.

Global Variables

This brings up the next issue, which is globally available variables. Right now, all vercond assume access to the following identifiers in some way or another:

Since they have special meaning, they should probably be tokenized (see #70), and any parser should see the token and replace it with the correct value. This is of course only for vercond--furthering the need for two grammars.

These variables are referenced in cond but only in the Header compound since they are actually local variables in that case. Which brings up the next thing in need of specification: Do we make the version values explicitly unavailable to cond? I could contrive uses for having them in cond (as I mentioned I will illustrate in a comment below), but these mostly involve using ternary expressions, something that we also haven't specified. The important thing is that cond is meant for dynamically changing its size and read/write status based on the values of its antecedent siblings.

Therefore the assumption will be that these global values are only available in the grammar specific to vercond, as special keywords.

Grammar Format/Notation

I have already nearly finished writing the grammars for both cond and vercond using a variant of PEG (Packrat) notation. Specifically, this variant can be parsed by Arpeggio for use by nifxml.py.

We do not have to decide on only one format however. A grammars folder has been added to the repo root and any formats we would like to support can be added to it for each grammar. If the files are not actually parsed directly by a library, etc. then at least the grammar is there for reference.

Grammar Requirements for Each Attribute

Operators

cond vercond arr* arg
Logical ✔️ ✔️
Relational ✔️ ✔️
Arithmetic ✔️ ✔️ ✔️
Bitwise ✔️ ✔️ ✔️꙳
Unary Not (!) ✔️ ✔️
Unary Minus (-) ✔️
Grouping ( ) ✔️ ✔️ ✔️ ✔️
Member access ✔️ ✔️ ✔️

꙳: In the BSVertexDesc case, some masking and bit shifting on a plain int64 inside of arg would have been enough to pass the required information to the compound. So, bitwise could be nice to introduce for arg.

Operands

cond vercond arr* arg
Local Ident ✔️ ✔️ ✔️
Global Ident꙳ ✔️
Version-as-int ✔️
Integer ✔️ ✔️ ✔️ ✔️
Float ✔️ ✔️
#ARG# ✔️ ✔️ ✔️
#TEMPLATE#

꙳: Any columns with an ❌ for this identifier/token means that the grammar would implement these as reserved keywords, but do nothing with them (or error, etc.).

As can be seen by comparing the columns, each attribute needs its own specific grammar even excluding the possible support of bitwise ops in arg, array and args still differ in allowed operands.

Edit: It was overlooked that arr1 already uses bitwise operators for UV Sets.

Example Grammars (WIP)

Still largely WIP, and I haven't implemented some operators yet, like modulus.

Cond (Gist)

Currently has the entire superset of all the grammars, i.e. nothing has been implemented that is not allowed in it that is allowed in other grammars.

There's a very odd case of floating point comparison required for BSLightingShaderProperty for FO4:

<add name="Backlight Power" type="float" cond="Rimlight Power == 0x7F7FFFFF" ...  />

To assure an exact comparison, the RHS is written as essentially hex bytes, and it is assumed that the expression parser convert the float to hex bytes as well for the LHS.

  1. Should comparing a float with a non-float receive its own special non-terminal in the grammar? This way it is clear that something special is being done.
  2. Should byte-for-byte comparisons receive an entirely separate operator?
  3. Should the use of float literals actually be removed from the grammar as they have no use in arithmetic, etc. (for the purposes of nif.xml)?

Vercond (Gist)

As discussed, only has relational and logical operators. No member access operator, bitwise, or arithmetic operators.

Current Issues:

Arr (Gist)

No relational or logical operators. No floating point numbers. Has member access operators, bitwise, and arithmetic.

Arg

Haven't approached this with any grammar yet. Still not totally sure if it will be necessary.

Note

This ticket is WIP and I will be updating it more this evening, and adding the examples of the PEG grammar once I finalize some things. I do have Arpeggio working fully with the grammars I've defined, and it will be replacing the expression parsing in nifxml.py.

I have yet to elaborate on:

hexabits commented 6 years ago

Regarding array expressions mentioned in the ticket:

The only valid logical operation I could think of for arrays is the ternary operator. There are limited instances where rows are doubled so that behavior for the array length can change, usually based on version. This isn't actually getting addressed directly in this ticket, but it illustrates my point:

<add name="Triangles" ... arr1="Num Triangles" ver2="10.0.1.2" />
<add name="Triangles" ... arr1="Num Triangles" ver1="10.0.1.3" cond="Has Triangles"  />

Without using equality or comparison operators, arr1 could actually be expressed like this:

Num Triangles * !(Has Triangles)

It could also be (Has Triangles || Version < 10.0.1.3) ? Num Triangles : 0. There is really no reason to do this though, and there are many reasons not to. For one it assumes that "Has Triangles" is still present and False even though it technically doesn't exist before 10.0.1.3. The ternary version assumes you can use non-local variables such as Version. You can't, or shouldn't be able to. It's also not specified anywhere.

hexabits commented 6 years ago

I missed that arr1 actually already uses bitwise operators for UV Sets length and have corrected the issue text and tables.

hexabits commented 6 years ago

I have added example WIP grammars to the main ticket. They were collectively too long so I put them on Gist.

hexabits commented 6 years ago

I have added information to the main ticket regarding floating point literals and their use in expressions.

Mostly, this floating point comparison required for BSLightingShaderProperty for FO4:

<add name="Backlight Power" type="float" cond="Rimlight Power == 0x7F7FFFFF" ...  />

I believe floating point literals should possibly be removed or explicitly disallowed in the grammar. As for the above condition, it makes for a very problematic edge case. The only thing I can possibly think of is this:

<add name="Backlight Power" type="float" cond="(Rimlight Power ^ 0x7F7FFFFF) == 0" ...  />

Except in C++, etc. bitwise operations are simply not defined on floating point values. This does at least carry more connotation that you are looking at the bits and not the actual floating point value.

We could also special-case the value

3.402823466e+38

And allow that as the only floating point comparison. As far as precision goes, numerically the string seems to be able to vary from 3.4028234 to 3.4028235 in the mantissa and still result in a bytes value of 0x7F7FFFFF.

hexabits commented 6 years ago

ident-grammar-bug

Found a slight issue with my identifier regex involving hex literals. It should only be matching the text, not the integers. The solution was to make sure that hex literal parsing comes before identifier parsing, but I am also using the regex directly to extract identifiers from the expression string for some purposes.

I will explore whether I should explicitly exclude matching of hex integers from the identifier regex.

hexabits commented 6 years ago

ident-grammar-fixed

I just went ahead and put a \b in front of it which is more rigorous anyway. Technically, identifiers in most languages can start with _ but it was already intended to be not so in nif.xml grammar.

Note: Chain\Test\Value 1 is there because identifier chain matches are constructed with a sequence of the identifier regex itself. Identifier chains are already matched before single identifiers in the grammar.

Candoran2 commented 2 years ago

For the exact comparison with floats, I don't think it makes sense to write the hex bytes, since there is no indication that it is a float (aside from the data type of the field, but I don't think there should be special names for any types). The most logical choice to me to exactly represent floats would be the hexadecimal literals as explained here:

https://en.wikipedia.org/wiki/IEEE_754#Character_representation

In order to make a distinction between the various float precisions, that would have to be implied by the precision of the hexadecimal number.