unitsofmeasurement / indriya

JSR 385 - Reference Implementation
Other
115 stars 40 forks source link

EBNFUnitFormat won't accept the letter "e" as a unit alias. #250

Open wnreynoldsWork opened 4 years ago

wnreynoldsWork commented 4 years ago

The following code:

`import java.io.PrintStream; import java.io.UnsupportedEncodingException; import java.util.Locale; import java.util.ResourceBundle;

import tech.units.indriya.format.EBNFUnitFormat; import tech.units.indriya.format.SymbolMap; import tech.units.indriya.unit.Units;

/**

}`

Fails with the following error:

`Exception in thread "main" javax.measure.format.MeasurementParseException: tech.units.indriya.internal.format.TokenException: Encountered " "e" "e "" at line 1, column 1. Was expecting one of: "(" ...

... ... ... ... ... ... ... ... at tech.units.indriya.format.EBNFUnitFormat.parse(EBNFUnitFormat.java:272) at tech.units.indriya.format.EBNFUnitFormat.parse(EBNFUnitFormat.java:281) at tech.units.indriya.format.EBNFUnitFormat.parse(EBNFUnitFormat.java:286) at com.stellarscience.unitparsertest.TestEBNF.main(TestEBNF.java:35)` The grammar for the EBNF format indicates that a single initial character is allowed, and that all letters are allowed initial characters.
keilw commented 4 years ago

I think "e" is a special case because it also exists in the textual representation of numbers like "1.60217662e-19". If you have a unit called "e" the "1.60217662e" would make it difficult to tell apart whether it is a number or a quantity.

wnreynoldsWork commented 4 years ago

I don't think it's that hard, given that you seem to be using a JavaCC parser. I've implemented a number of parsers that handle floating point numbers and variables named 'e' or 'E' using ANTLR. I can send a simple example if that would help. If you could make your JavaCC grammar (I couldn't find it in the repo), I'm pretty sure I could provide an example.

wnreynoldsWork commented 4 years ago

Here's a fragment of an ANTLR grammar that will parse strings, including 'e' into a WORD token as well as full floating point numbers with exponential notation. Again, if you could provide the JavaCC grammar, I think I could add something like this so that you could distinguish 'e' in a floating point number from 'e' as a variable name.

` // Number parsing INTEGER: PLUS_OR_MINUS? DIGITS;

REAL : (INTEGER POINT? DIGITS? (E INTEGER)) | (INTEGER POINT) | (INTEGER? POINT DIGITS) ;

E : 'e' | 'E' ;

PLUS_OR_MINUS : (PLUS|MINUS) ;

PLUS : '+' ;

MINUS : '-' ;

POINT : '.' ;

WORD: LETTER (LETTER|DIGIT|'_'|'-'|'.'|'/'|'\')*; LETTER: [a-zA-Z]; DIGITS: DIGIT+; DIGIT: [0-9]; `

keilw commented 4 years ago

We haven't been using this for a long time, EBNFUnitParser was maintained manually. Do you think we could use the example mentioned here with something like https://github.com/phax/ph-javacc-maven-plugin?

wnreynoldsWork commented 4 years ago

I've never used it, but I don't see why not. We used the maven antlr plugin as well as some custom ant tasks to build the parsers in our project.

keilw commented 4 years ago

You think Ant is necessary? It may not be a problem either because at least in https://github.com/unitsofmeasurement/unit-api we already use it for the Java Mondule system (there could be newer Maven plugins for that soon)

wnreynoldsWork commented 4 years ago

I don't know if ant is necessary. Our application did some pretty involved stuff (we used antlr to parse a proprietary grammar format, converting the proprietary grammar to a an antlr grammar and then built a parser from the generated using antlr, so we had to create some custom ant tasks).

Just running JavaCC (or ANTLR) once to create parser code for the application should be fairly straightforward (with the caveat that I have not used JavaCC myself).

keilw commented 4 years ago

This could be used as a starting point btw, it was not directly used in JSR 275 either but the codebase still exists: https://github.com/unitsofmeasurement/jsr-275/blob/master/src/main/javacc/javax/measure/unit/format/UnitParser.jj

A generated output should more or less act as drop-in replacement for UnitFormatParser (which still shows traces of JavaCC, but it had long been manually updated)

wnreynoldsWork commented 4 years ago

Bingo. I'll have a look.

keilw commented 4 years ago

Thanks, btw, for anything beyond code snippets and exchanging issue tickets, because Indriya is the RI of JSR 385, to raise actual PRs you may have to join the JCP at least as Associate Member, or are you already? See https://github.com/unitsofmeasurement/unit-api/wiki/Contribution-Guide

wnreynoldsWork commented 4 years ago

I am not, I'm just some random guy on the internet. I'll have a look.

wnreynoldsWork commented 4 years ago

Still reviewing this grammar (and learning JavaCC), but it seems the problem is the grammar supports log(unit), ln(unit) and e^unit. While the parser accepts these constructs, it does not seem to process them correctly (they return incorrect dimensions).

`import java.io.PrintStream; import java.io.UnsupportedEncodingException; import java.util.Locale; import java.util.ResourceBundle;

import tech.units.indriya.format.EBNFUnitFormat; import tech.units.indriya.format.SymbolMap;

/**

} `

The output of this test code is:

Unit of lnW: [W?] Dimension of lnW: [L]²·[M]/[T]³ Unit of e^W: [W?] Dimension of e^W: [L]²·[M]/[T]³ Unit of W: W Dimension of W: [L]²·[M]/[T]³

I don't know how to interpret [W?] (this is printing out in unicode), but the dimensions are certainly incorrect. Is supporting these transcendental units part of the spec? Would it be possible to eliminate ln, log and e^ from the unit parser?

wnreynoldsWork commented 4 years ago

Still reviewing this.

The problem is that defining an alias 'e' conflicts with the reserved character 'e' used for Euler's number. The parser allows constructs like 'e^identifier'. With a unit aliased 'e', the parser doesn't know what to do, and throws an error. BTW, is not listed as a reserved character in the javadoc for EBNFUnitFormat ("log", "log" and "ln" are).

So how to solve? 1) (above) get rid of log, ln and e as reserved words. Update javadoc. 2) Instead of using 'e' as a reserved character and looking for constructs like e^thing, replace with exp(thing). The sequence 'exp' becomes reserved, but this is less problematic, imho. Fix parser to support exp, log, ln, update javadoc. 3) Leave unchanged, update javadoc, fix parser to support e^, log, ln.

I'll try to get some small demo parsers of each case along with a maven build script. Note that I will not try to correct the rest of the codebase to support transcendentals of units.

I've asked the bosses about whether I can sign the stuff necessary to be a JSR member, will keep you posted. If I can't, I'll just send the demo code directly to Werner instead of doing a PR.

wnreynoldsWork commented 4 years ago

Ok, I've prepared a repo that demonstrates how to build JavaCC UnitParser grammars using Maven. I've provided several simplified grammars that eliminate features from EBNF like exponents, logarithms, sums and the mix construct. Candidly, I don't quite understand how these constructs work in a unit parsing context - see my comments above. At the very least, they should be documented with use cases - I didn't see anything describing them in the JSR document at https://download.oracle.com/otndocs/jcp/units-2_0-pr-spec/index.html.

This repo also provides a simple demo of setting up aliases using the EBNF parser. See the README.

https://github.com/wnreynoldsWork/EBNFParserDemos

keilw commented 3 years ago

EBNFUnitFormat is an implementation detail, so the parsing details would not have a place in the Specification of the Unit API, if anywhere then a future version of https://github.com/unitsofmeasurement/uom-guide. We might include a generated grammar in a version beyond MR1, but I guess for 2.1 it is a bit risky.