lfoppiano / grobid-quantities

GROBID extension for identifying and normalizing physical quantities.
https://grobid-quantities.readthedocs.io
Apache License 2.0

Numerical value as exponent on 10s #7

Closed (kermitt2 closed 1 year ago)

kermitt2 commented 8 years ago

The quantity CRF model recognizes numerical expressions with exponents on 10 (in particular distorted ones, due to PDF text extraction):

[Image: example_exponent]

However, we are not currently parsing them (in their "noisy" form) into actual BigDecimal values.

lfoppiano commented 6 years ago

Regarding this subject, I've started implementing a CRF model to parse these values.

The idea is to classify the parts of the value; for example, 5 x 10-5 would be tagged as:

<val>5</val>
<operation>x</operation> --> although this could be just <other>?
<base>10</base>
<pow>-5</pow>

Regarding the resulting value, we need to accept that it will not be precise, as we have to approximate (we are talking about small values).

I would say we should save both forms: the "structured" parsed value following the schema above, and the parsed value as a BigDecimal, to have an approximate value.

Does it make sense?
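
To make the proposal concrete, here is a minimal sketch of keeping both forms side by side. The `StructuredValue` class and its method names are hypothetical and not part of grobid-quantities. It assumes the CRF model has already produced the `<val>`, `<base>` and `<pow>` strings and that the power is an integer.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical holder for the structured form (<val>, <base>, <pow>).
final class StructuredValue {
    final String val;
    final String base;
    final String pow;

    StructuredValue(String val, String base, String pow) {
        this.val = val;
        this.base = base;
        this.pow = pow;
    }

    // Computes val * base^pow, rounded according to the given MathContext.
    // Assumption: pow is an integer. A MathContext with non-zero precision
    // is required because BigDecimal.pow only accepts negative exponents
    // when the precision is bounded.
    BigDecimal toBigDecimal(MathContext mc) {
        // PDF extraction may yield a Unicode minus sign (U+2212) instead of '-'.
        int exponent = Integer.parseInt(pow.trim().replace('\u2212', '-'));
        BigDecimal power = new BigDecimal(base.trim()).pow(exponent, mc);
        return new BigDecimal(val.trim()).multiply(power, mc);
    }

    public static void main(String[] args) {
        StructuredValue v = new StructuredValue("5", "10", "-5");
        System.out.println(v.toBigDecimal(MathContext.DECIMAL64).toPlainString());
    }
}
```

The structured fields stay available as they were recognized, while the numeric value is derived on demand with whatever precision the caller chooses.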

kermitt2 commented 6 years ago

Hello!

The value parser should be generic enough to cover several cases described in #13.

The tagset has to be relevant for the different cases:

kermitt2 commented 6 years ago

Regarding the approximation, the idea behind introducing BigDecimal was to make it possible to set the precision, scale, and rounding, so that we can avoid the usual issue where 1 becomes 1.000000001.
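
As an illustration of that point, the sketch below contrasts binary floating point with `BigDecimal` using only standard `java.math` calls. The `PrecisionDemo` class and its method names are made up for this example. The rounding mode (`HALF_UP`) is an assumption, not a project decision.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical demo class; not part of grobid-quantities.
final class PrecisionDemo {

    // Adds 0.1 ten times using double arithmetic. Because 0.1 has no exact
    // binary representation, the total is not exactly 1.
    static double sumDoubles() {
        double total = 0.0;
        for (int i = 0; i < 10; i++) {
            total += 0.1;
        }
        return total;
    }

    // Same sum with BigDecimal built from a decimal string: the total is exact.
    static BigDecimal sumDecimals() {
        BigDecimal total = BigDecimal.ZERO;
        BigDecimal step = new BigDecimal("0.1");
        for (int i = 0; i < 10; i++) {
            total = total.add(step);
        }
        return total;
    }

    // Explicitly chooses the scale (digits after the decimal point) and the
    // rounding mode. HALF_UP is an assumption made for this sketch.
    static BigDecimal round(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(sumDoubles() == 1.0);
        System.out.println(sumDecimals().compareTo(BigDecimal.ONE) == 0);
        System.out.println(round(new BigDecimal("1.000000001"), 3));
    }
}
```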

lfoppiano commented 5 years ago

It has been pointed out to me that 2.3E5 corresponds to E-notation (https://en.wikipedia.org/wiki/Scientific_notation#E-notation), meaning that it's a special case of a power of 10.

Was this the initial thought you had, @kermitt2?

What do you mean by exponential function? I think I have misunderstood that part...
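
For reference, under the E-notation reading (where E stands for "times ten to the power of"), Java's `BigDecimal` string constructor already accepts this form. The sketch below shows only that reading; the `ENotationDemo` class name is hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical demo class; not part of grobid-quantities.
final class ENotationDemo {

    // Parses a string in E-notation, reading "2.3E5" as 2.3 x 10^5.
    // The BigDecimal(String) constructor accepts both 'e' and 'E'.
    static BigDecimal parse(String text) {
        return new BigDecimal(text.trim());
    }

    public static void main(String[] args) {
        // toPlainString() prints the value without an exponent field.
        System.out.println(parse("2.3E5").toPlainString());
    }
}
```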

kermitt2 commented 5 years ago

I would hypothesize that this notation is not relevant to scientific papers, where the e is always the exponential function. Why would someone use the calculator/old program notation in a typeset scientific paper?

<exp> for the value of the exponential function (E/e kept non annotated) -> in this context we just annotate the value of the exponential, not the exponential function e, e.g. 10e5 -> <number>10</number>e<exp>5</exp> (note that we could maybe find something better than <number> as tag name)
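
To illustrate the annotation scheme above, here is a rough sketch that uses a regular expression as a stand-in for the CRF model, only to show the expected output shape. The `ExpAnnotator` class, its pattern and its method names are assumptions made for this example. The numeric value follows the exponential-function reading, so 10e5 is computed as 10 × e^5.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the real tagging would be done by the CRF model.
final class ExpAnnotator {

    // number, the letter e/E, then the exponent value (assumed pattern).
    private static final Pattern VALUE = Pattern.compile(
            "([+-]?\\d+(?:\\.\\d+)?)([eE])([+-]?\\d+(?:\\.\\d+)?)");

    // Wraps the number and the exponent in tags and leaves e/E unannotated.
    // Input that does not match the pattern is returned unchanged.
    static String annotate(String text) {
        Matcher m = VALUE.matcher(text.trim());
        if (!m.matches()) {
            return text;
        }
        return "<number>" + m.group(1) + "</number>" + m.group(2)
                + "<exp>" + m.group(3) + "</exp>";
    }

    // Value under the exponential-function reading: number * e^exp.
    static double valueAsExponential(String text) {
        Matcher m = VALUE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not of the form <number>e<exp>: " + text);
        }
        return Double.parseDouble(m.group(1)) * Math.exp(Double.parseDouble(m.group(3)));
    }

    public static void main(String[] args) {
        System.out.println(annotate("10e5"));
        System.out.println(valueAsExponential("10e5"));
    }
}
```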

lfoppiano commented 5 years ago

So this is the exponential function, right? But then what about other functions, for example log? Since we don't have much data, maybe we could just support exp "in the future"?

kermitt2 commented 5 years ago

I think this is the exponential function, yes.

I think we don't express a value with a log (we don't write 3log(5)); if we have a log, it's normally in an equation with variables, not as a value.

Hmm, I don't see why we would wait for the future. The exponent e is quite frequent as part of a value, and it is not complicated.