nielstron / quantulum3

Library for unit extraction - fork of quantulum for python3
MIT License

connects abbreviations - interpret unusual words as unit #127

Open liarig opened 5 years ago

liarig commented 5 years ago

Describe the bug

The parser joins abbreviations together into a compound unit, which doesn't make sense.

>>> from quantulum3 import parser
>>> parser.parse('1 pplga')
[Quantity(1, "Unit(name="pint pint litre gigayear", entity=Entity("unknown"), uri=None)")]

Expected behavior

>>> parser.parse('1 pplga')
[Quantity(1, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]
nielstron commented 5 years ago

Thanks for your issue. The behaviour you describe is expected: the tool interprets everything that is not a common English word as a unit. Do you have a proposal to improve this behaviour? Maybe one could disregard all units in which the same unit appears twice. But sometimes this is wanted, e.g. km², which could be written as km*km.
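To make the trade-off concrete, here is a minimal sketch of the "same unit appears twice" check. It is not quantulum3 code: the function name `has_repeated_base` is hypothetical, and it assumes the base units of a parsed compound are already available as a plain list of names.

```python
from collections import Counter


def has_repeated_base(bases):
    """Return True if any base unit name occurs more than once."""
    return any(count > 1 for count in Counter(bases).values())


# 'pplga' was read as pint * pint * litre * gigayear, so it would be flagged.
print(has_repeated_base(["pint", "pint", "litre", "gigayear"]))
# 'gin' was read as gram * inch, which has no repeat and slips through.
print(has_repeated_base(["gram", "inch"]))
# km*km is flagged as well, although it is a legitimate way to write km².
print(has_repeated_base(["kilometre", "kilometre"]))
```

The last call shows the drawback mentioned above: a blanket rule would also reject intended squares.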

liarig commented 5 years ago

Thank you for your response. I think the case where the same unit appears more than once should be considered only if this unit may be multidimensional (as in your example: length squared). Otherwise it may be disregarded.

Interpreting different abbreviations written together as a compound measure may lead to mistakes.

>>> parser.parse('a gin')
[Quantity(1, "Unit(name="gram inch", entity=Entity("unknown"), uri=None)")]
nielstron commented 5 years ago

only if this unit may be multidimensional

On what basis would this then be decided? I can only imagine storing, for every unit, whether multidimensional cases exist or not, which sounds to me like a huge overhead and prone to errors.
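For illustration only, the per-unit flag could look roughly like the sketch below. Everything here is an assumption: the set `POWER_CAPABLE`, its contents, and the function `is_plausible_compound` are hypothetical and not part of quantulum3. The hand-maintained set is exactly the overhead in question.

```python
from collections import Counter

# Hypothetical, hand-maintained set of units that may sensibly be squared or cubed.
POWER_CAPABLE = frozenset({"metre", "kilometre", "inch", "second"})


def is_plausible_compound(bases, power_capable=POWER_CAPABLE):
    """Accept a compound unless a unit outside `power_capable` is repeated."""
    counts = Counter(bases)
    return all(count == 1 or base in power_capable for base, count in counts.items())


print(is_plausible_compound(["kilometre", "kilometre"]))  # km*km stays allowed
print(is_plausible_compound(["pint", "pint", "litre", "gigayear"]))  # repeated pint is rejected
print(is_plausible_compound(["gram", "inch"]))  # no repeat, so 'gin' is still accepted
```

Note that this rule only addresses repeated units; a compound such as gram * inch passes unchanged.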

Interpreting different abbreviations written together as a compound measure may lead to mistakes.

Currently, the 10,000 most common words of the English language are excluded from being interpreted as units. If you find additional common words (ideally a whole list of them) or have a better idea for filtering, I'd be glad to integrate them.
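As a rough sketch of how such a word filter behaves (not the library's actual implementation), assume a set of common words is consulted before a token is considered as a unit. The set `COMMON_WORDS` and the function `could_be_unit` are hypothetical stand-ins, and "gin" is included here only to show the effect of extending the list.

```python
# Hypothetical stand-in for the common-word list; the real list is far larger.
COMMON_WORDS = frozenset({"a", "the", "in", "gin"})


def could_be_unit(token, common_words=COMMON_WORDS):
    """A token is a unit candidate only if it is not a known common word."""
    return token.lower() not in common_words


print(could_be_unit("gin"))    # listed as a common word, so not a unit candidate
print(could_be_unit("pplga"))  # unknown word, so still treated as a possible unit
```

Extending the list fixes individual words such as "gin", but nonsense strings like "pplga" can never be covered this way.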

nielstron commented 5 years ago

Actually, this is in some form a duplicate of #35.