rezemika / oh_sanitizer

A corrector for the 'opening_hours' fields from OpenStreetMap
GNU Affero General Public License v3.0

Return a validation status #6

Closed frodrigo closed 6 years ago

frodrigo commented 6 years ago

In many cases, when sanitize_field returns a value different from the input, the only changes are cosmetic spaces. For the Osmose use case, it would be useful to know whether the input value is valid or not, even if it is not the canonical representation.

Minor changes like this one should not be reported as invalid:

-Mo-Fr 07:30-18:30 ; PH off
+Mo-Fr 07:30-18:30; PH off

Do you want to support this concept in your lib, or not?

rezemika commented 6 years ago

Why not, I agree in principle. However, I'm not sure how to do that. The space after the semicolon is in the specifications (here) and is parsed like any other part of the field. It would require two "levels of validity" or something like that.

Or, if only space changes are considered "minor changes", I could simply take the symmetric difference between the input and the output and check whether it contains only a space (set("abcd").symmetric_difference(set("ab cd")) == set(" ")). Would that fit your needs?

frodrigo commented 6 years ago

If you think it's that simple, and there are no other kinds of issues, I can also do it on my side.

rezemika commented 6 years ago

It should work fine, but the downside is that it would ignore all space corrections. But they're quite minor compared to others...

I think it would be quite difficult to do "multi-level" corrections, because of the way oh_sanitizer works. It parses the field, ignoring all spaces, case errors, etc. Then the syntax tree is read and each value is corrected (every keyword is lowercased, matched to its corresponding meaning, then returned in its correct form), and finally all lists of values are joined by spaces or commas according to the specifications.
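
To make that concrete, here is a toy sketch of the same idea. It is not oh_sanitizer's actual code: `toy_sanitize`, `fix_token` and the `CANONICAL` table are hypothetical names invented for illustration, and the real library uses a proper parser. It only shows why the original spacing is lost: whitespace is thrown away when the field is split, and re-created when the parts are joined.

```python
# Hypothetical table of canonical spellings (illustration only,
# not the table used by oh_sanitizer).
CANONICAL = {
    "mo": "Mo", "tu": "Tu", "we": "We", "th": "Th",
    "fr": "Fr", "sa": "Sa", "su": "Su",
    "ph": "PH", "off": "off",
}


def fix_token(token):
    """Correct each side of a range such as 'mo-FR'; unknown parts are kept."""
    parts = token.split("-")
    return "-".join(CANONICAL.get(part.lower(), part) for part in parts)


def toy_sanitize(field):
    """Rebuild a field from its tokens, ignoring the original spacing."""
    rules = []
    for raw_rule in field.split(";"):
        tokens = raw_rule.split()  # the input whitespace is discarded here
        if tokens:
            rules.append(" ".join(fix_token(t) for t in tokens))
    # The separators are regenerated here, so the output spacing is
    # always the canonical one, whatever the input looked like.
    return "; ".join(rules)


print(toy_sanitize("mo-FR 07:30-18:30 ;ph OFF"))
```

Since the output is regenerated from the tokens, a design like this has no record of which characters were "only spaces" in the input, which is why the check has to compare input and output afterwards.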

So, yes, you can ignore space corrections this way. As space errors aren't the most serious or the most frequent errors, it shouldn't harm that much...

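>>> from oh_sanitizer import sanitize_field
>>> field = "Mo-Fr 07:30-18:30 ; PH off"
>>> sanitized_field = sanitize_field(field)
>>> # "<=" (subset) rather than "==": the difference is empty when both strings already contain a space.
>>> set(sanitized_field).symmetric_difference(set(field)) <= set(" ")
True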
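
One caveat with comparing character sets: a set ignores order and repetition, so two fields made of the same characters in a different order would also pass, and the difference is empty (not a single space) whenever both strings already contain a space. A stricter variant, sketched below, compares the two strings with all whitespace removed. `classify` and its three status strings are hypothetical names for the caller's side, not part of oh_sanitizer.

```python
def strip_whitespace(value):
    """Remove every whitespace character from a string."""
    return "".join(value.split())


def classify(original, sanitized):
    """Return a coarse status for a field and its sanitized form.

    'canonical': the sanitizer changed nothing.
    'cosmetic':  only whitespace differs.
    'changed':   something other than whitespace was corrected.
    """
    if original == sanitized:
        return "canonical"
    if strip_whitespace(original) == strip_whitespace(sanitized):
        return "cosmetic"
    return "changed"


# The sanitized values are written out by hand here; in practice they
# would come from sanitize_field(field).
print(classify("Mo-Fr 07:30-18:30 ; PH off", "Mo-Fr 07:30-18:30; PH off"))
print(classify("mo-fr 07:30-18:30", "Mo-Fr 07:30-18:30"))
```

With this, a caller such as Osmose could report only the "changed" fields and treat "cosmetic" ones as valid.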