mysociety / yournextrepresentative

A website for crowd-sourcing structured election candidate data
https://candidates.democracyclub.org.uk/
GNU Affero General Public License v3.0

WIP: Improve date parsing in candidate forms #918

Closed wfdd closed 8 years ago

wfdd commented 8 years ago

Closes #907.

wfdd commented 8 years ago

The one remaining failure appears to be unrelated to anything I've done. I should probably split the changes to compat into a separate commit, if not a separate PR.

struan commented 8 years ago

( deleted idiotic comment resulting from me not reading anything :( )

wfdd commented 8 years ago

I can't recall if there's anything that remains to be done here - would anybody like to review things?

mhl commented 7 years ago

@wfdd I'm sorry, I feel really bad that I didn't give you more feedback on this at the time. I tried to review it properly, but found the review very heavy going, particularly once I started trying to understand ICU and the date stuff in the CLDR.

One thing that in particular wasn't clear to me was how you selected the skeletons and patterns used in DateParser, given the huge number of possible formats in the Unicode CLDR. It also wasn't clear to me how we would extend it for new languages.

Overall, my feeling is that the DateParser class should be a separate Python package with its own documentation. I think this could be a very popular package, too. What do you think? Basically, I would be much happier about using it in YNR if all the complexity was in a separately maintained package and codebase - it doesn't seem like something that's core to the project, and it would most likely be useful to many other people.

wfdd commented 7 years ago

No worries! Thanks for taking the time to respond.

It's been some time since I last looked at this, so I'm a little iffy on the details; but I'm sceptical if DateTimePatternGenerator is the right tool for the job here. Trouble is ICU doesn't consistently (?) validate dates and will sometimes happily parse dates of a different spec if they sort of, kinda match a less optimal expanded spec. Now imagine this happening in an unprioritised loop. If you look at lines 78 to 85, I did make a start on writing custom validators.
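
To make the loop problem concrete, here is a minimal sketch that uses Python's standard `datetime.strptime` as a stand-in for ICU (it is not PyICU and not the code in this PR). The candidate patterns and both helper names are hypothetical. It shows that one input can satisfy several patterns, so the result depends entirely on the order in which they are tried.

```python
from datetime import datetime

# Hypothetical candidate patterns; in an unprioritised loop their order is arbitrary.
CANDIDATES = ["%y/%m/%d", "%d/%m/%y", "%m/%d/%y"]


def all_matches(text, patterns=CANDIDATES):
    """Return every (pattern, date) pair that parses the text without error."""
    matches = []
    for pattern in patterns:
        try:
            matches.append((pattern, datetime.strptime(text, pattern).date()))
        except ValueError:
            continue  # this pattern does not fit; try the next one
    return matches


def first_match(text, patterns):
    """Return the date from the first pattern that parses, or None."""
    matches = all_matches(text, patterns)
    return matches[0][1] if matches else None


# The same string yields three different dates depending on the pattern.
for pattern, parsed in all_matches("01/02/16"):
    print(pattern, parsed.isoformat())
```

Putting the patterns in a deliberate priority order makes the outcome predictable, but it does not by itself reject loose matches; that still needs a validator.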

> One thing that in particular wasn't clear to me was how you selected the skeletons and patterns used in DateParser, given the huge number of possible formats in the Unicode CLDR. It also wasn't clear to me how we would extend it for new languages.

Maybe it helps to think of skeletons as the equivalent of gettext message IDs and of patterns as their corresponding localised format strings. [1] The idea was that we wouldn't have to cater for languages individually; it'd just work™ because the Unicode folk have already done all the work of pairing skeletons to patterns, by language (and script, and calendar). To give an example, the skeleton yMd will resolve to DD.MM.YYYY in the tr-TR locale and DD/MM/YY in the el-GR locale.
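
The gettext analogy can be sketched as a plain lookup. The table below is a hypothetical, hand-written stand-in for the CLDR data, and `resolve` is an illustrative name rather than an ICU or PyICU call. The two pattern strings are the ones quoted in this comment and have not been checked against the CLDR.

```python
# Hypothetical excerpt standing in for the CLDR data; real patterns would be
# supplied by ICU. The values are the ones quoted in the comment above.
PATTERNS_BY_LOCALE = {
    "tr-TR": {"yMd": "DD.MM.YYYY"},
    "el-GR": {"yMd": "DD/MM/YY"},
}


def resolve(skeleton, locale):
    """Look up the localised pattern for a skeleton, much like a gettext msgid.

    Returns None when the locale or the skeleton is unknown.
    """
    return PATTERNS_BY_LOCALE.get(locale, {}).get(skeleton)


# The calling code only ever names the skeleton; the locale picks the pattern.
for locale in ("tr-TR", "el-GR"):
    print(locale, resolve("yMd", locale))
```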

As for how this will pan out in practice, for manual input, we have to assume that the canonical (short) date format in the Unicode CLDR is the one the people speaking that language in that country actually use in their everyday life or one they're at the very least accustomed to. Copy-and-pasting computer-generated dates should not prove problematic. (Are there any other use cases?)

So, there's quite a few unquestioned assumptions and caveats to this (theoretical and otherwise) we'd have to contemplate. They all generally stem from the fact that the date/time bits of ICU and Unicode aren't suited to NLP; a date object's parse method is only really intended for parsing date strings it itself has generated. On the other hand, there's no good natural-language solution to this problem (that we have access to) that'll work for all languages (or any language bar English), so it might make sense to require input be reasonably regular.
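
One way to require reasonably regular input is a round-trip check: accept a string only if formatting the parsed date with the same pattern reproduces it exactly. The sketch below uses Python's standard `strptime` and `strftime` rather than ICU, and `parse_round_trip` is a hypothetical helper, not something from this PR.

```python
from datetime import datetime


def parse_round_trip(text, pattern):
    """Parse text with pattern, accepting it only if it formats back unchanged."""
    try:
        parsed = datetime.strptime(text, pattern)
    except ValueError:
        return None  # the pattern does not match at all
    if parsed.strftime(pattern) != text:
        return None  # a loose match, e.g. day or month without zero padding
    return parsed.date()


print(parse_round_trip("01/02/2016", "%d/%m/%Y"))  # accepted
print(parse_round_trip("1/2/2016", "%d/%m/%Y"))    # parsed leniently, then rejected
```

The trade-off is that anything the formatter would not have produced itself is refused, which is stricter than most people typing dates by hand would expect.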

If there's potential to this, I agree that it should be developed independently of YNR.


[1] In truth there's another level of indirection; a generic skeleton is paired to the nearest skeleton 'key' of a particular language's skeleton–pattern pairs.
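
As a rough illustration of that extra level of indirection, the sketch below picks the closest available key for a generic skeleton. The distance measure (same set of field letters, smallest difference in field widths) is a deliberately crude assumption and is not the matching algorithm the CLDR specifies; `nearest_key` and the key list are hypothetical.

```python
from collections import Counter


def nearest_key(skeleton, keys):
    """Return the key with the same fields as skeleton and the closest widths."""
    wanted = Counter(skeleton)  # field letter -> width, e.g. "yyyyMMdd" -> y:4, M:2, d:2
    best_key, best_distance = None, None
    for key in keys:
        have = Counter(key)
        if set(have) != set(wanted):
            continue  # a different set of fields can never match
        distance = sum(abs(have[field] - wanted[field]) for field in wanted)
        if best_distance is None or distance < best_distance:
            best_key, best_distance = key, distance
    return best_key


# Hypothetical skeleton keys available for one language.
AVAILABLE = ["yMd", "yMMMMd", "Md"]
print(nearest_key("yyyyMMdd", AVAILABLE))
```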