tidyverse / lubridate

Make working with dates in R just that little bit easier
https://lubridate.tidyverse.org
GNU General Public License v3.0

Feature Request: is_ISO8601 #867

Closed: billdenney closed 2 years ago

billdenney commented 4 years ago

I have an application where I need to detect whether a character string is formatted as required by ISO 8601. Given that format_ISO8601() (#629 / #700), which formats date-times as ISO 8601, was a good fit here, I thought that a detection method would also be useful.

What would you think about a function (or small family of functions) that was named something like is_ISO8601() which could do the following:

vspinu commented 4 years ago

What about the vectorised use case?

In order to have this one would need a dedicated ISO8601 parser, which doesn't seem to be worth the effort for such a tiny use case.

billdenney commented 4 years ago

Maybe I'm over-simplifying it, but I think that it could be a set of small functions built on reasonably straightforward regular expressions and grepl calls (and therefore inherently vectorized).
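To illustrate the grepl idea, here is a minimal sketch. It is not code from lubridate: the function name is_iso8601_date and its pattern are hypothetical, and it covers only the extended calendar-date form YYYY-MM-DD. It checks the shape and the field ranges, not whether the day actually exists in that month.

```r
# Hypothetical helper (not part of lubridate): extended calendar dates only.
is_iso8601_date <- function(x) {
  pattern <- "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$"
  # grepl() is vectorised over x, so no explicit loop is needed
  out <- grepl(pattern, x)
  # grepl() returns FALSE for NA input; report NA instead
  out[is.na(x)] <- NA
  out
}

is_iso8601_date(c("2020-02-29", "2020-13-01", "20200229", NA))
# TRUE FALSE FALSE NA
```

The basic form without separators ("20200229") is rejected here only because this sketch does not cover it, not because it is invalid ISO 8601.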

I agree that a specialized parser would not be worthwhile.

If the grepl method doesn't seem like a good fit, no worries.

vspinu commented 4 years ago

You might be right actually, but this is low priority. If you can put together a PR and a bunch of tests for it, I would be more than happy to include it in the code base.

billdenney commented 4 years ago

The regular expressions become convoluted, but I am building them algorithmically in a way that makes them reasonable to review (e.g. make the year part, then use that to make the date part, then use that and the time part to make the whole regexp). I'm also building many tests for each part, so it should be understandable.
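The compositional approach described above might look roughly like the following sketch. All names (opt, year, month_part, and so on) are hypothetical, and the coverage is deliberately much smaller than the real work in progress: no seconds, time zones, week dates, or ordinal dates.

```r
# Hypothetical building blocks; each piece can be reviewed and tested alone.
opt <- function(x) paste0("(?:", x, ")?")   # wrap a fragment as optional

year   <- "[0-9]{4}"
month  <- "(?:0[1-9]|1[0-2])"
day    <- "(?:0[1-9]|[12][0-9]|3[01])"
hour   <- "(?:[01][0-9]|2[0-3])"
minute <- "[0-5][0-9]"

# Compose from the smallest unit outwards, so that a time is only
# allowed after a complete year-month-day date.
time_part  <- paste0("T", hour, opt(paste0(":", minute)))
day_part   <- paste0("-", day, opt(time_part))
month_part <- paste0("-", month, opt(day_part))
full       <- paste0("^", year, opt(month_part), "$")

x <- c("2021", "2021-03", "2021-03-04T05:06", "2021T05", "2021-03-04T24")
grepl(full, x, perl = TRUE)
# TRUE TRUE TRUE FALSE FALSE
```

Because each level nests the next one as an optional suffix, truncated forms such as "2021" and "2021-03" are accepted without listing every variant by hand.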

This is now a work in progress.

billdenney commented 3 years ago

With a lot of work, I now have a super-regexp and the ability to generate all variants (optional second, minute, hour, day, week/month, year). The regexp itself is a beast:

(?:(158[3-9]|159[0-9]|1[6-9][0-9]{2}|[2-9][0-9]{3})(?:(?:-(0[1-9]|1[0-2])(?:-(?:(0[1-9]|[12][0-9]|3[01])(?:(?:(?:T([01][0-9]|2[0-3])|T([01][0-9]|2[0-3]):([0-5][0-9])(?::((?:[0-5][0-9])(?:[\.,][0-9]+)?))?)(?:(Z|\+00(?::00)?|[\+-]00:(?:15|30|45)|[\+-](?:0[1-9]|1[1-4])(?::(?:00|15|30|45))?))?)?)?))?|-W(0[1-9]|[1-4][0-9]|5[0-3])(?:-(?:([1-7])(?:(?:(?:T([01][0-9]|2[0-3])|T([01][0-9]|2[0-3]):([0-5][0-9])(?::((?:[0-5][0-9])(?:[\.,][0-9]+)?))?)(?:(Z|\+00(?::00)?|[\+-]00:(?:15|30|45)|[\+-](?:0[1-9]|1[1-4])(?::(?:00|15|30|45))?))?)?)?))?|(?:-(?:(00[1-9]|0[1-9][0-9]|[12][0-9]{2}|3[0-5][0-9]|36[0-6])(?:(?:(?:T([01][0-9]|2[0-3])|T([01][0-9]|2[0-3]):([0-5][0-9])(?::((?:[0-5][0-9])(?:[\.,][0-9]+)?))?)(?:(Z|\+00(?::00)?|[\+-]00:(?:15|30|45)|[\+-](?:0[1-9]|1[1-4])(?::(?:00|15|30|45))?))?)?)?))?))?)?

With this easier to review visualization.

The part that I'd prefer to be able to fix is making it so that time is only represented once. I think that look-ahead and look-behind regexps may be the right answer, but I don't understand enough about them yet to be sure that's correct.
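One possible way to write the time part only once is sketched below. It swaps in a different technique from the look-ahead/look-behind idea mentioned above: a PCRE named subpattern declared in a (?(DEFINE)...) group and reused through the subroutine call (?&time). This assumes the regexp is always run with perl = TRUE, and the pattern is a simplified, hypothetical stand-in for the full one (no seconds, time zones, or ordinal dates).

```r
# Sketch only: define the time fragment once, then call it in each branch.
pattern <- paste0(
  # (?(DEFINE)...) declares the named group without matching anything
  "(?(DEFINE)(?<time>T(?:[01][0-9]|2[0-3])(?::[0-5][0-9])?))",
  "^[0-9]{4}(?:",
  # calendar branch: -MM, optionally -DD, optionally a time
  "-(?:0[1-9]|1[0-2])(?:-(?:0[1-9]|[12][0-9]|3[01])(?&time)?)?",
  # week branch: -Www, optionally -D, optionally a time
  "|-W(?:0[1-9]|[1-4][0-9]|5[0-3])(?:-[1-7](?&time)?)?",
  ")?$"
)

x <- c("2021-03-04T05:06", "2021-W09-4T05", "2021-03", "2021-03T05", "2021-W09-4T25")
grepl(pattern, x, perl = TRUE)
```

In this sketch the first three strings should be accepted and the last two rejected: a time directly after year-month is not allowed, and 25 is not a valid hour. The trade-off is that the pattern now depends on PCRE-specific syntax and will not work with R's default regex engine.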

vspinu commented 2 years ago

Sorry for not coming back to this earlier, but I am afraid this is too complex. I am pretty sure there should be C or C++ code somewhere to test for this. Otherwise, it's probably not very difficult to write our own.

billdenney commented 2 years ago

Yeah, it makes sense that this isn't a good fit as-is.