globalgov / messydates

R package for Extended Date/Time Format (EDTF)
https://globalgov.github.io/messydates
Other
15 stars 1 forks source link

Speed up expand functions #49

Closed BBieri closed 2 years ago

BBieri commented 2 years ago
########################## Benchmarking as.Date() ##############################

# ARCHIGOS has 17686 observations and resolving the Beg variable by taking the
# minimum of each date takes about 5 minutes on a 4th gen AMD
# processor. We should think about cutting this processing time down to allow
# fast documentation rendering of many-universe datasets and quick resolution of variables when 
# joining these datasets.

length(manystates::leaders$ARCHIGOS$Beg)
#> [1] 17686
system.time(as.Date(manystates::leaders$ARCHIGOS$Beg, min))
#>    user  system elapsed 
#>  308.26    3.49  319.01

Created on 2022-04-05 by the reprex package (v2.0.1)

henriquesposito commented 2 years ago

Great, thank you Bernhard! And yes, you are right, we should find out ways of speeding up this process.

However, just FYI, in messydates how dates are actually resolved is not slow at all, these are often very simple functions. What slows down the resolve process is the fact that date ranges and sets, as well as other imprecise or ambiguous dates, are expanded individually before being resolved (the resolve functions call expand() beforehand). You can see that if you "profile" the resolve functions (see the profvis package for profiling for example).

BBieri commented 2 years ago

Hi Henrique, thanks for the quick reply and the clarifications ! I've just assigned you this issue as well since you have the most experience with the code but I am still here in case you need any help. Let me know :)

jhollway commented 2 years ago

Could expansion be accelerated by checking the vector for unspecified, negative, sets, etc? Might demand that we create some more is*() logical tests for these annotations.

henriquesposito commented 2 years ago

Yes, thank you, that is indeed one option! I will look into that.

We can also add an extra logical argument for expand() not to expand ranges or sets since for those we can actually get the min and max dates, for example, without expanding the whole range or set.

jhollway commented 2 years ago

Yes, that should also accelerate things.

henriquesposito commented 2 years ago

I have tried to both add an extra logical argument for expand() not to expand ranges or sets and adding an extra logical test for uncertain dates (is_uncertain()). Both appear to accelerate things a great deal by simply identifying when expansion is necessary (and not).

However, since the logical tests are a bit more flexible and can be used for all resolve methods, this is how it is currently implemented.

For example:

# main branch (currently on CRAN)
#system.time(as.Date(manystates::leaders$ARCHIGOS$Beg, min))
#    user     system   elapsed
# 144.502   0.698   145.284  
# develop branch (with new logical tests is_uncertain())
system.time(as.Date(manystates::leaders$ARCHIGOS$Beg, min))
#    user   system  elapsed
#  0.308   0.027   0.337 

Now although this should already solve the issue, going through expand made me realise that most of what is done by the function is not actually expanding messydates dates, but formatting dates to the point they are "expandable". @jhollway I am wondering if we should not front-load this formatting a bit more into as_messydate() instead? For example, as_messydate("2001") could already return "2001-01-01..2001-12-31. Please let me know what you think, thank you.

henriquesposito commented 2 years ago

Also, thank you @BBieri for raising this issue, great catch!

jhollway commented 2 years ago

Excellent @henriquesposito , this is a very considerable improvement!

I don't think we should already convert "2001" into "2001-01-01..2001-12-31", as this is considerable more verbose. More succinct representations of messydates should always be preferred.