Overwhelming information with long time series

adrfantini commented 4 years ago

I am testing feasts with a long time series: 5 years, hourly, 45k observations. The default feasts functions seem to struggle a bit with this much data: although the series exhibits strong seasonalities (daily, weekly and yearly), these are hard to see in the feasts output. I may be using the functions incorrectly, or maybe the functions could be improved to better handle large datasets.

Here is an example (I'll not upload all plots, just a couple as examples):

download.file("https://srv-file7.gofile.io/download/MtDmN2/all.csv", method = "wget", destfile = "all.csv")

library(tsibble)
library(feasts)
library(readr)
library(tidyr)
library(dplyr)
library(lubridate)
library(purrr)

d = read_csv("all.csv", skip = 1) %>%
    set_names("ds", "tmp", "y") %>%
    select(-tmp) %>%
    mutate(ds = dmy_hms(ds)) %>%
    group_by(ds) %>% summarise(y = mean(y, na.rm = TRUE)) %>% # this averages duplicate rows (there are a handful)
    mutate(ds = ds[1] + 60*60*((1:nrow(.))-1) ) %>% # fix the fact that sometimes the measurement is done 1 or 2 s too late
    as_tsibble(index = ds)

### Visualise with feasts

d %>% gg_season(y, period = "year") # Too much overplotting, even with alpha!
d %>% gg_season(y, period = "day", alpha = 0.1) # Too much overplotting, even with alpha!
d %>% gg_season(y, period = "week", alpha = 0.5) # Too much overplotting, but we start to see something useful

d %>% gg_subseries(y) # Maybe it's the data, but this plot looks too crowded to me as well

d %>% ACF(y) %>% autoplot # This works well to show the daily cycle
d %>% ACF(y, lag_max = 24*30*3) %>% autoplot # We see that the trend is quite stable, but correlation starts to drop after a few weeks (about 0.8 to 0.4 in 3 months), as can be expected

d %>% model(STL(y ~ season(window = "periodic"))) %>% components() %>% autoplot() # STL decomposition is too crowded to be able to understand much (current feasts releases fit STL via model())

Here are a couple of examples of the above plots (season and subseries): [image: gg_season] [image: gg_subseries]

Can something be done to improve the visualisation of large(r) datasets?

mitchelloharawild commented 4 years ago

Quick tip with your data cleaning - lubridate::round_date() is a nice way to fix minor irregularities in the measurement time.
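As a minimal sketch of that tip (the timestamps below are made up for illustration, not taken from the dataset in this issue), rounding to the nearest hour removes delays of a second or two:

```r
library(lubridate)

# Hypothetical hourly readings, some logged 1 or 2 seconds late
ds <- ymd_hms("2019-01-01 00:00:00", tz = "UTC") + hours(0:4) + c(0, 1, 2, 0, 1)

# Snap every timestamp to the nearest full hour
ds_clean <- round_date(ds, unit = "hour")
```

Unlike rebuilding the index from the first timestamp, rounding leaves any genuinely missing hours as gaps rather than silently shifting later observations.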

It looks to me that the plots are working as intended. They're plotting all of the data you've provided, using the visualisation method specified (season plots, subseries plots, etc).

If the information is overwhelming, you can consider using a different plot type (such as calendar plots with sugrrants) or aggregating/summarising/focusing your data in some way.

adrfantini commented 4 years ago

> Quick tip with your data cleaning - lubridate::round_date() is a nice way to fix minor irregularities in the measurement time.

Oooh, thanks!

> It looks to me that the plots are working as intended.

Yes, I agree. Everything is working well; however, feasts could aim to implement additional tools for larger datasets such as this one. One example could be using geom_smooth for gg_season, maybe?

> aggregating/summarising/focusing your data in some way.

Could feasts maybe facilitate this?

mitchelloharawild commented 4 years ago

> Yes, I agree. Everything is working well; however, feasts could aim to implement additional tools for larger datasets such as this one. One example could be using geom_smooth for gg_season, maybe?

As feasts uses ggplot2 to produce the graphics, you should be able to add geom_smooth() to the result, e.g. data %>% gg_season(y) + geom_smooth().

We have some people in our research group thinking about how time series graphics can be designed for larger datasets, and automatic identification of informative graphics for time series. New methods and tools will be designed to work with these packages.

> Could feasts maybe facilitate this?

tsibble supports this by providing temporally aware versions of dplyr verbs.

Aggregating means using summarise() to combine multiple time series (keys) together. Summarising (or, perhaps more accurately, temporal aggregation) can be done with dt %>% index_by(...) %>% summarise(). Focusing on a certain time period can be achieved with filter().
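A rough sketch of the temporal aggregation and filtering described above. The synthetic hourly series `d` and the choice of daily means are assumptions made for illustration; only index_by(), summarise() and filter() come from the comment itself.

```r
library(tsibble)
library(dplyr)
library(lubridate)

# Synthetic hourly series (60 days) standing in for the real data
set.seed(1)
d <- tibble(
  ds = ymd_hms("2019-01-01 00:00:00", tz = "UTC") + hours(0:(24 * 60 - 1)),
  y  = 10 + 5 * sin(2 * pi * hour(ds) / 24) + rnorm(length(ds))
) %>%
  as_tsibble(index = ds)

# Temporal aggregation: collapse hourly values to daily means
daily <- d %>%
  index_by(date = as_date(ds)) %>%
  summarise(y = mean(y))

# Focusing: keep only the first week of observations
first_week <- d %>%
  filter(ds < ymd("2019-01-08", tz = "UTC"))
```

Either result can then be passed to gg_season() or gg_subseries() in place of the full hourly series, which should reduce overplotting at the cost of hiding the within-day pattern.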

adrfantini commented 4 years ago

> data %>% gg_season(y) + geom_smooth()

This smooths every single line, I guess because it is missing a group aesthetic for smoothing the different lines. If I am not mistaken, this can't be done outside of gg_season, since the grouping needs to happen on the internally defined id variable.
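One possible workaround, sketched here with plain ggplot2 rather than gg_season(), is to build the seasonal variables by hand so that the smoothing group is under your control. The synthetic data and the column names `hour_of_day` and `month_id` are hypothetical, and this is not how gg_season() works internally.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Synthetic hourly series covering January to March 2019 (90 days)
set.seed(1)
d <- tibble(
  ds = ymd_hms("2019-01-01 00:00:00", tz = "UTC") + hours(0:(24 * 90 - 1)),
  y  = 10 + 5 * sin(2 * pi * hour(ds) / 24) + rnorm(length(ds))
)

# One smooth per month over the daily cycle, instead of one line per day
p <- d %>%
  mutate(
    hour_of_day = hour(ds),
    month_id    = factor(month(ds))
  ) %>%
  ggplot(aes(x = hour_of_day, y = y, colour = month_id)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
```

Printing `p` draws three smoothed daily profiles, one per month, because the discrete colour aesthetic defines the groups that geom_smooth() fits.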

> We have some people in our research group thinking about how time series graphics can be designed for larger datasets

That's excellent to know! Feel free to close this, or use it as a possible use case / discussion zone for that work.

tidyverts / feasts

Overwhelming information with long time series #75