ropensci / weathercan

R package for downloading weather data from Environment and Climate Change Canada
https://docs.ropensci.org/weathercan
GNU General Public License v3.0

Function to identify potentially continuous stations #117

Open steffilazerte opened 3 years ago

steffilazerte commented 3 years ago

ECCC weather stations are sometimes replaced, meaning that a new station id is created for data coming from the same location. It can be tricky to amalgamate data from these stations.

It would be nice to have a function that acts on downloaded data to identify stations which are potentially paired and could be homogenized.

See also #116
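
For context, combining two such stations by hand currently looks something like this (station ids are made up for illustration; this assumes daily data from weather_dl()):

library(dplyr)
library(weathercan)

# Hypothetical pair of ids assumed to be the same physical site (made up for
# illustration); the old id stops reporting and the new one takes over
old <- weather_dl(station_ids = 1795, interval = "day", end = "1922-12-31")
new <- weather_dl(station_ids = 50000, interval = "day", start = "1923-01-01")

# Stack and sort to get one (hopefully) continuous record
continuous <- bind_rows(old, new) %>%
  arrange(date)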

salix-d commented 2 years ago

Was looking at this. Could be done with the coordinates.

Currently, it's a bit harder on the user end because the coordinates kept aren't precise enough, but the data from which they're collected does have a more precise set of coordinates.

While doing that I found out that the time zones are wrong. For example, the stations in Alberta should be GMT-7, not GMT+7. Might be due to the issues with lutz/lubridate? For some reason, when trying the functions to get the tz, it returns that most of them aren't valid names...

There's one station with completely wrong coordinates (edmonton eer) on the website. One station is missing a minus sign (POINT LEPREAU); it should be 45.04 -66.26. It's correct on the POINT LEPREAU CS one though, but it makes the method I thought of harder.

I guess there should first be a check to make sure that the coordinates are in Canada / in the province they say they're in...

steffilazerte commented 2 years ago

Hi @salix-d, thanks for the comments!

For the timezones, I'm out of town for a bit so can't check directly, but I think it might depend on whether it's the Etc/GMT+/-X names, as those are backwards from what we'd expect (otherwise it's definitely something to fix!)
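
For reference, the Etc/GMT zones do use an inverted sign in base R:

# Etc/GMT+7 is UTC-7 (Mountain Standard Time); the sign is flipped
format(as.POSIXct("2023-01-15 12:00:00", tz = "Etc/GMT+7"), "%Y-%m-%d %H:%M %z")
#> [1] "2023-01-15 12:00 -0700"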

Good point about the data quality. It might be easier to use a two-phase approach: a helper function to create a list of stations to amalgamate based on locations (given a timeframe and variables), and then a second step/function to actually create a continuous record when given a vector of stations, timeframe and variables of interest (independent of station location).

This would allow/force the user to be in control of (take responsibility for) choosing which stations should be combined, but would still help by producing a list to start with at least. A bit of a compromise, given we know there are probably errors in the station records. Another helper function to validate station locations is a good idea too. Perhaps we can report mistakes to ECCC and get them fixed!
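
Roughly, the interface could look something like this (function and argument names below are placeholders only, nothing is implemented yet):

# Step 1 (hypothetical helper): propose groups of station ids that share a
# location and have complementary date ranges for the requested interval
candidates <- stations_find_continuous(interval = "day",
                                        start = 1990, end = 2020,
                                        coord_tol = 0.01)

# Step 2 (hypothetical): the user reviews `candidates`, picks the ids they
# trust, and builds one continuous record from just those stations
continuous <- weather_combine(station_ids = c(1795, 50000),  # made-up ids
                              interval = "day",
                              start = "1990-01-01", end = "2020-12-31")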

I'll be able to be more specific once I'm back πŸ˜„

salix-d commented 2 years ago

For the data quality, those were the only two stations with bad coordinates; I sent feedback through their website and they were quick to respond.

There are also 5 that have NA as coordinates; I might enquire about those too. But I think we could guess the coordinates from the city they're in / closest to. Same for the time zones.

I can make a branch; I changed a bit how the data get parsed so it returns the exact coordinates. With that it's pretty easy to make a list of stations by coordinates and find the ones that could be amalgamated (roughly the sketch below). This dataset could be made ahead of time and update itself at the same time as the station data is updated, or be done by the user. It's about two lines of code anyway.
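
Roughly, something like this (shown with the lat/lon columns in the current stations() data; with the more precise coordinates it would be the same idea):

library(dplyr)
library(weathercan)

# Group station ids that report identical coordinates; more than one id at
# the same lat/lon is a candidate for amalgamation
candidates <- stations() %>%
  filter(!is.na(lat), !is.na(lon)) %>%
  distinct(prov, station_name, station_id, lat, lon) %>%
  group_by(lat, lon) %>%
  filter(n_distinct(station_id) > 1) %>%
  ungroup()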


For the timezone thing though, I couldn't make lutz work at all, so I don't understand how it somehow works when I run the update function from the weathercan package Oo.

I had this exact issue: https://github.com/ateucher/lutz/issues/10 (from Nov 25, 2020), but it's not a lutz issue, it's a lubridate issue. Or maybe an issue with the Rcpp library cctz. I didn't manage to make that work either. I ended up getting my data by web-scraping the Wikipedia time zone list page 🤣 I will look up the backwards thing, but I remember seeing Etc/GMT-4, Etc/GMT+5 and Etc/GMT-6 for one province or territory, and that doesn't seem right.

Since it's all in Canada, I think we could just have a dataset with the Canadian time zones without needing to rely on external packages. Only a couple of provinces/territories have multiple time zones, so most of it could be determined by province and the rest with coordinate limits, most likely (roughly the sketch below).
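
A rough sketch of what such a dataset could look like (standard-time offsets for a few single-zone provinces only; provinces with more than one zone, and Newfoundland's half-hour offset, would still need coordinate limits):

# Partial lookup, standard time only; note Etc/GMT names flip the sign
# (Alberta is UTC-7, i.e. "Etc/GMT+7")
prov_tz <- data.frame(
  prov = c("AB", "SK", "MB", "NB", "NS", "PE"),
  tz   = c("Etc/GMT+7", "Etc/GMT+6", "Etc/GMT+6",
           "Etc/GMT+4", "Etc/GMT+4", "Etc/GMT+4")
)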

salix-d commented 2 years ago

Got a reply for Edmonton Eer:

Yes, we’re sure it was in Edmonton and it was probably installed on the grounds of the Prairies Storm Prediction Centre office at 9250 49th St NW. So, you can use these coordinates: 53.5291 -113.4158.

Not sure if that's going to be updated on their end, but we could add a line of code to fix the coordinates of that station when updating the station data (e.g. the sketch below).
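
Something like this (hypothetical; assumes the station table is in a data frame called stn and the station_name spelling matches exactly):

library(dplyr)

# Override the bad EDMONTON EER coordinates with the ones ECCC provided
stn <- stn %>%
  mutate(lat = if_else(station_name == "EDMONTON EER", 53.5291, lat),
         lon = if_else(station_name == "EDMONTON EER", -113.4158, lon))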

steffilazerte commented 2 years ago

Hi @salix-d, I'm back and finally caught up on work so I can take a look at this properly!

Fixing data in weathercan: I try really hard not to change data that comes from ECCC. I want weathercan to organize the data and make it easy for people to get, but there can be so many problems with weather data that I think it's really important for issues to be fixed either by ECCC or by the users themselves with their own data. I'm concerned that if we keep making fixes in the package, those fixes may someday become bugs themselves. Because I have only limited time to devote to weathercan (and ECCC keeps me on my toes with changes), I want to keep things as simple as possible (and it's already more complicated than I wanted!)

I love that you've found problems and reported them to ECCC, I think that's the best way forward. Possibly another feature for weathercan could be to have data checks that alert users to potentially problematic data, but that would be a different feature.

Timezones: I prefer to use the lutz package because timezones can get quite fiddly in certain areas (northern BC, for example: https://en.wikipedia.org/wiki/Time_in_Canada), so coordinate cutoffs won't be very simple 😞. All the funny timezones calculated are based on:

  1. Lack of coordinates (and these should be NA, I've put a fix on the dev branch, thanks for catching that!)
  2. Incorrect coordinates from ECCC (all of which you identified and notified ECCC about, awesome!)

Therefore, I prefer to keep the current system for timezones as it does seem to be working and eliminates some of the fiddly business with timezone boundaries.

Lat/Lons: I'm so excited that you realized that the Degree/Minute/Second coordinates are sometimes more precise than the decimal degree coordinates! This is something users have been asking for, and I've always had to say we don't have the data. Unless you're talking about another source entirely? i.e. not https://drive.google.com/uc?export=download&id=1HDRnj41YBWpMioLPwAFiLlK4SK8NV72C? You said you made a fix to use exact coordinates, which is great! I made one locally too (before I re-read your comment above 🤦).

Identifying continuous stations: If you're still interested in implementing a function (or two-step function) for identifying stations to combine, I'd love to have your contribution! I'd suggest working off (or merging in) the dev branch as I've put a couple of more recent fixes there. You can include your fix to the coordinates in that pull request (or I can push mine, whichever you prefer).

Somewhat in line with the note above about not changing data, I think I prefer the function to be separate and run specifically by users. Then this feature can be used to help users find good stations, but it also gives us the chance to make it more flexible (now or in the future), with things like distance cutoffs (like in station_search(); see the example below), which would ultimately allow users to choose which stations to join. This way they can omit stations they don't trust, or create continuous records even from stations that aren't exactly replacing each other, but which are perhaps good enough for a given study. What do you think?
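
For example, station_search() already takes coordinates and a distance cutoff, which could feed into that choice (the coordinates and cutoff here are just examples):

library(weathercan)

# Stations with daily data within ~1 km of a point of interest
station_search(coords = c(52.87, -112.28), dist = 1, interval = "day")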

Finally we might want a way of identifying the best stations for continuous data of a particular type (e.g. temperature data).

I realize that these ideas are changing the original issue I raised from simply identifying stations that replace each other, to creating a tool for identifying the best stations to use to get a continuous data record! Perhaps too much?

salix-d commented 2 years ago

I love that you've found problems and reported them to ECCC, I think that's the best way forward. Possibly another feature for weathercan could be to have data checks that alert users to potentially problematic data, but that would be a different feature.

They haven't fixed the issues yet, but we could do the same location check they do on their search form to determine whether the coordinates are actually in Canada and raise a flag if they're not.

I'm concerned that if we keep making fixes in the package, those fixes may someday become bugs themselves.

That way it's just a flag and doesn't change the data, so we don't have to worry. And if the data get fixed, there just won't be any flags. Something like the sketch below.
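
A rough version of that flag (the bounding-box limits for Canada are approximate and only meant for flagging, not a proper spatial check):

library(dplyr)
library(weathercan)

# Flag stations whose coordinates are missing or fall outside a rough
# bounding box for Canada; the data themselves are left unchanged
flagged <- stations() %>%
  distinct(prov, station_name, station_id, lat, lon) %>%
  mutate(coords_suspect = is.na(lat) | is.na(lon) |
           lat < 41.6 | lat > 83.2 | lon < -141.1 | lon > -52.5)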

Therefore, I prefer to keep the current system for timezones as it does seem to be working and eliminates some of the fiddly business with timezone boundaries.

Did you manage to make the lutz package work locally, and not just while using the weathercan package? Because I didn't. When running the function line by line to figure out what everything did, it would always return 'invalid time-zone' or something like that, so I had to do a workaround just to be able to test the functions :/

I'm so excited that you realized that the Degree/Minute/Second coordinates are sometimes more precise than the decimal degree coordinates! This is something users have been asking for, and I've always had to say we don't have the data. Unless you're talking about another source entirely? i.e. not https://drive.google.com/uc?export=download&id=1HDRnj41YBWpMioLPwAFiLlK4SK8NV72C?

But we do have the data; it's right in that CSV link and it's being fetched by the function, it just gets removed when the data is parsed for some reason.

You said you made a fix to use exact coordinates, which is great! I made one locally too (before I re-read you comment above 🀦).

How did you do the fix without the data then?

I realize that these ideas are changing the original issue I raised from simply identifying stations that replace each other, to creating a tool for identifying the best stations to use to get a continuous data record! Perhaps too much?

I think both features would serve a purpose!

steffilazerte commented 2 years ago

Flags are a great idea, a good compromise. I think I'm a bit confused by where you're getting the lat/lon information. I looked at the CSV file and the lat/lon in decimal degrees (columns Latitude (Decimal Degrees) and Longitude (Decimal Degrees)) only have two decimal places (e.g. 48.87). The other columns (Latitude and Longitude) sometimes have more precision but need to be converted from degrees/minutes/seconds to decimal degrees.

Are you saying that you can parse the decimal degree columns differently to get more precision?

For lutz, I have no problems:

> lutz::tz_lookup_coords(48.87, -123.28, method = "accurate")
[1] "America/Vancouver"
> packageVersion("lutz")
[1] β€˜0.3.1’

salix-d commented 2 years ago

For lutz, it's after you get "America/Vancouver", when you try to get the time zone offset:

lutz::tz_offset(Sys.Date(), "America/Vancouver")
Error: CCTZ: Unrecognized output timezone: "America/Vancouver"
packageVersion("lutz")
[1] β€˜0.3.1’

For the data, I was using the Latitude and Longitude columns as-is. They usually have more precision, or at least they had more precision in the cases where the two-decimal decimal degrees hinted the stations might be the same. (As in, it allowed me to confirm they weren't.) And for that part, since you just want to know whether they are the same between stations, you don't really need to convert them. The decimal degrees are good enough for the time zone.

If people wanted to have them in decimal degrees and there isn't already an R function for that, I found the code for it in the source of a web page converter; I can translate the JS into R.
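
It may not even need translating; the conversion is short in base R (assuming the degree/minute/second parts have already been split out of whatever format the columns use):

# Degrees/minutes/seconds to decimal degrees; a degree value of exactly 0
# would need its sign handled separately
dms_to_dd <- function(deg, min, sec) {
  sign(deg) * (abs(deg) + min / 60 + sec / 3600)
}

dms_to_dd(48, 52, 12)    #  48.87
dms_to_dd(-123, 16, 49)  # -123.2803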

salix-d commented 2 years ago

When is the next coworking session? Want to work on that together then?

steffilazerte commented 2 years ago

Ah! I understand the confusion with lutz. I didn't realize that lutz had a tz_offset() function; weathercan doesn't use that one! The tz_offset() in weathercan is an internal function found in R/utils.R (note that it doesn't have lutz:: in front of it).

For lat/lon, we've always had precision to 2 decimal places. You don't see it in the tibble because of the way tibbles truncate data for display (but it's still there). If you convert to a data.frame, you can see that the data is there:

> stations() %>% data.frame() %>% head()
  prov        station_name station_id climate_id WMO_id TC_id   lat     lon  elev        tz interval start  end
1   AB            DAYSLAND       1795    301AR54     NA  <NA> 52.87 -112.28 688.8 Etc/GMT+7      day  1908 1922
2   AB            DAYSLAND       1795    301AR54     NA  <NA> 52.87 -112.28 688.8 Etc/GMT+7     hour    NA   NA
3   AB            DAYSLAND       1795    301AR54     NA  <NA> 52.87 -112.28 688.8 Etc/GMT+7    month  1908 1922
4   AB EDMONTON CORONATION       1796    301BK03     NA  <NA> 53.57 -113.57 670.6 Etc/GMT+7      day  1978 1979
5   AB EDMONTON CORONATION       1796    301BK03     NA  <NA> 53.57 -113.57 670.6 Etc/GMT+7     hour    NA   NA
6   AB EDMONTON CORONATION       1796    301BK03     NA  <NA> 53.57 -113.57 670.6 Etc/GMT+7    month  1978 1979
  normals normals_1981_2010 normals_1971_2000
1   FALSE             FALSE             FALSE
2   FALSE             FALSE             FALSE
3   FALSE             FALSE             FALSE
4   FALSE             FALSE             FALSE
5   FALSE             FALSE             FALSE
6   FALSE             FALSE             FALSE

Or is there still something I'm missing?

I'd love to work on this during coworking, but we'd have to wait until September!

salix-d commented 2 years ago

I was testing the stations_dl() function, which uses stations_dl_internal() (https://github.com/ropensci/weathercan/blob/master/R/stations.R line 137), because that's where the longitude and latitude get removed, and it does use tz_offset().

When testing line by line, when I get to line 237, I get this error:

Error in `dplyr::mutate()`:
! Problem while computing `tz = purrr::map_chr(.data$tz, ~lutz::tz_offset(.x))`.
Caused by error:
! If dt is a character or a Date, you must supply a time zone
Run `rlang::last_error()` to see where the error occurred.

When running weathercan:::stations_dl(), I weirdly only get this warning:

Problem while computing `tz = purrr::map_chr(.data$tz, ~tz_offset(.x))`. unknown timezone 'NA'

I really don't understand why the function works when run as a whole but not when I run it line by line...

steffilazerte commented 2 years ago

Yes, but my previous comment still holds: weathercan should be using the internal function tz_offset(), not lutz::tz_offset().

In your first example, you've changed the code to use lutz::tz_offset() but it shouldn't use that one. I wrote an internal function that just happened to have the same name (I didn't realize that lutz had the same function name). weathercan's tz_offset() is here: https://github.com/ropensci/weathercan/blob/5e1cefa42d89dc6fef0690c72e19a2b84df9e83a/R/utils.R#L1-L10

I'm not sure where the second warning comes from, as weathercan's tz_offset() currently returns Etc/GMT+0 with NA_character_ or an error with any other kind of NA.
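
For anyone following along, the general idea (not the actual implementation linked above, just a base R sketch) is to map an Olson name to a fixed Etc/GMT zone via its standard-time UTC offset:

# Sketch only: standard-time (January) offset of an Olson zone, written as an
# Etc/GMT name (which inverts the sign); half-hour zones like Newfoundland
# would need special handling
tz_to_etc_gmt <- function(tz) {
  offset_h <- as.numeric(format(as.POSIXct("2023-01-15 12:00:00", tz = tz),
                                "%z")) / 100
  paste0("Etc/GMT", ifelse(offset_h <= 0, "+", "-"), abs(offset_h))
}

tz_to_etc_gmt("America/Vancouver")  # "Etc/GMT+8"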

Also, rather than using weathercan:::stations_dl(), you might have better luck troubleshooting by using Ctrl-Shift-L (which calls devtools::load_all(".")). This way it loads all the internal package functions so you can troubleshoot more precisely.

Make sense?

steffilazerte commented 2 years ago

And if it's still giving you issues, let's have a video chat to figure it out!

salix-d commented 2 years ago

Yes, makes sense! I think I skipped a sentence reading your last reply ^^'. With the internal function everything runs smoothly :) And the only time zone that's still negative ("Etc/GMT-6") is the one station (LEPREAU) that has coordinates outside of Canada, so it would be flagged anyway.