Open steffilazerte opened 3 years ago
Was looking at this. Could be done with the coordinates.
Currently, it's a bit harder on the user end because the coordinates kept aren't precise enough, but the data form which it's collected does have a more precise set of coordinates.
While doing that I fount out that the timezone are wrong. For example, the stations in Alberta should be GMT-7 not GMT+7. Might be due to the issues with lutz/lubridate? For some reason when trying the functions to get the tz, it returns that most of them aren't valid names...
There's one station with completely wrong coordinates (edmonton eer) from the website. On stations is missing a minus sign (POINT LEPREAU) should be 45.04 -66.26. It's correct on the POINT LEPREAU CS one thought, but it makes the method I though of harder.
I guess there should be first, a check to make sure that the coordinates are in Canada/in the province they say they are...
Hi @salix-d, thanks for the comments!
For the timezones, I'm out of town for a bit so can't check directly, but I think it might depend on whether it's the name Etc/GMT+/-X as those ones are backwards to what we'd expect (otherwise definitely something to fix!)
Good point about the data quality. It might be easier to use a two phase approach, a helper function to create a list of stations to amalgamate using locations (given timeframe and variables), and then a second step/function to actually create a continuous record when given a vector of stations, timeframe and variables of interest (independent of station location).
This would allow/force the user to be in control (take responsibility) of choosing which stations should be combined, but would still help get a list to start with at least. A bit of a compromise given we know there are probably errors in the station records. Another helper function to validate station locations is a good idea too. Perhaps we can report mistakes to ECCC and get them fixed!
I'll be able to be more specific once I'm back π
For the data quality, those were the only two stations with bad coordinates and I sent feedback through their website and they were quick to respond.
There are also 5 that have NA as coordinate, I might enquire about those too. But I think we could guess the coordinates from the city they're in / closest to. Same for the time zones.
I can make a branch, I changed a bit how the data get parsed so it returns the exact coordinates. With that it's pretty easy to make a list of station by coordinates and find the ones that could be amalgamed. This dataset could be already made ad updated itself at the same time as the station data is updated, or done by the user. It's about two line of codes anw.
For the timezone things though, I couldn't make lutz work at all so I don't understand how when I run the update function from the weathercan package it somehow works Oo.
I had this exact issue : https://github.com/ateucher/lutz/issues/10 (from Nov 25, 2020) but it's not a lutz issue it's a lubridate issue. Or maybe an issue with the Rccp library cctz. I didn't manage to make that work either. I ended up getting my data from webscrapping the wikipedia time zone list page π€£ I will look up about the backard thing, but I remember seeing Etc/GMT-4, Etc/GMT+5 and Etc/GMT-6 for one province or territory, and that doesn't seem right.
Since, it's all in Canada, I think we could just have a dataset with the Canadian time zone without needing to rely on external packages. Only a couple prov/territories have multiple time zone, so most of it could be determined by that and the rest with coord limits most likely.
Got a reply for Edmonton Eer :
Yes, weβre sure it was in Edmonton and it was probably installed on the grounds of the Prairies Storm Prediction Centre office at 9250 49th St NW. So, you can use these coordinates: 53.5291 -113.4158.
Not sure if that's gonna be updated on their end, but we can add a line of code to fix the coordinates of that station when updating station data.
Hi @salix-d, I'm back and finally caught up on work so I can take a look at this properly!
Fixing data in weathercan I try really hard not to change data that comes from ECCC. I want weathercan to organize and make it easy for people to get the data, but there can be so many problems with weather data that I think it's really important for issues to be fixed either by ECCC or by the users themselves with their own data. I'm concerned that if we keep making fixes in the package, those fixes may someday become bugs themselves. Because I have only limited time to devote to weathercan (and ECCC keeps me on my toes with changes), I want to keep things as simple as possible (and it's already more complicated than I wanted!)
I love that you've found problems and reported them to ECCC, I think that's the best way forward. Possibly another feature for weathercan could be to have data checks that alert users to potentially problematic data, but that would be a different feature.
Timezones I prefer to use the lutz package because timezones can get quite fiddly in certain areas (northern BC, for example https://en.wikipedia.org/wiki/Time_in_Canada), so coordinate cutoffs won't be very simple π. All the funny timezones calculated are based on
NA
, I've put a fix on the dev branch, thanks for catching that!)Therefore, I prefer to keep the current system for timezones as it does seem to be working and eliminates some of the fiddly business with timezone boundaries.
Lat/Lons I'm so excited that you realized that the Degree/Minute/Second coordinates are sometimes more precise than the decimal degree coordinates! This is something users have been asking for, and I've always had to say we don't have the data. Unless you're talking about another source entirely? i.e. not https://drive.google.com/uc?export=download&id=1HDRnj41YBWpMioLPwAFiLlK4SK8NV72C? You said you made a fix to use exact coordinates, which is great! I made one locally too (before I re-read you comment above :facepalm:).
Identifying continuous stations If you're still interested in implementing a function or two step function for identifying stations to combine I'd love to have your contribution! I'd suggest working off (or merging in) the dev branch as I've put a couple of more recent fixes there. You can include your fix to the coordinates in that pull request (or I can push mine, which ever you prefer).
Somewhat in line with the note above about not changing data, I think I prefer the function to be separate and run specifically by users. Then this feature can be used to help users find good stations, but also gives us the chance to make that more flexible (now or in the future), with things like distance cutoffs (like in station_search()
), which would ultimately allow the users to choose which stations to join. This way they can omit stations they don't trust, or create continuous records even from stations that aren't exactly replacing each other, but which is perhaps good enough for a given study. What do you think?
Finally we might want a way of identifying the best stations for continuous data of a particular type (e.g. temperature data).
I realize that these ideas are changing the original issue I raised from simply identifying stations that replace each other, to creating a tool for identifying the best stations to use to get a continuous data record! Perhaps too much?
I love that you've found problems and reported them to ECCC, I think that's the best way forward. Possibly another feature for weathercan could be to have data checks that alert users to potentially problematic data, but that would be a different feature.
They haven't fix the issues yet, but we can do the same location check they do on their search form to determine if the coordinate are actually in Canada and raise a flag if it's not.
I'm concerned that if we keep making fixes in the package, those fixes may someday become bugs themselves.
That way it's just a flag and doesn't change the data, so we don't have to worry. And if the data get fix, there just won'tbe any flags.
Therefore, I prefer to keep the current system for timezones as it does seem to be working and eliminates some of the fiddly business with timezone boundaries.
Did you manage to make the lutz package work locally? and not like while using the weathercan package? Because I didn't manage to. When running the function line by line to figure what everything did it would always return me 'invalid time-zone' or something like that. So I had to do a work around to just be able to test the functions :/
I'm so excited that you realized that the Degree/Minute/Second coordinates are sometimes more precise than the decimal degree coordinates! This is something users have been asking for, and I've always had to say we don't have the data. Unless you're talking about another source entirely? i.e. not https://drive.google.com/uc?export=download&id=1HDRnj41YBWpMioLPwAFiLlK4SK8NV72C?
But we have the data, it's right in that csv link and that's being fetch by the function, it just gets removed when the data is parsed for some reason.
You said you made a fix to use exact coordinates, which is great! I made one locally too (before I re-read you comment above π€¦).
How did you do the fix without the data then?
I realize that these ideas are changing the original issue I raised from simply identifying stations that replace each other, to creating a tool for identifying the best stations to use to get a continuous data record! Perhaps too much?
I think both feature would serve a purpose!
Flags are a great idea, a good compromise. I think I'm a bit confused by where you're getting the lat/lon information. I look at the CSV file and the lat/lon in decimal degrees (columns Latitude (Decimal Degrees)
and Longitude (Decimal Degrees)
) only have two decimal places (i.e. 48.87
). The other columns (Latitude
and Longitude
) have (sometimes) more precision but need to be converted from degree/min/seconds to decimal degrees.
Are you saying that you can parse the decimal degree columns differently to get more precision?
For lutz, I have no problems:
> lutz::tz_lookup_coords(48.87, -123.28, method = "accurate")
[1] "America/Vancouver"
> packageVersion("lutz")
[1] β0.3.1β
For lutz it's after you get "America/Vancouver" when you try to get the time zone :
lutz::tz_offset(Sys.Date(), "America/Vancouver")
Error: CCTZ: Unrecognized output timezone: "America/Vancouver"
packageVersion("lutz")
[1] β0.3.1β
For the data, I was using the Latitude
and Longitue
columns as is. They usually have more precision, or anyway they had more precision for the cases where the decimal degree with only two decimal was hinting they might be the same station. (As in, it alloed to confirmed they weren't). And for that part , since you just want to know if they are the same between station, you don't really need to convert them for that. The decimal degree is good enough for the time zone.
If people wanted to have them in decimal degree, if there's not already na R function for that, I found the code for it in the source of a webpage converter. I can translate JS into R.
When the next coworking session? Wanna work on that together then?
Ah! I understand the confusion with lutz. I didn't realize that lutz had a tz_offset()
function, weathercan doesn't use that one! The tz_offset()
function in weathercan is an internal function found in R/utils.R
(note that it doesn't have lutz::
before hand).
For lat/lon, we've always had precision to 2 decimal points. You don't see it in the tibble
because of the way tibble truncates data for display (but it's still there). If you convert to data.frame, you can see that the data is there.
> stations() %>% data.frame() %>% head()
prov station_name station_id climate_id WMO_id TC_id lat lon elev tz interval start end
1 AB DAYSLAND 1795 301AR54 NA <NA> 52.87 -112.28 688.8 Etc/GMT+7 day 1908 1922
2 AB DAYSLAND 1795 301AR54 NA <NA> 52.87 -112.28 688.8 Etc/GMT+7 hour NA NA
3 AB DAYSLAND 1795 301AR54 NA <NA> 52.87 -112.28 688.8 Etc/GMT+7 month 1908 1922
4 AB EDMONTON CORONATION 1796 301BK03 NA <NA> 53.57 -113.57 670.6 Etc/GMT+7 day 1978 1979
5 AB EDMONTON CORONATION 1796 301BK03 NA <NA> 53.57 -113.57 670.6 Etc/GMT+7 hour NA NA
6 AB EDMONTON CORONATION 1796 301BK03 NA <NA> 53.57 -113.57 670.6 Etc/GMT+7 month 1978 1979
normals normals_1981_2010 normals_1971_2000
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 FALSE FALSE FALSE
Or is there still something I'm missing?
I'd love to work on this during coworking, but we'd have to wait until September!
I was testing the station_dl function which use station_dl_internal (https://github.com/ropensci/weathercan/blob/master/R/stations.R line 137) because that's where the longetude and latitude get removed and it does use tz_offset()
.
When testing line by line, when I get to 237, I get the error :
Error in `dplyr::mutate()`:
! Problem while computing `tz = purrr::map_chr(.data$tz, ~lutz::tz_offset(.x))`.
Caused by error:
! If dt is a character or a Date, you must supply a time zone
Run `rlang::last_error()` to see where the error occurred.
when running weathercan:::stations_dl()
I weirdly on get this warning :
Problem while computing `tz = purrr::map_chr(.data$tz, ~tz_offset(.x))`. unknown timezone 'NA'
I really don't understand why the function works like that but not when running it line by line...
Yes, but my previous comment still holds: weathercan should be using the internal function tz_offset()
, not lutz::tz_offset()
In your first example, you've changed the code to use lutz::tz_offset()
but it shouldn't use that one. I wrote an internal function that just happened to have the same name (I didn't realize that lutz had the same function name). weathercan's tz_offset()
is here: https://github.com/ropensci/weathercan/blob/5e1cefa42d89dc6fef0690c72e19a2b84df9e83a/R/utils.R#L1-L10
I'm not sure where the second warning comes from, as weathercan's tz_offset()
currently returns Etc/GMT+0
with NA_character_
or an error with any other kind of NA
.
Also, rather than using weathercan:::stations_dl()
, you might have better luck in trouble shooting by using Ctrl-Shift-L (which calls devtools::load_all(".")
). This way it loads all the internal package functions so you can trouble shoot more precisely.
Make sense?
And if it's still giving you issues, let's have a video chat to figure it out!
Yes make sense! I think I skipped a sentence reading your last reply ^^', with the internal function everything runs smoothly :) And the only time zone that's still negative ("Etc/GMT-6") is the one station (LEPREAU) which has coordinates outside of Canada, so it would be flagged anw.
ECCC weather stations are sometimes replaced, meaning that a new station id is created for data coming from the same location. It can be tricky to amalgamate data from these stations.
It would be nice to have a function that acts on downloaded data to identify stations which are potentially paired and could be homogenized.
See also #116