rOpenGov / fmi

Finnish Meteorological Institute open data API R client
Other
10 stars 7 forks source link

Weather stations #18

Closed ilarischeinin closed 7 years ago

ilarischeinin commented 7 years ago

At the moment, a list of weather stations is included in inst/extdata/weather_stations.csv. I suppose the reason for the inclusion is that these are not available through the API. I have some possible minor contributions in mind, and would like to hear the package authors' views on three points:

Which stations should be included?

Right now, the list contains only stations of type "Sääasemat". This is the most common type, and can be used to retrieve information on e.g. temperature and wind. But other types of stations, such as "Poijut" (waves) and "Mareografit" (sea level) are not included.

I think it could be argued either way whether it should return all station types (for completeness) or only "Sääasemat" (for consistent behavior). One should also note that the function is called fmi_weather_stations(), not fmi_stations().

However, function valid_fmisid() also uses this list to check for the validity of FMISIDs. For example, 134220 is the FMISID of the wave buoy of Northern Baltic Proper, but this function says that it is not a valid FMISID. This use would point towards including all stations (possibly with a new function fmi_stations()).

Whether to include the list in the package, or to scrape live from the web?

The downside of including the list within the package is of course that it can get out-of-date (as it apparently has to some extent). Another option could be to include a function that scrapes it live from the FMI website, something along the lines of:

stations <- XML::readHTMLTable("http://ilmatieteenlaitos.fi/havaintoasemat",
    which=1L, stringsAsFactors=FALSE) %>%
  tbl_df() %>%
  mutate(
    FMISID=FMISID %>% as.integer(),
    LPNN=LPNN %>% as.integer(),
    WMO=WMO %>% as.integer(),
    Lat=Lat %>% sub(pattern=",", replacement=".") %>% as.numeric(),
    Lon=Lon %>% sub(pattern=",", replacement=".") %>% as.numeric(),
    Korkeus=Korkeus %>% sub(pattern="\n.*$", replacement="") %>% as.integer(),
    Alkaen=Alkaen %>% as.integer()
  )

Downsides include the additional package dependency (XML), and of course the fact that just like the included list can get out-of-date, this function can break at any time if FMI changes their website. Naturally, there's also the hybrid option of including one version of the list within the package and also providing a function to download it from the web.

Which language should be used?

The list can of course be in either Finnish or English. When scraping, this would apply to both column names (e.g. "Ryhmät" vs. "Groups") and contents ("Sääasemat" vs. "Weather stations"). Naturally, most (and probably practically all) use for this package is from people who have at least some level of understanding of Finnish. But sill, one could argue for keeping everything in English (like command names and documentation are).

jlehtoma commented 7 years ago

Thanks for your thoughts!

Background

You're right, the reason there is only a copy-pasted CSV included in the package is that the listing is not available through the API. This was a quick-and-dirty solution that can be improved.

Which stations should be included?

Including all sorts of stations, not only weather, stations would make sense. There was never an explicit decision to include only weather stations, but I think that at the time the list was mostly dominated by them. Looking at the current list at http://ilmatieteenlaitos.fi/havaintoasemat there seems to be more variability now. Renaming fmi_weather_stations() to fmi_stations for more inclusive functionality is not an issue at this point.

Whether to include the list in the package, or to scrape live from the web?

Scraping the table of observation stations directly from the web would make sense. I don't think adding a dependency package is that bad, although XML in particular did cause some issues while back (additional dependencies if compiled from source and broken binaries on CRAN). This was few years ago already, so the issues may have been sorted. Should this be an issue, then rvest (which depends on xml2) is another option. Having said that, your implementation above looks good and clean.

If the API changes, then other parts of the fmi package will break too; this is not specific to the station listing. This issue could be mitigated by developing basic tests (see #13 ) that would run against the API. This way, if something changes in the API, we would know (which is good). E.g. rOpenSci has set a cron jobs to daily push packages to TravisCI just so that API tests are run.

Which language should be used?

I agree, although most users are probably Finnish, keeping everything in English is more straight forward. Should the need for using Finnish ever arise, we could address it then.

Feel very welcome to leave another PR with the implementation. In case you're interested, I'm very happy to add you as a collaborator as well.

ilarischeinin commented 7 years ago

Thanks. Pretty much as I thought.

I'll write an implementation for a fmi_stations() function. These are all open to suggestions, but currently my thoughts are to:

  1. Download with rvest/XML (and to list the packages under Suggests so that they are not required).
  2. Cache the result for the rest of the session.
  3. Include all station types, maybe with an optional groups parameter that can be used to subset. That way fmi_weather_stations() could be defined as fmi_stations(groups="Weather stations"), or dropped altogether if deemed unnecessary.
  4. Use English.
  5. Maybe also include a static copy within the package, in case the download fails (due to a redesigned website), or rvest/XML not being installed.

I think that should cover all bases quite well.

Changes in API would surely break things, but I meant that since the station list is not available through the API and would instead be scraped just from the website, I'd imagine the risk of a breaking change is much bigger. That's why I thought point number 5 could be included above.

I can just go through my own fork and a PR, but happy to be listed as a collaborator too. I'll make a PR anyway so that the code gets reviewed.

jlehtoma commented 7 years ago

Excellent, few quick additional comments.

Include all station types, maybe with an optional groups parameter that can be used to subset. That way fmi_weather_stations() could be defined as fmi_stations(groups="Weather stations"), or dropped altogether if deemed unnecessary.

This is up to you, but I would leave the group-based filtering to the user.

Changes in API would surely break things, but I meant that since the station list is not available through the API and would instead be scraped just from the website, I'd imagine the risk of a breaking change is much bigger. That's why I thought point number 5 could be included above.

While I see the logic of not depending on the HTML table (or rvest/XML) being available, this does open a door to potentially conflicting behavior in the case where the HTML table has been updated and static table in the package has not. Coming back to tests, this could be avoided by creating tests that compare the static table to the HTML table. This way we would know when to update the static table (I don't really expect the table being updated that often).

Ideally the station listing too would be available over the API, perhaps this could be tipped to the FMI.

Finally, note that as described in #14 , there is a new high-level API in the makings for the fmi package itself. I think fmi_stations() should be eventually placed in api.R rather than utilities.R, but this is a minor thing.

ilarischeinin commented 7 years ago

This is up to you, but I would leave the group-based filtering to the user.

I agree it could (and maybe should) be left to the user. I was thinking more about the current fmi_weather_stations() and how it could be redefined as fmi_stations(groups="Weather stations"). But if it's fine to remove the old command altogether, that's probably the simplest option.

Coming back to tests, this could be avoided by creating tests that compare the static table to the HTML table.

Good idea. Do you mind if I use testthat for this (and maybe other tests in the future)?

I think fmi_stations() should be eventually placed in api.R rather than utilities.R, but this is a minor thing.

I'll put it in utils.R for now, so that I can branch off master. It's easy to move later anyway.

jlehtoma commented 7 years ago

We don't have to worry about backward compatibility with the function names at this point, so I think renaming the function is fine.

testthat is the way to go, I'm planning to write some tests for the API-functions using as well.

ouzor commented 7 years ago

Great to have you contributing here, Ilari!

Just a quick thought on the function names and combatibility, it would be easy to have the old fmi_weather_stations() that simply calls the new function fmi_stations(groups="Weather stations"), right?

br, Juuso

On Sun, Oct 9, 2016 at 12:07 PM, Joona Lehtomäki notifications@github.com wrote:

We don't have to worry about backward compatibility with the function names at this point, so I think renaming the function is fine.

testthat is the way to go, I'm planning to write some tests for the API-functions using as well.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/rOpenGov/fmi/issues/18#issuecomment-252474401, or mute the thread https://github.com/notifications/unsubscribe-auth/ABK-iq1p8ApabVJ8MCeNSWoDm8-ZbpAbks5qyK68gaJpZM4KQDOK .

jlehtoma commented 7 years ago

Thanks for chiming in, @ouzor !

Just a quick thought on the function names and combatibility, it would be easy to have the old fmi_weather_stations() that simply calls the new function fmi_stations(groups="Weather stations"), right?

Yes, it would be a simple as that, and if you think this is important then by all means. However, from design point of view I think if we do this then such wrappers should be implemented for all station types for consistency (e.g. fmi_tower_stations()). I would put less priority on just retaining the compatibility since the package (version 0.1.13) has no stable API. This doesn't mean, of course, that we shouldn't go for compatibility where possible.

ilarischeinin commented 7 years ago

testthat is the way to go, I'm planning to write some tests for the API-functions using as well.

It seems that right now all tests that have been written are in inst/tests/test.R. Shall I add to that file, or make a new one in e.g. tests/testthat/fmi_stations.R, which I think would be the standard for testthat?

jlehtoma commented 7 years ago

I hadn't even noticed that there are tests in inst (not the original author of the package), good catch! I just added some tests related things in master branch, most notably, tests subdir. I've usually named test files in tests/testthat with test_XX prefix where XX is an numeric ID. This way the order of tests can be controlled if needed, but feel free to do as you like.

antagomir commented 7 years ago

Regarding the comment from @ouzor - why not use the .Deprecated function within the old function for a while? This will produce a warning message that the old function is going to be replaced soon. But I also agree that for this particular pkg the backwards compatibility is not yet such an issue, just replacing the function name would do as well.

ilarischeinin commented 7 years ago

Actually that is exactly what I already did. That way things shouldn't break for anyone (even though they might not even expect the API to be stable), but it also kind of justifies for having no corresponding functions for other groups, like fmi_tower_stations().

At the moment, I have the function return a tibble. Is that ok? Compared to returning a vanilla data.frame it's a small added convenience for dplyr users, and I'd imagine for others it should appear just as a regular data.frame. But is there something I'm overlooking?

I added a download test that should fail if the FMI website changes too much, and another test that compares the local copy within the package to a downloaded one to see if it's still up-to-date. If it needs updating, it also prints out the exact command that was used to save the current version, so there is no need to wonder then what the exact parameters were that were used with write.table().

antagomir commented 7 years ago

I have switched to use to tibble also in other packages recently, I would support that. No major issues so far with that.

jlehtoma commented 7 years ago

At the moment, I have the function return a tibble. Is that ok? Compared to returning a vanilla data.frame it's a small added convenience for dplyr users, and I'd imagine for others it should appear > just as a regular data.frame. But is there something I'm overlooking?

Sure, fine by me. The only thing that probably needs to checked (literally R CMD check) that tibble is imported and functions referred to properly so there are no WARNINGs/NOTEs.

I added a download test that should fail if the FMI website changes too much, and another test that
compares the local copy within the package to a downloaded one to see if it's still up-to-date. If it needs updating, it also prints out the exact command that was used to save the current version, so there is no need to wonder then what the exact parameters were that were used with write.table().

Sounds good!

jlehtoma commented 7 years ago

Discussed functionality implemented in d0d8e8614129864f5e38dec7c0fdef3a610b3e88 , closing this issue.

jlehtoma commented 7 years ago

FYI @ilarischeinin , test "local version is up-to-date" is currently failing with the following statement:

`local_stations` not equal to `downloaded_stations`.
Component “Lon”: Mean relative difference: 0.0003702332
Component “Elevation”: Mean relative difference: 0.01240695

At closer inspection, here are the differing rows for Lon and Elevation respectively:

> local_stations[!local_stations$Lon == downloaded_stations$Lon,]
                   Name FMISID LPNN WMO   Lat   Lon Elevation                 Groups Started
60 Inari parish village 102046 9615  NA 68.91 27.01       128 Precipitation stations    2003

> downloaded_stations[!local_stations$Lon == downloaded_stations$Lon,]
                   Name FMISID LPNN WMO   Lat   Lon Elevation                 Groups Started
60 Inari parish village 102046 9615  NA 68.91 27.02       127 Precipitation stations    2003
> local_stations[!local_stations$Elevation == downloaded_stations$Elevation,]
                    Name FMISID LPNN  WMO   Lat   Lon Elevation                 Groups Started
60  Inari parish village 102046 9615   NA 68.91 27.01       128 Precipitation stations    2003
292      Tampere Härmälä 101124 1222 2763 61.47 23.75        85       Weather stations    1997
293   Tampere Siilinkari 101311 2219 2943 61.52 23.75        98       Weather stations    1989
294     Tampere Tampella 151049 2223 2744 61.50 23.76        92       Weather stations    2012

> downloaded_stations[!local_stations$Elevation == downloaded_stations$Elevation,]
                    Name FMISID LPNN  WMO   Lat   Lon Elevation                 Groups Started
60  Inari parish village 102046 9615   NA 68.91 27.02       127 Precipitation stations    2003
292      Tampere Härmälä 101124 1222 2763 61.47 23.75        84       Weather stations    1997
293   Tampere Siilinkari 101311 2219 2943 61.52 23.75        96       Weather stations    1989
294     Tampere Tampella 151049 2223 2744 61.50 23.76        91       Weather stations    2012

Lo and behold, there are actual differences in the values. I have no idea how the table is generated on the FMI end, but I'll re-create the local CSV as per the informative instructions in the test error message.

ilarischeinin commented 7 years ago

Thanks for the info!

Nice to see that the test works, but I wonder what to think about the actual changes. If it's a one-time thing to improve accuracy, that's great. But if there is something in the system they're retrieved from that can cause the numbers to fluctuate slightly, then not so great.

I guess then the comparisons should be done on a per-column basis, and allow some tolerance for some columns (Lat, Lon, Elevation) and be exact for others. That's a trivial addition anyway. I could even add it right away, but on the other hand the continuous tests provide a nice way to get an idea on which one of the two options is at play here.

Interesting that while the change in other values is just by 1 on the last digit, the Elevation for "Tampere Siilinkari" has changed by 2 meters. So, while the others could be explained by rounding, that one cannot.

jlehtoma commented 7 years ago

I agree, it would be nice to know what's causing this. Rounding was what I had in mind as well, but I couldn't figure out how that would work either. I would wait and see whether this is something that happens irregularly. If so, then better that the test catches that. If it something that happens on a regular basis (e.g. table being automatically regenerated with some [hopefully explainable!] fluctuation in the values) then we can consider setting the tolerance.

ilarischeinin commented 7 years ago

Let's do that.

What I had in mind with rounding was if for some reason it's (incorrectly) done twice. 127.46 rounds to 127, but if for some reason it's first rounded to 127.5, and then subsequently rounded again, that could result in 128. So, a change like this could hypothetically happen if someone noticed that they're rounding twice and removed the extra rounding step. (I don't see how it could've resulted in a change to tie-breaking rules, as there are both odd and even numbers on the local and downloaded sides.)

Anyways, since the 98 -> 96 change is there, rounding only cannot explain it.