ropensci / rnoaa

R interface to many NOAA data APIs
https://docs.ropensci.org/rnoaa
Other
330 stars 84 forks source link

isd_stations_search() is slow #157

Closed lukas-rokka closed 8 years ago

lukas-rokka commented 8 years ago

Using is_stations_search() takes roughly 20 seconds. Which is too long to be useful in production code.

Here is solution that I used my self that takes less than 0.1 second to compute. I gets the nearest stations with a reasonable accuracy, the distance and don't require loading external packages.

library(rnoaa)
library(dplyr)

# Get station list and clean lon/lat
stations <- rnoaa::isd_stations() %>% 
  filter(!(is.na(lat) | is.na(lon) | (lat==0 & lon==0) | abs(lon) > 180 | abs(lat) > 90))

plot(stations %>% select(lon, lat) %>% as.matrix(ncol=2)) # plots a cool world map

loc_to_search <- c(18, 60) # lon, lat coordinates to search for

# Calc. distance, in m,  between loc_to_search and stations coordinates,
# and show 5 nearest stations 
stations %>%
  mutate(lat_rad = lat*pi/180, 
         grid_size_lat = 111132.954 - 559.822*cos(2*lat_rad) +  1.175*cos(4*lat_rad),
         grid_size_lon = grid_size_lat*cos(lat_rad),
         dist=sqrt(((loc_to_search[1] - lon)*grid_size_lon)^2 + ((loc_to_search[2] - lat)*grid_size_lat)^2)) %>%
  select(-lat_rad, -grid_size_lat, -grid_size_lon) %>%
  arrange(dist) %>% slice(1:5)
sckott commented 8 years ago

thanks, having a look ...

sckott commented 8 years ago

@lukas-rokka Can you explain in english what the calculation of grid_size_lat and grid_size_lon are? And would you be able to do a bounding box search with this setup?

lukas-rokka commented 8 years ago

At the equator the size of 1° is the same both in longitude and latitude, roughly 111 km. At 90° latitudes the size of 1° longitude is zero, while the latitude is relatively constant. At 60°, the grid size is roughly 111 km in latitude and 66 km in longitude. I don't remember the reference to the trigonometric calculations, but I think it's an approximation that accounts for earth being a spheroid (not a sphere).

The dist=sqrt(((loc_to_search[1] - lon)*grid_size_lon)^2 + ((loc_to_search[2] - lat)*grid_size_lat)^2)) calculates the distance in meters between two coordination's points. It is fairly accurate as long as the two points are not too far away from each other. It calculates the grid size at the position of the stations.

Here an example that calculates the grid size as the average of the station and the search point, which gives a more accurate result for points far apart.

library(rnoaa)
library(geosphere)
library(dplyr)
library(leaflet)

stations <- rnoaa::isd_stations() %>% 
  filter(!(is.na(lat) | is.na(lon) | (lat==0 & lon==0) | abs(lon) > 180 | abs(lat) > 90))

add_distances <- function(stations, search_lat, search_lon, max_dist) {
  search_lat_rad <- search_lat*pi/180
  grid_size_search_lat <- 111132.954 - 559.822*cos(2*search_lat_rad) +  1.175*cos(4*search_lat_rad) # 1° lat size at search lat, in m
  grid_size_search_lon <- grid_size_search_lat*cos(search_lat_rad) # 1° lon size at search lat, in m
  stations %>%
    mutate(lat_rad = lat*pi/180, 
           grid_size_lat = 111132.954 - 559.822*cos(2*lat_rad) +    1.175*cos(4*lat_rad), # 1°lat size at stations lat
           grid_size_lon = grid_size_lat*cos(lat_rad), # 1° lon size at stations lat
           grid_size_lat = (grid_size_lat + grid_size_search_lat)/2, # mean of search and stations lat
           grid_size_lon = (grid_size_lon + grid_size_search_lon)/2, # mean of search and stations lon
           dist=0.001*sqrt(((search_lon - lon)*grid_size_lon)^2 + ((search_lat - lat)*grid_size_lat)^2)) %>% # the distance in km
    select(-lat_rad, -grid_size_lat, -grid_size_lon) %>%
    filter(dist <= max_dist) %>%
    return()
}

# get data frame of stations in the 200 km radius of the seach lat/lon.  
# Takes less than 0.05 sec on my computer
nearest_st <- stations %>% add_distances(.,60,18,200)

# plot them on map
leaflet() %>%
  addTiles() %>% 
  addCircles(lng = nearest_st$lon,
             lat = nearest_st$lat,
             popup = nearest_st$station_name) %>% 
  clearBounds()

for the bounding box calculation grid sizes shouldn't be an issue? Should be simple as:

# get stations within 50-70 latitudes and 0-20 longitudes
bbox_st <- stations %>% filter(lat>=50 & lat<=70, lon>=0 & lon <=20) 

leaflet() %>% addTiles() %>% 
  addCircles(lng= bbox_st$lon,
             lat = bbox_st$lat,
             popup = bbox_st$station_name) %>% 
  clearBounds()

You don't have an interface to the isd-lite, right? (ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite) I might do an lightweight interface to that, with async downloading, as I need an implementation that gets the most important weather parameters for multi-year periods fairly efficiently. Is there an implementation in the rOpenSci that deals with imputing missing weather parameters? What I want is to be able to download data from a certain station and if there are longer periods of missing data I would like to impute them from nearby stations.

sckott commented 8 years ago

Thanks for the details on the what the calculations do. I can write a draft replacement function with what you've done on a new branch and get your feedback on it?

You don't have an interface to the isd-lite, right?

no, but someone mentioned it here too https://github.com/ropensci/rnoaa/issues/76#issuecomment-104121544 - happy to incorporate here

async

there are async features in the curl package

Is there an implementation in the rOpenSci that deals with imputing missing weather parameters?

No, be interested to add that if it's not a ton of code

lukas-rokka commented 8 years ago

I can write a draft replacement function with what you've done on a new branch and get your feedback on it?

Sure, that would be great.

No, be interested to add that if it's not a ton of code

Well, it could easily end up being "tons of code". But would be a very useful feature. There are some packages (e.g. MICE, hmisc::aregImpute()) that deals with missing values. But I think there are some domain specific issues that can take while to deal with, like which nearby stations that can be used (the nearest ones, does it need to be on same elevation or some other criteria?)

I've a Shiny app, that uses European gridded weather data sources (guess some of them could be interesting for rOpenSci as well, like CAMS for solar radiations data) and turns the data into time series formats useable for building simulations tools. It's for this tool I like to use the isd data.

sckott commented 8 years ago

Sure, that would be great.

okay, will start that

imputing missing weather parameters

Can you open up a new issue for this, and we can continue discussion there

European gridded weather data sources (guess some of them could be interesting for rOpenSci as well, like CAMS for solar radiations data)

This pkg is focusing on NOAA data - I assume those data sources aren't from NOAA? We can definitely start one or more new pkgs though

sckott commented 8 years ago

@lukas-rokka try out the new fxn, reinstall from here and let me know what you think

lukas-rokka commented 8 years ago

@sckott it tested it for functionality and speed, it works great.

A comment of overlapping: there's a function meteo_spherical_distance() (from the latest pull?) in meteo_distance.R that calculates the distance with the haversine forumula (slightly more elegant than my version that projects our round earth to a flat map and then do the distance calc). You can now search the nearest ISD stations with new functionalities from meteo_distance.R as well:

# using meteo_distance.R
isd_stations() %>% 
  rename(latitude=lat, longitude=lon) %>%
  meteo_distance(60, 18, radius = 200) %>% 
  filter(distance < 50))
# Returns the same result as 
isd_stations_search(60, 18, 50) %>%
  arrange(dist)

Both works fine and the performance is the same. But would maybe be cleaner to make use of meteo_distance.R? And not so good with different variable name conventions within the package (latitude=lat, longitude=lon, distance=dist).

sckott commented 8 years ago

@lukas-rokka good point. yes those fxns are from the latest PR. We should keep isd_stations_search, but we could use meteo_distance internally within that fxn to do the actual work. sound good?

lukas-rokka commented 8 years ago

It's up to you @sckott, and what you think gives more maintainable code. From an user perspective it doesn't really matter how it works internally as long it works. But good to keep the isd_stations_search as it will come up when you start typing isd_

sckott commented 8 years ago

@lukas-rokka made the changes, reinstall and try again