ropensci / tidyhydat

An R package to import Water Survey of Canada hydrometric data and make it tidy
https://docs.ropensci.org/tidyhydat
Apache License 2.0
70 stars 19 forks source link

How to fill in gaps for the monthly flow values #118

Closed rajibshibly closed 5 years ago

rajibshibly commented 5 years ago

This is not a bug but a general issue and I was wondering if anyone have any thought or idea about how to deal with it.

When retrieving monthly flow values for some of the Newfoundland and Labrador stations, I found many stations consist gaps in monthly flow values. An example is attached. I was wondering if anyone can forward me an R function that is appropriate for filling time series monthly missing values.

Thanks in advance. monthly_flow_gap_values_hydat

rywhale commented 5 years ago

Strange that some of this 'missing' data is present if you get at it via the WSC historic data portal.

For instance, 02YA002 is not missing data on the website for these months

I don't have hydat on this machine but can investigate further soon.

In the mean time, @rajibshibly can you post a reproducible example (code snippet) for getting the table you uploaded?

rajibshibly commented 5 years ago

At first I tried the hy_monthly_flows function for e.g.:


monthly_flows_03OC003 <- hy_monthly_flows(station_number ="03OC003", start_date = "1998-10-01", end_date = "1998-12-31")

I ended up with the error: Error in UseMethod("escape") : no applicable method for 'escape' applied to an object of class "function"

Please find the code below:

#Function which is oppostive of %in%
'%!in%' <- function(x,y)!('%in%'(x,y))

#NL_stns stores all hydat stations which are active and from the province of NL
NL_Stns_Active <- hy_stations() %>%
  filter(HYD_STATUS == "ACTIVE") %>%
  filter(PROV_TERR_STATE_LOC == "NL")

#All stations which are regulated
All_Stns_Reg <- hy_stn_regulation()%>%
  filter(REGULATED == "TRUE")

#Filtering all NL regulated stations using %in% keyword
NL_Stns_Reg <- NL_Stns_Active[NL_Stns_Active$STATION_NUMBER %in% All_Stns_Reg$STATION_NUMBER,]

#Filtering all active and unregulated stations using the function %!in%
NL_Stns_Act_Unreg <- NL_Stns_Active[NL_Stns_Active$STATION_NUMBER %!in% NL_Stns_Reg$STATION_NUMBER,]

#Retrieve date range and number of years of data for active and unregulated flow stations
NL_Stns_Range <- hy_stn_data_range(station_number = NL_Stns_Act_Unreg$STATION_NUMBER)%>%
  filter(DATA_TYPE == "Q")

#Merge the Active and Unregulated station with the date range
NL_Stns <- merge(NL_Stns_Act_Unreg, NL_Stns_Range, by.x= "STATION_NUMBER")

#Those station which does not have elevation information
Left_Out_Stns <- NL_Stns[NL_Stns$STATION_NUMBER %!in% NL_Flow$STATION_NUMBER,]

#Find the row number for Rattling Brook Below Plant Discharge
rowNum <- which(NL_Stns$STATION_NUMBER == "02ZK008")

#Add the gross drainage area for Rattling Brook Below Plant Discharge
NL_Stns$DRAINAGE_AREA_GROSS[rowNum] = 36.082

#Select certain fields for the NL_Stns
NL_Stns <- select(NL_Stns, "STATION_NUMBER", "STATION_NAME", "PROV_TERR_STATE_LOC", "LATITUDE", "LONGITUDE", "DRAINAGE_AREA_GROSS", "Year_from", "Year_to", "RECORD_LENGTH")

#Seperate between NF and Labrador Stations
for (row in 1:nrow(NL_Stns)){
  if (substr(NL_Stns$STATION_NUMBER[row], 1, 3) == "02X"){NL_Stns$PROV_TERR_STATE_LOC[row]="LAB"
  }else if (substr(NL_Stns$STATION_NUMBER[row], 1, 2) == "03"){NL_Stns$PROV_TERR_STATE_LOC[row]="LAB"
  }else{NL_Stns$PROV_TERR_STATE_LOC[row]="NF"}
}

#Only keep rows which have RECORD_LENGTH of 17 years or greater.
NL_Stns <- filter (NL_Stns, RECORD_LENGTH >= 17)

NL_Stns_daily_flows <- hy_daily_flows(NL_Stns$STATION_NUMBER, start_date = "1996-01-01", end_date = "2015-12-31")

#Removing the NA values i.e. blank values
NL_Stns_daily_flows <- drop_na(NL_Stns_daily_flows)

NL_Stns_daily_flows$Month <- Month(NL_Stns_daily_flows$Date)

NL_Stns_daily_flows$Year <- Year(NL_Stns_daily_flows$Date)

#  Aggregate on months and year and get monthly mean temperature
Mean_Monthly_Flow_NL <- NL_Stns_daily_flows %>%
  #group_by groups the data by month and year
  group_by(STATION_NUMBER, Year, Month) %>%

  mutate(grouped_id = row_number()) %>%
  #summarise computes the mean monthly temperature
  summarise(mean_flow = mean(Value, na.rm = TRUE))

#Mean_Monthly_Flow_NL <- select(Mean_Monthly_Flow_NL, "STATION_NUMBER", "mean_flow", "Year") 

#Mean_Monthly_Flow_NL_Count <- count(Mean_Monthly_Flow_NL, "Year")

#Put the data into wide format
Mean_Monthly_Flow_Wide <- spread(Mean_Monthly_Flow_NL, STATION_NUMBER,  mean_flow)
rywhale commented 5 years ago

@rajibshibly Looks to be an issue with however you munged the data. If I grab monthly means specifically for 02YA002 the issue doesn't exist...

library(tidyhydat)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

bartletts_mean <- hy_monthly_flows(
  station_number = "02YA002"
) %>%
  filter(
    Sum_stat == "MEAN"
  )
#> No start and end dates specified. All dates available will be returned.
#> All station successfully retrieved

sum(is.na(bartletts_mean$Value))
#> [1] 0

bartletts_mean %>%
  filter(Year == 1996)
#> # A tibble: 12 x 8
#>    STATION_NUMBER  Year Month Full_Month No_days Sum_stat Value
#>    <chr>          <int> <int> <lgl>        <int> <chr>    <dbl>
#>  1 02YA002         1996     1 TRUE            31 MEAN     0.409
#>  2 02YA002         1996     2 TRUE            29 MEAN     0.258
#>  3 02YA002         1996     3 TRUE            31 MEAN     0.546
#>  4 02YA002         1996     4 TRUE            30 MEAN     1.74 
#>  5 02YA002         1996     5 TRUE            31 MEAN     4.29 
#>  6 02YA002         1996     6 TRUE            30 MEAN     1.88 
#>  7 02YA002         1996     7 TRUE            31 MEAN     2.17 
#>  8 02YA002         1996     8 TRUE            31 MEAN     0.438
#>  9 02YA002         1996     9 TRUE            30 MEAN     0.504
#> 10 02YA002         1996    10 TRUE            31 MEAN     1.64 
#> 11 02YA002         1996    11 TRUE            30 MEAN     1.80 
#> 12 02YA002         1996    12 TRUE            31 MEAN     1.91 
#> # ... with 1 more variable: Date_occurred <date>

Created on 2019-01-15 by the reprex package (v0.2.1)

rajibshibly commented 5 years ago

Not sure what munge means but looks like this works! ... I shall give this a try...

Thanks