joshualerickson / wildlandhydRo

Tools for working with wildland hydrology.
Other
8 stars 1 forks source link

Error when running `batch_distribution` function #2

Open tungttnguyen opened 1 year ago

tungttnguyen commented 1 year ago

Hello Josh,

I came across your package and really like its functionality. However, I ran into one issue when applying batch_distribution to USGS Gage 15875000. It looks like the error was due to fitdist failed to estimate mle parameters. Is there anyway to specify desired fitting methods as well as distributions when running batch_distribution? Thank you!

Reprex below

library(e1071)
library(PearsonDS)
library(fitdistrplus)
library(evd)
library(MGBT)
library(smwrBase)
library(whitewater)
library(wildlandhydRo)

# 
peak_usgs <- ww_wyUSGS(sites = 15875000) # Error
#> ✔ 'water year' was successfully downloaded.
#> → now starting to gather peak flows using dataRetrieval::readNWISpeak
#> ✔ 15875000 'peak flows' were successfully downloaded.
peak_usgs <- peak_usgs[!is.na(peak_usgs$peak_va), ]

# Low outlier detection using MGMT
low_thresh <- MGBT(peak_usgs$peak_va)
# get distributions
usgs_dist <- batch_distribution(peak_usgs, peak_va)
#> Warning in fitdist(max.x_lp$x, distr = "lpearsonIII", start = list(meanlog =
#> log_mean.x, : The plpearsonIII function should return a zero-length vector when
#> input has length zero
#> <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth,     lower = lower, upper = upper, ...): non-finite finite-difference value [1]>
#> Error in fitdist(max.x_lp$x, distr = "lpearsonIII", start = list(meanlog = log_mean.x, : the function mle failed to estimate the parameters, 
#>                 with the error code 100

Using a different method for fitdist e.g., "mge": 'maximum goodness-of-fit estimation'


annual_max_log <- log10(peak_usgs$peak_va)
m <- mean(annual_max_log)
v <- var(annual_max_log)
s <- sd(annual_max_log)
g <- e1071::skewness(annual_max_log, type = 1)

n <- length(annual_max_log)
g <- g*(sqrt(n*(n-1))/(n-2))*(1+8.5/n)

my.shape <- (2/g)^2
my.scale <- sqrt(v)/sqrt(my.shape)*sign(g) 
my.location <- m-sqrt(v*my.shape)

my.param <- list(shape=my.shape, scale=my.scale, location=my.location)
my.param
#> $shape
#> [1] 3.396992
#> 
#> $scale
#> [1] -0.07242969
#> 
#> $location
#> [1] 4.93522

dPIII <- function(x, shape, location, scale) PearsonDS::dpearsonIII(x, shape, location, scale, log=FALSE)
pPIII <- function(q, shape, location, scale) PearsonDS::ppearsonIII(q, shape, location, scale, lower.tail = TRUE, log.p = FALSE)
qPIII <- function(p, shape, location, scale) PearsonDS::qpearsonIII(p, shape, location, scale, lower.tail = TRUE, log.p = FALSE)

fitPIIImge <- fitdist(annual_max_log,
                      control = list(trace = 1, REPORT = 1),
                      distr   = "PIII",
                      method  = "mge",
                      start   = my.param)
#> Warning in fitdist(annual_max_log, control = list(trace = 1, REPORT = 1), :
#> maximum GOF estimation has a default 'gof' argument set to 'CvM'
#>   Nelder-Mead direct search function minimizer
#> function value for initial parameters = 3.664270
#>   Scaled convergence tolerance is 5.46019e-08
#> Stepsize computed as 0.493522
#> BUILD              4 3.666020 0.037719
#> LO-REDUCTION       6 3.664270 0.037719
#> HI-REDUCTION       8 3.423292 0.037719
#> LO-REDUCTION      10 3.395512 0.037719
#> EXTENSION        491 0.026998 0.026988
#> REFLECTION       493 0.026994 0.026986
#> EXTENSION        495 0.026992 0.026977
#> EXTENSION        497 0.026988 0.026970
#> LO-REDUCTION     499 0.026986 0.026970
#> Exiting from Nelder Mead minimizer
#>     501 function evaluations used
summary(fitPIIImge)
#> Fitting of the distribution ' PIII ' by maximum goodness-of-fit 
#> Parameters : 
#>            estimate
#> shape    13.1823150
#> scale    -0.0377558
#> location  5.6793337
#> Loglikelihood:  7.38231   AIC:  -8.764621   BIC:  -7.570935
plot(fitPIIImge)

Or using a different package fasstr

library(fasstr)
peak_usgs$Measure <-  "Inst. Peaks"
# use default arguments (log-PIII etc.)
# fit_distr_method | Character string identifying the method used to fit the distribution, one of 'MOM' (method of moments) or 'MLE' (maximum likelihood estimation). 
# Selected as 'MOM' if fit_distr ='PIII' (default) or 'MLE' if fit_distr = 'weibull'.

flood_frequency <- compute_frequency_analysis(data = peak_usgs,
                                              events = peak_dt,
                                              values = peak_va,
                                              measures = Measure,
                                              use_max = TRUE)
str(flood_frequency, max.level = 1)
#> List of 5
#>  $ Freq_Analysis_Data   : tibble [11 × 28] (S3: tbl_df/tbl/data.frame)
#>  $ Freq_Plot_Data       : tibble [11 × 30] (S3: tbl_df/tbl/data.frame)
#>  $ Freq_Plot            :List of 9
#>   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
#>  $ Freq_Fitting         :List of 1
#>   ..- attr(*, "split_type")= chr "data.frame"
#>   ..- attr(*, "split_labels")='data.frame':  1 obs. of  1 variable:
#>  $ Freq_Fitted_Quantiles: tibble [11 × 4] (S3: tbl_df/tbl/data.frame)
flood_frequency$Freq_Plot

Created on 2023-10-24 with reprex v2.0.2

joshualerickson commented 1 year ago

Hey @tungttnguyen thanks for submitting this issue! I've really neglected this package and have even thought about deprecating but I keep coming back to it and recently have been using the distribution functionality; however, like you've found it's hard to get the mle to work with pearson type 3 distributions (log as well). It's even harder to abstract the nuances needed when doing analysis and I've been thinking on how to do that but looks like a whole rewrite of the code-base and new package might be the approach.

The package is a kitchen sink of things I was doing a few years back but I've slowly started to branch out (whitewater is a branch of this) and maybe it's time to do the same with the distribution functionality. I'd be open to suggestions and if you'd like to collaborate please let me know :). In the mean time, it looks like the {fasstr} package will only run 'mme' on when 'PIII' is the fit_distr, which then doesn't have any convergence errors and why 'mge' is also working. Since each fitdist() call in {wildlandhydRo} has ... you can pass something like below and it will work. But, this might not be the best parameter estimation for your data per distribution, which leads ... abstraction hard to customize...

# get distributions
usgs_dist <- batch_distribution(peak_usgs, peak_va, 'mge')

The {fasstr} package looks really sweet and really well done! If you're wanting to just use pearson and weibull then I'd recommend going with that or some other workflow you have but maybe in the near future I'll get something going with {whitewater} and {someFreqPackage} to be more user friendly, etc.

Thanks!

tungttnguyen commented 1 year ago

Thank you very much for the detailed response! I was able to make LP3 work by modifying your code to allow customizing for LP3 method only. Like you said, I found that mle worked great most of the time except for LP3 distribution. I am happy to make a PR using the small fix that I made. And I am interested in collaborating so please let me know if you'd like to discuss more

joshualerickson commented 1 year ago

Hey @tungttnguyen, I think I'm going to start a new repository and transfer the distribution functions over to it. However, feel free to make a PR as that will then get transferred in the process. From there, I really want to refactor a lot of the code particularly with the preparation of data in addition to visualizing. fitdistrplus and other packages do all the heavy lifting but maybe this package could help with prepping and bringing together different techniques for parameter estimation.