Closed dgustafsson closed 2 years ago
Interesting, thanks David! I made one, too (not completely finished yet). :) I'll compare and comment here.
David, can you attach an example simass.txt file here? I work with an older version (which fails with your function) and want to make the function version-safe rightaway.
I couldn't import my example simass.txt files with your functions, and continued with my version instead. Attribute names differ a little, and I am not quite sure how your data frame looks like after import (did not go through your code line-by-line).
ReadSimass() now yields a data frame with this structure (example file with two evaluated variable pairs)
'data.frame': 18 obs. of 3 variables:
$ NAME : chr "Regional NSE" "Regional RA" "Regional RE" "Regional MAE" ...
$ RET1.CCT1: num -0.656 NA NA 5.405 -0.656 ...
$ ROUT.COUT: num -0.632 NA 0.403 1.151 -0.632 ...
- attr(*, "variable.obs")= chr [1:2] "RET1" "ROUT"
- attr(*, "variable.sim")= chr [1:2] "CCT1" "COUT"
- attr(*, "agg.period")= num [1:2] 1 1
- attr(*, "n.simulation")= num 1
- attr(*, "crit.total")= num 1.29
- attr(*, "crit.conditional")= num 0
- attr(*, "threshold")= num 0
The function is in master now, I also paste the code here for quick testing:
ReadSimass <- function(filename = "simass.txt") {
# file header with simulation info
fhead <- readLines(con = filename, n = 5)
# determine number of header rows
if(nchar(gsub(" ", "", fhead[5])) == 0) {
n.head <- 5
} else {
n.head <- 4
}
# import data content
te <- read.table(file = filename, sep = ":", fill = T, skip = n.head, header = F, stringsAsFactors = F, blank.lines.skip = F)
# calculate number of performance measures and variable pairs for which performance measures are present (different numbers of
# perf meas possible, depending on HYPE version)
# conditional: treat case with only 1 variable pair separately (diff() does not work then)
te.perf <- which(gsub(pattern = "[[:blank:]]", replacement = "", x = te[, 1]) == "Variables")
if (length(te.perf) == 1) {
n.perf <- nrow(te)
} else {
n.perf <- diff(te.perf)[1]
}
n.var <- nrow(te) / n.perf
# extract performance measures for each variable pair into data frame
res <- data.frame(matrix(unlist(tapply(te[, 2], rep(1:n.var, each = n.perf),
function(x) {te <- as.numeric(c(x[3:(n.perf - 1)])); ifelse(te == -9999, NA, te)})),
nrow = n.perf - 3, byrow = F))
# add performance measure name as first column, strip from superfluous space characters along the way
res <- cbind(NAME = gsub(pattern = "^ *|(?<= ) | *$", replacement = "", te[, 1][3:(n.perf - 1)], perl = TRUE), res)
# move data numbers for regional and mean/median criteria to attribute vectors, if they exist
if (res[nrow(res) - 1, 1] == "Number of data for regional criterion") {
attr(res, "n.data.regional") <- as.numeric(res[nrow(res) - 1, -1])
res <- res[-(nrow(res) - 1), ]
}
if (res[nrow(res), 1] == "Number of areas in mean/median criterion") {
attr(res, "n.area.mean") <- as.numeric(res[nrow(res), -1])
res <- res[-(nrow(res)), ]
}
# add HYPE variables as attributes and column names
hvar <- strsplit(gsub(pattern = "^ *", replacement = "", te[, 2][seq(from = 1, length.out = n.var, by = n.perf)]), ", ")
hvar <- matrix(toupper(unlist(hvar)), ncol = 2, byrow = T)
attr(res, "variable.obs") <- hvar[, 1]
attr(res, "variable.sim") <- hvar[, 2]
names(res)[-1] <- apply(hvar, 1, paste, collapse = ".")
## add aggregation periods as attributes, with conversion of letter code to standard HYPE numeric codes (as used in info.txt)
# extract information
agg.period <- te[, 2][seq(from = 2, length.out = n.var, by = n.perf)]
# conditional: check if periods are given as numeric, and if not, try to convert from predefined character codes,
# as last option return whatever was found in file and return a warning
if (any(is.na(suppressWarnings(as.numeric(agg.period))))) {
# periods are coded as characters, try to convert to numeric code
te.agg <- gsub(pattern = "^ *", replacement = "", agg.period)
if (all(te.agg %in% c("DD", "WW", "MM", "YY"))) {
te.agg <- factor(te.agg, levels = c("DD", "WW", "MM", "YY"))
levels(te.agg) <- 1:4
te.agg <- as.numeric(te.agg)
attr(res, "agg.period") <- te.agg
} else {
attr(res, "agg.period") <- te.agg
warning("Aggregation period code unknown. Returned as imported in attribute 'agg.period'.")
}
} else {
# periods are coded numerically, just return the numbers
attr(res, "agg.period") <- as.numeric(agg.period)
}
## add general simulation attributes from simass.txt file header
# simulation number (in case of monte carlo results)
attr(res, "n.simulation") <- as.numeric(substr(fhead[2], 20, 100))
# conditional on number of header rows
if (n.head == 5) {
te.head <- as.numeric(na.omit(suppressWarnings(as.numeric(strsplit(paste(fhead[3:4], collapse = " "), split = " +|,")[[1]]))))
} else {
te.head <- as.numeric(na.omit(suppressWarnings(as.numeric(strsplit(fhead[3], split = " +|,")[[1]]))))
}
# total criteria value
attr(res, "crit.total") <- te.head[1]
# conditional criteria value
attr(res, "crit.conditional") <- te.head[2]
# threshold
attr(res, "threshold") <- te.head[3]
return(res)
}
@dgustafsson Closing for now, please re-open if it does not work for you.
hi Rene, I was busy using my function on some work so couldnt follow here. Good that you made your own version. Lets continue with this one if it manage to read from different versions of simass.
Attached is an update on my function (the initial one had some bug in it) and a simass that it can read.
I tried to used the HYPE criteria abbreviations (the ones you normally find in allsim.txt) as column names in the dataframe, and provided a columns-key with the abbreviation and the full write out definition as given in Simass.txt as one of the attributes.
Otherwise I think its pretty much the same. /David readsimass.txt simass.txt
actually, I think our data frames are transposed. Which one is more useful can be discussed, maybe we could have both methods as option? In your case you have the variable-pair as columns names, and provide the criteria definition in the first NAME column. So it might be some work to findout in which row-column a specific criteria is located for a specific evaluation. In my case, I use the HYPE criteria abbreviations as columns names, and put one line for each criteria. So I can directly read out a specific criteria from a given column. But I still need to check on which row I have a specific evaluation variable pair.
Moreover, my function is not general as it is now - I just made it work on the files I work on for the moment.
Why do you "move data numbers for regional and mean/median criteria to attribute vectors, if they exist" - they could as well be in the data frame for easier access.
'data.frame': 10 obs. of 27 variables:
$ CRIT : num 0.71488 -0.6547 -0.00248 -0.58081 0.32692 ...
$ rvar : chr "rgqrin" "xom0" "xom1" "xom2" ...
$ cvar : chr "rgqrin" "xom0" "xom1" "xom2" ...
$ period : chr "DD" "DD" "DD" "DD" ...
$ RR2 : num 0.7139 0.2618 0.0743 -0.2736 0.1986 ...
$ RRA : num -9999 -9999 -9999 -9999 -9999 ...
$ RRE : num -0.0734 -0.1276 -0.2119 -0.0176 -0.1266 ...
$ RMAE : num 16 142 241 198 210 ...
$ MR2 : num 7.14e-01 -1.59e+06 -1.36e+06 -1.01e+08 -6.19e+06 ...
$ MRA : num -9999 -9999 -9999 -9999 -9999 ...
$ MRE : num -0.0734 -0.09 -0.0974 0.106 -0.0317 ...
$ MRS : num -0.231 1.409 1.614 1.616 1.748 ...
$ MCC : num 0.85 0.494 0.649 0.324 0.784 ...
$ MAR : num 0.0734 0.1853 0.2394 0.3238 0.2316 ...
$ AKG : num 0.715 -1.01 -1.066 -1.289 -1.042 ...
$ ASCKG : num 0.5563 -0.0642 0.0309 -0.1283 0.0656 ...
$ SR2 : num -9999 -9999 -9999 -9999 -9999 ...
$ SRA : num -9999 -9999 -9999 -9999 -9999 ...
$ SRE : num -9999 -9999 -9999 -9999 -9999 ...
$ TAU : num -9999 -9999 -9999 -9999 -9999 ...
$ MD2 : num -9999 -9999 -9999 -9999 -9999 ...
$ MDA : num -9999 -9999 -9999 -9999 -9999 ...
$ MKG : num 0.714 -15.961 -29.733 -26.346 -4.828 ...
$ MNR : num -9999 -9999 -9999 -9999 -9999 ...
$ MNW : num 0.71488 -0.6547 -0.00248 -0.58081 0.32692 ...
$ NDATREG: num 701 53 58 36 55 ...
$ NSUBME : num 1 27 32 18 28 20 19 21 5 496
- attr(*, "simulation.number")= num 1
- attr(*, "total.criteria")= num 0.532
- attr(*, "number.of.criterions")= num 10
- attr(*, "individual.criterions")= num 0.71488 -0.6547 -0.00248 -0.58081 0.32692 ...
- attr(*, "column.key")='data.frame': 29 obs. of 2 variables:
..$ short: chr "CRIT" "rvar" "cvar" "period" ...
..$ long : chr "Criteria value" "observed variable" "simulated variable" "aggregation period" ...
Why do you "move data numbers for regional and mean/median criteria to attribute vectors, if they exist" - they could as well be in the data frame for easier access.
Formal reasons only: they aren't criteria. But I have no strong opinion on this, and can leave them in the dataframe for faster access. Comes with next update...
actually, I think our data frames are transposed. Which one is more useful can be discussed, maybe we could have both methods as option? In your case you have the variable-pair as columns names, and provide the criteria definition in the first NAME column. So it might be some work to findout in which row-column a specific criteria is located for a specific evaluation. In my case, I use the HYPE criteria abbreviations as columns names, and put one line for each criteria. So I can directly read out a specific criteria from a given column. But I still need to check on which row I have a specific evaluation variable pair.
Moreover, my function is not general as it is now - I just made it work on the files I work on for the moment.
One advantage with your sorting is that it is comparable to subass dataframes, just with row-wise variable pairs instead of subbasins. Since you are the only user so far, I will adapt the function to your structure.
Btw, do you pull from master, or do you use a release version of HYPEtools? Just asking to see if you can test the packaged function immediately.
Your simass has a header part "Individual criterions" which ReadSimass does not handle because I didn't know it existed. The HYPE wiki pages should be a bit more detailed for this one.
Yes, Lotta makes her best to keep the wiki up-to-date with the HYPE code - maybe this new header is just in the trunk now and not yet in a release version.
I usually install release version of the HYPEtools - but I could try to pull from the master.
also, please note that I had to make up some of the criteria abbreviations myself, since they are currently only in the simass and not in the allsim. But most of them are there already.
also, please note that I had to make up some of the criteria abbreviations myself, since they are currently only in the simass and not in the allsim. But most of them are there already.
I'll just copy your list... :) :+1:
I usually install release version of the HYPEtools - but I could try to pull from the master.
Yes, please!
Function updated. @dgustafsson please check and close this issue if it works for you (take also a look at ?ReadSimass
). Note that some names are slightly different, and I chose to only add the long names as attribute, since the short names are already the in there as column names.
> library(HYPEtools)
> te <- ReadSimass(filename = "simass_david.txt")
> te; str(te)
VAR.OBS VAR.SIM PERIOD CRIT RR2 RRA RRE RMAE MR2 MRA MRE MRS MCC MAR
1 RGQRIN RGQCIN 1 0.864076 0.7912192 NA -0.0003572253 12.4628900 7.912192e-01 NA -0.0003572253 -0.07986564 0.8900149 0.0003572253
2 XOM0 SNOW 1 0.042483 -0.4324844 NA 0.0894796400 197.2580000 -8.702523e+05 NA 0.1416273000 0.94831880 0.6801058 0.2692583000
3 XOM1 SNOW 1 -0.706594 -0.1238468 NA 0.0300026440 251.2287000 -3.174128e+06 NA 0.1765938000 2.11882000 0.5199575 0.3620609000
4 XOM2 SNOW 1 0.028393 -2.5348170 NA 0.3168145000 307.7719000 -7.523222e+06 NA 0.4561169000 0.45876510 0.7805144 0.5986000000
5 XOM3 SNOW 1 -0.793188 -1.1334970 NA 0.2943464000 303.8858000 -1.060366e+07 NA 0.4418611000 2.13051200 0.1121179 0.5308743000
6 XOM4 SNOW 1 -0.758312 -1.0691030 NA 0.7006303000 496.1250000 -1.279592e+06 NA 1.7223780000 2.39803700 0.5797362 1.7978190000
7 XOM5 SNOW 1 0.352031 0.3068787 NA 0.0302929580 152.0165000 -2.176184e+06 NA 0.1311511000 1.32106300 1.0000010 0.2775741000
8 XOM6 SNOW 1 -0.107132 -0.6741910 NA -0.2224605000 207.6275000 -1.079606e+07 NA -0.0314483610 1.94134700 0.2888376 0.3788203000
9 RSNW SDEP 1 -0.315323 -0.5683091 NA 0.9155864000 27.1072400 -1.298777e+00 NA 1.8147230000 0.88713530 0.8348566 1.8147230000
10 RFSC CFSC 1 0.877641 0.7816473 NA -0.0905570240 0.1031572 7.012007e-01 NA -0.0804604070 -0.01211407 0.9053226 0.1005598000
AKG ASCKG SR2 SRA SRE SBIAS SRMSE TAU MD2 MDA MKG MNR MNW NDATREG NSUBME
1 0.8640758 0.76068100 NA NA NA NA NA NA 0.7912192 NA 0.86407580 0.06209185 0.7912193 730 1
2 -0.5090607 -0.00758864 NA NA NA NA NA NA -2.8428130 NA 0.04248300 0.28092030 -3.8319060 61 31
3 -1.6216440 -0.10323430 NA NA NA NA NA NA -6.5355150 NA -0.70659380 0.34778230 -19.5315200 63 31
4 -0.3509475 0.06487172 NA NA NA NA NA NA -3.7858330 NA 0.02839289 0.33119350 -0.1582873 38 20
5 -1.7820260 -0.26951890 NA NA NA NA NA NA -25.3475600 NA -0.79318770 0.47767260 -20.7095400 58 30
6 -2.2894860 -0.22568740 NA NA NA NA NA NA -9.3490020 NA -0.75831190 0.50588980 -19.3249400 49 25
7 -0.5387785 0.15672500 NA NA NA NA NA NA -0.4864217 NA 0.35203140 0.23205310 -7.6148040 56 29
8 -1.8229060 -0.01845478 NA NA NA NA NA NA -4.6886790 NA -0.10713190 0.25110890 -31.8774000 53 25
9 -1.0622980 -0.21341400 NA NA NA NA NA NA -0.9852419 NA -0.31532300 0.36677150 -0.6400305 2753 5
10 0.8475954 0.75212010 NA NA NA NA NA NA 0.8211232 NA 0.87764060 0.18951570 0.8110875 143582 496
'data.frame': 10 obs. of 29 variables:
$ VAR.OBS: chr "RGQRIN" "XOM0" "XOM1" "XOM2" ...
$ VAR.SIM: chr "RGQCIN" "SNOW" "SNOW" "SNOW" ...
$ PERIOD : num 1 1 1 1 1 1 1 1 1 1
$ CRIT : num 0.8641 0.0425 -0.7066 0.0284 -0.7932 ...
$ RR2 : num 0.791 -0.432 -0.124 -2.535 -1.133 ...
$ RRA : num NA NA NA NA NA NA NA NA NA NA
$ RRE : num -0.000357 0.08948 0.030003 0.316814 0.294346 ...
$ RMAE : num 12.5 197.3 251.2 307.8 303.9 ...
$ MR2 : num 7.91e-01 -8.70e+05 -3.17e+06 -7.52e+06 -1.06e+07 ...
$ MRA : num NA NA NA NA NA NA NA NA NA NA
$ MRE : num -0.000357 0.141627 0.176594 0.456117 0.441861 ...
$ MRS : num -0.0799 0.9483 2.1188 0.4588 2.1305 ...
$ MCC : num 0.89 0.68 0.52 0.781 0.112 ...
$ MAR : num 0.000357 0.269258 0.362061 0.5986 0.530874 ...
$ AKG : num 0.864 -0.509 -1.622 -0.351 -1.782 ...
$ ASCKG : num 0.76068 -0.00759 -0.10323 0.06487 -0.26952 ...
$ SR2 : num NA NA NA NA NA NA NA NA NA NA
$ SRA : num NA NA NA NA NA NA NA NA NA NA
$ SRE : num NA NA NA NA NA NA NA NA NA NA
$ SBIAS : num NA NA NA NA NA NA NA NA NA NA
$ SRMSE : num NA NA NA NA NA NA NA NA NA NA
$ TAU : num NA NA NA NA NA NA NA NA NA NA
$ MD2 : num 0.791 -2.843 -6.536 -3.786 -25.348 ...
$ MDA : num NA NA NA NA NA NA NA NA NA NA
$ MKG : num 0.8641 0.0425 -0.7066 0.0284 -0.7932 ...
$ MNR : num 0.0621 0.2809 0.3478 0.3312 0.4777 ...
$ MNW : num 0.791 -3.832 -19.532 -0.158 -20.71 ...
$ NDATREG: num 730 61 63 38 58 ...
$ NSUBME : num 1 31 31 20 30 25 29 25 5 496
- attr(*, "names.long")= chr [1:29] "Observed variable" "Simulated variable" "Aggregation period" "Individual criterion" ...
- attr(*, "n.simulation")= num 1
- attr(*, "crit.total")= num 0.516
- attr(*, "crit.conditional")= num 0
- attr(*, "threshold")= num 0
I made a readsimass function. It probably only works for the current simass format, but it can be developed to be more general for older and future formats.
readsimass.txt