ReadSimass - Githubissues

dgustafsson commented 3 years ago

I made a readsimass function. It probably only works for the current simass format, but it can be developed to be more general for older and future formats.

readsimass.txt

rcapell commented 3 years ago

Interesting, thanks David! I made one, too (not completely finished yet). :) I'll compare and comment here.

rcapell commented 3 years ago

David, can you attach an example simass.txt file here? I work with an older version (which fails with your function) and want to make the function version-safe rightaway.

rcapell commented 3 years ago

I couldn't import my example simass.txt files with your functions, and continued with my version instead. Attribute names differ a little, and I am not quite sure how your data frame looks like after import (did not go through your code line-by-line).

ReadSimass() now yields a data frame with this structure (example file with two evaluated variable pairs)

'data.frame':   18 obs. of  3 variables:
 $ NAME     : chr  "Regional NSE" "Regional RA" "Regional RE" "Regional MAE" ...
 $ RET1.CCT1: num  -0.656 NA NA 5.405 -0.656 ...
 $ ROUT.COUT: num  -0.632 NA 0.403 1.151 -0.632 ...
 - attr(*, "variable.obs")= chr [1:2] "RET1" "ROUT"
 - attr(*, "variable.sim")= chr [1:2] "CCT1" "COUT"
 - attr(*, "agg.period")= num [1:2] 1 1
 - attr(*, "n.simulation")= num 1
 - attr(*, "crit.total")= num 1.29
 - attr(*, "crit.conditional")= num 0
 - attr(*, "threshold")= num 0

The function is in master now, I also paste the code here for quick testing:

ReadSimass <- function(filename = "simass.txt") {

  # file header with simulation info
  fhead <- readLines(con = filename, n = 5)

  # determine number of header rows
  if(nchar(gsub(" ", "", fhead[5])) == 0) {
    n.head <- 5
  } else {
    n.head <- 4
  }

  # import data content
  te <- read.table(file = filename, sep = ":", fill = T, skip = n.head, header = F, stringsAsFactors = F, blank.lines.skip = F)

  # calculate number of performance measures and variable pairs for which performance measures are present (different numbers of 
  # perf meas possible, depending on HYPE version)
  # conditional: treat case with only 1 variable pair separately (diff() does not work then)
  te.perf <- which(gsub(pattern = "[[:blank:]]", replacement = "", x = te[, 1]) == "Variables")
  if (length(te.perf) == 1) {
    n.perf <- nrow(te)
  } else {
    n.perf <- diff(te.perf)[1]
  }
  n.var <- nrow(te) / n.perf

  # extract performance measures for each variable pair into data frame
  res <- data.frame(matrix(unlist(tapply(te[, 2], rep(1:n.var, each = n.perf), 
                                         function(x) {te <- as.numeric(c(x[3:(n.perf - 1)])); ifelse(te == -9999, NA, te)})), 
                           nrow = n.perf - 3, byrow = F))

  # add performance measure name as first column, strip from superfluous space characters along the way
  res <- cbind(NAME = gsub(pattern = "^ *|(?<= ) | *$", replacement = "", te[, 1][3:(n.perf - 1)], perl = TRUE), res)

  # move data numbers for regional and mean/median criteria to attribute vectors, if they exist
  if (res[nrow(res) - 1, 1] == "Number of data for regional criterion") {
    attr(res, "n.data.regional") <- as.numeric(res[nrow(res) - 1, -1])
    res <- res[-(nrow(res) - 1), ]
  }

  if (res[nrow(res), 1] == "Number of areas in mean/median criterion") {
    attr(res, "n.area.mean") <- as.numeric(res[nrow(res), -1])
    res <- res[-(nrow(res)), ]
  }

  # add HYPE variables as attributes and column names
  hvar <- strsplit(gsub(pattern = "^ *", replacement = "", te[, 2][seq(from = 1, length.out = n.var, by = n.perf)]), ", ")
  hvar <- matrix(toupper(unlist(hvar)), ncol = 2, byrow = T)
  attr(res, "variable.obs") <- hvar[, 1]
  attr(res, "variable.sim") <- hvar[, 2]
  names(res)[-1] <- apply(hvar, 1, paste, collapse = ".")

  ## add aggregation periods as attributes, with conversion of letter code to standard HYPE numeric codes (as used in info.txt)

  # extract information
  agg.period <- te[, 2][seq(from = 2, length.out = n.var, by = n.perf)]

  # conditional: check if periods are given as numeric, and if not, try to convert from predefined character codes, 
  #              as last option return whatever was found in file and return a warning

  if (any(is.na(suppressWarnings(as.numeric(agg.period))))) {

    # periods are coded as characters, try to convert to numeric code
    te.agg <- gsub(pattern = "^ *", replacement = "", agg.period)
    if (all(te.agg %in% c("DD", "WW", "MM", "YY"))) {
      te.agg <- factor(te.agg, levels = c("DD", "WW", "MM", "YY"))
      levels(te.agg) <- 1:4
      te.agg <- as.numeric(te.agg)
      attr(res, "agg.period") <- te.agg
    } else {
      attr(res, "agg.period") <- te.agg
      warning("Aggregation period code unknown. Returned as imported in attribute 'agg.period'.")
    }
  } else {
    # periods are coded numerically, just return the numbers
    attr(res, "agg.period") <- as.numeric(agg.period)
  }

  ## add general simulation attributes from simass.txt file header

  # simulation number (in case of monte carlo results)
  attr(res, "n.simulation") <- as.numeric(substr(fhead[2], 20, 100))

  # conditional on number of header rows
  if (n.head == 5) {
    te.head <- as.numeric(na.omit(suppressWarnings(as.numeric(strsplit(paste(fhead[3:4], collapse = " "), split = " +|,")[[1]]))))
  } else {
    te.head <- as.numeric(na.omit(suppressWarnings(as.numeric(strsplit(fhead[3], split = " +|,")[[1]]))))
  }

  # total criteria value
  attr(res, "crit.total") <- te.head[1]
  # conditional criteria value
  attr(res, "crit.conditional") <- te.head[2]
  # threshold
  attr(res, "threshold") <- te.head[3]

  return(res)

}

rcapell commented 3 years ago

@dgustafsson Closing for now, please re-open if it does not work for you.

dgustafsson commented 3 years ago

hi Rene, I was busy using my function on some work so couldnt follow here. Good that you made your own version. Lets continue with this one if it manage to read from different versions of simass.

Attached is an update on my function (the initial one had some bug in it) and a simass that it can read.

I tried to used the HYPE criteria abbreviations (the ones you normally find in allsim.txt) as column names in the dataframe, and provided a columns-key with the abbreviation and the full write out definition as given in Simass.txt as one of the attributes.

Otherwise I think its pretty much the same. /David readsimass.txt simass.txt

dgustafsson commented 3 years ago

actually, I think our data frames are transposed. Which one is more useful can be discussed, maybe we could have both methods as option? In your case you have the variable-pair as columns names, and provide the criteria definition in the first NAME column. So it might be some work to findout in which row-column a specific criteria is located for a specific evaluation. In my case, I use the HYPE criteria abbreviations as columns names, and put one line for each criteria. So I can directly read out a specific criteria from a given column. But I still need to check on which row I have a specific evaluation variable pair.

Moreover, my function is not general as it is now - I just made it work on the files I work on for the moment.

dgustafsson commented 3 years ago

Why do you "move data numbers for regional and mean/median criteria to attribute vectors, if they exist" - they could as well be in the data frame for easier access.

dgustafsson commented 3 years ago

'data.frame':   10 obs. of  27 variables:
 $ CRIT   : num  0.71488 -0.6547 -0.00248 -0.58081 0.32692 ...
 $ rvar   : chr  "rgqrin" "xom0" "xom1" "xom2" ...
 $ cvar   : chr  "rgqrin" "xom0" "xom1" "xom2" ...
 $ period : chr  "DD" "DD" "DD" "DD" ...
 $ RR2    : num  0.7139 0.2618 0.0743 -0.2736 0.1986 ...
 $ RRA    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ RRE    : num  -0.0734 -0.1276 -0.2119 -0.0176 -0.1266 ...
 $ RMAE   : num  16 142 241 198 210 ...
 $ MR2    : num  7.14e-01 -1.59e+06 -1.36e+06 -1.01e+08 -6.19e+06 ...
 $ MRA    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ MRE    : num  -0.0734 -0.09 -0.0974 0.106 -0.0317 ...
 $ MRS    : num  -0.231 1.409 1.614 1.616 1.748 ...
 $ MCC    : num  0.85 0.494 0.649 0.324 0.784 ...
 $ MAR    : num  0.0734 0.1853 0.2394 0.3238 0.2316 ...
 $ AKG    : num  0.715 -1.01 -1.066 -1.289 -1.042 ...
 $ ASCKG  : num  0.5563 -0.0642 0.0309 -0.1283 0.0656 ...
 $ SR2    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ SRA    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ SRE    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ TAU    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ MD2    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ MDA    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ MKG    : num  0.714 -15.961 -29.733 -26.346 -4.828 ...
 $ MNR    : num  -9999 -9999 -9999 -9999 -9999 ...
 $ MNW    : num  0.71488 -0.6547 -0.00248 -0.58081 0.32692 ...
 $ NDATREG: num  701 53 58 36 55 ...
 $ NSUBME : num  1 27 32 18 28 20 19 21 5 496
 - attr(*, "simulation.number")= num 1
 - attr(*, "total.criteria")= num 0.532
 - attr(*, "number.of.criterions")= num 10
 - attr(*, "individual.criterions")= num  0.71488 -0.6547 -0.00248 -0.58081 0.32692 ...
 - attr(*, "column.key")='data.frame':  29 obs. of  2 variables:
  ..$ short: chr  "CRIT" "rvar" "cvar" "period" ...
  ..$ long : chr  "Criteria value" "observed variable" "simulated variable" "aggregation period" ...

rcapell commented 3 years ago

Why do you "move data numbers for regional and mean/median criteria to attribute vectors, if they exist" - they could as well be in the data frame for easier access.

Formal reasons only: they aren't criteria. But I have no strong opinion on this, and can leave them in the dataframe for faster access. Comes with next update...

rcapell commented 3 years ago

actually, I think our data frames are transposed. Which one is more useful can be discussed, maybe we could have both methods as option? In your case you have the variable-pair as columns names, and provide the criteria definition in the first NAME column. So it might be some work to findout in which row-column a specific criteria is located for a specific evaluation. In my case, I use the HYPE criteria abbreviations as columns names, and put one line for each criteria. So I can directly read out a specific criteria from a given column. But I still need to check on which row I have a specific evaluation variable pair.

Moreover, my function is not general as it is now - I just made it work on the files I work on for the moment.

One advantage with your sorting is that it is comparable to subass dataframes, just with row-wise variable pairs instead of subbasins. Since you are the only user so far, I will adapt the function to your structure.

Btw, do you pull from master, or do you use a release version of HYPEtools? Just asking to see if you can test the packaged function immediately.

rcapell commented 3 years ago

Your simass has a header part "Individual criterions" which ReadSimass does not handle because I didn't know it existed. The HYPE wiki pages should be a bit more detailed for this one.

dgustafsson commented 3 years ago

Yes, Lotta makes her best to keep the wiki up-to-date with the HYPE code - maybe this new header is just in the trunk now and not yet in a release version.

I usually install release version of the HYPEtools - but I could try to pull from the master.

dgustafsson commented 3 years ago

also, please note that I had to make up some of the criteria abbreviations myself, since they are currently only in the simass and not in the allsim. But most of them are there already.

rcapell commented 3 years ago

also, please note that I had to make up some of the criteria abbreviations myself, since they are currently only in the simass and not in the allsim. But most of them are there already.

I'll just copy your list... :) :+1:

rcapell commented 3 years ago

I usually install release version of the HYPEtools - but I could try to pull from the master.

Yes, please!

rcapell commented 3 years ago

Function updated. @dgustafsson please check and close this issue if it works for you (take also a look at ?ReadSimass). Note that some names are slightly different, and I chose to only add the long names as attribute, since the short names are already the in there as column names.

> library(HYPEtools)
> te <- ReadSimass(filename = "simass_david.txt")
> te; str(te)
   VAR.OBS VAR.SIM PERIOD      CRIT        RR2 RRA           RRE        RMAE           MR2 MRA           MRE         MRS       MCC          MAR
1   RGQRIN  RGQCIN      1  0.864076  0.7912192  NA -0.0003572253  12.4628900  7.912192e-01  NA -0.0003572253 -0.07986564 0.8900149 0.0003572253
2     XOM0    SNOW      1  0.042483 -0.4324844  NA  0.0894796400 197.2580000 -8.702523e+05  NA  0.1416273000  0.94831880 0.6801058 0.2692583000
3     XOM1    SNOW      1 -0.706594 -0.1238468  NA  0.0300026440 251.2287000 -3.174128e+06  NA  0.1765938000  2.11882000 0.5199575 0.3620609000
4     XOM2    SNOW      1  0.028393 -2.5348170  NA  0.3168145000 307.7719000 -7.523222e+06  NA  0.4561169000  0.45876510 0.7805144 0.5986000000
5     XOM3    SNOW      1 -0.793188 -1.1334970  NA  0.2943464000 303.8858000 -1.060366e+07  NA  0.4418611000  2.13051200 0.1121179 0.5308743000
6     XOM4    SNOW      1 -0.758312 -1.0691030  NA  0.7006303000 496.1250000 -1.279592e+06  NA  1.7223780000  2.39803700 0.5797362 1.7978190000
7     XOM5    SNOW      1  0.352031  0.3068787  NA  0.0302929580 152.0165000 -2.176184e+06  NA  0.1311511000  1.32106300 1.0000010 0.2775741000
8     XOM6    SNOW      1 -0.107132 -0.6741910  NA -0.2224605000 207.6275000 -1.079606e+07  NA -0.0314483610  1.94134700 0.2888376 0.3788203000
9     RSNW    SDEP      1 -0.315323 -0.5683091  NA  0.9155864000  27.1072400 -1.298777e+00  NA  1.8147230000  0.88713530 0.8348566 1.8147230000
10    RFSC    CFSC      1  0.877641  0.7816473  NA -0.0905570240   0.1031572  7.012007e-01  NA -0.0804604070 -0.01211407 0.9053226 0.1005598000
          AKG       ASCKG SR2 SRA SRE SBIAS SRMSE TAU         MD2 MDA         MKG        MNR         MNW NDATREG NSUBME
1   0.8640758  0.76068100  NA  NA  NA    NA    NA  NA   0.7912192  NA  0.86407580 0.06209185   0.7912193     730      1
2  -0.5090607 -0.00758864  NA  NA  NA    NA    NA  NA  -2.8428130  NA  0.04248300 0.28092030  -3.8319060      61     31
3  -1.6216440 -0.10323430  NA  NA  NA    NA    NA  NA  -6.5355150  NA -0.70659380 0.34778230 -19.5315200      63     31
4  -0.3509475  0.06487172  NA  NA  NA    NA    NA  NA  -3.7858330  NA  0.02839289 0.33119350  -0.1582873      38     20
5  -1.7820260 -0.26951890  NA  NA  NA    NA    NA  NA -25.3475600  NA -0.79318770 0.47767260 -20.7095400      58     30
6  -2.2894860 -0.22568740  NA  NA  NA    NA    NA  NA  -9.3490020  NA -0.75831190 0.50588980 -19.3249400      49     25
7  -0.5387785  0.15672500  NA  NA  NA    NA    NA  NA  -0.4864217  NA  0.35203140 0.23205310  -7.6148040      56     29
8  -1.8229060 -0.01845478  NA  NA  NA    NA    NA  NA  -4.6886790  NA -0.10713190 0.25110890 -31.8774000      53     25
9  -1.0622980 -0.21341400  NA  NA  NA    NA    NA  NA  -0.9852419  NA -0.31532300 0.36677150  -0.6400305    2753      5
10  0.8475954  0.75212010  NA  NA  NA    NA    NA  NA   0.8211232  NA  0.87764060 0.18951570   0.8110875  143582    496
'data.frame':   10 obs. of  29 variables:
 $ VAR.OBS: chr  "RGQRIN" "XOM0" "XOM1" "XOM2" ...
 $ VAR.SIM: chr  "RGQCIN" "SNOW" "SNOW" "SNOW" ...
 $ PERIOD : num  1 1 1 1 1 1 1 1 1 1
 $ CRIT   : num  0.8641 0.0425 -0.7066 0.0284 -0.7932 ...
 $ RR2    : num  0.791 -0.432 -0.124 -2.535 -1.133 ...
 $ RRA    : num  NA NA NA NA NA NA NA NA NA NA
 $ RRE    : num  -0.000357 0.08948 0.030003 0.316814 0.294346 ...
 $ RMAE   : num  12.5 197.3 251.2 307.8 303.9 ...
 $ MR2    : num  7.91e-01 -8.70e+05 -3.17e+06 -7.52e+06 -1.06e+07 ...
 $ MRA    : num  NA NA NA NA NA NA NA NA NA NA
 $ MRE    : num  -0.000357 0.141627 0.176594 0.456117 0.441861 ...
 $ MRS    : num  -0.0799 0.9483 2.1188 0.4588 2.1305 ...
 $ MCC    : num  0.89 0.68 0.52 0.781 0.112 ...
 $ MAR    : num  0.000357 0.269258 0.362061 0.5986 0.530874 ...
 $ AKG    : num  0.864 -0.509 -1.622 -0.351 -1.782 ...
 $ ASCKG  : num  0.76068 -0.00759 -0.10323 0.06487 -0.26952 ...
 $ SR2    : num  NA NA NA NA NA NA NA NA NA NA
 $ SRA    : num  NA NA NA NA NA NA NA NA NA NA
 $ SRE    : num  NA NA NA NA NA NA NA NA NA NA
 $ SBIAS  : num  NA NA NA NA NA NA NA NA NA NA
 $ SRMSE  : num  NA NA NA NA NA NA NA NA NA NA
 $ TAU    : num  NA NA NA NA NA NA NA NA NA NA
 $ MD2    : num  0.791 -2.843 -6.536 -3.786 -25.348 ...
 $ MDA    : num  NA NA NA NA NA NA NA NA NA NA
 $ MKG    : num  0.8641 0.0425 -0.7066 0.0284 -0.7932 ...
 $ MNR    : num  0.0621 0.2809 0.3478 0.3312 0.4777 ...
 $ MNW    : num  0.791 -3.832 -19.532 -0.158 -20.71 ...
 $ NDATREG: num  730 61 63 38 58 ...
 $ NSUBME : num  1 31 31 20 30 25 29 25 5 496
 - attr(*, "names.long")= chr [1:29] "Observed variable" "Simulated variable" "Aggregation period" "Individual criterion" ...
 - attr(*, "n.simulation")= num 1
 - attr(*, "crit.total")= num 0.516
 - attr(*, "crit.conditional")= num 0
 - attr(*, "threshold")= num 0

rcapell / HYPEtools

ReadSimass #106