vz-risk / verisr

R package for working with VERIS data

Querying Multiple Enumerations does not work #18

Closed naif-alsaleh closed 3 years ago

naif-alsaleh commented 5 years ago

This feature was working just fine in the previous project jayjacobs/verisr

naif-alsaleh commented 5 years ago

This is the error message:

Warning message: In grep(paste0("^", enum, "[.][A-Z0-9][^.]*$"), names(subdf), value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
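For context, a minimal sketch of why this warning appears when more than one enumeration is passed. The column names below are hypothetical stand-ins for a verisr data frame, and the per-pattern loop is only one possible workaround, not the package's own fix.

```r
# Hypothetical column names in the style of a verisr data frame.
cols <- c(
  "action.hacking.variety.SQLi",
  "action.hacking.variety.Unknown",
  "actor.external.motive.Financial",
  "timeline.incident.year"
)

# Two enumerations requested at once.
enum <- c("action.hacking.variety", "actor.external.motive")

# paste0() is vectorised, so two enumerations yield two patterns.
patterns <- paste0("^", enum, "[.][A-Z0-9][^.]*$")

# Passing `patterns` straight to grep() triggers the warning quoted above,
# because grep() only uses patterns[1]. Matching one pattern at a time
# avoids that and keeps the results separated per enumeration.
matches <- lapply(patterns, grep, x = cols, value = TRUE)
names(matches) <- enum
```

Here `matches` is a list with one character vector of column names per enumeration, so no pattern is silently dropped.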

onlyphantom commented 5 years ago

I don't believe it's maintained anymore, but I rewrote a lot of the functions in pure base R (the underlying code for this package is data.table), using dplyr only where necessary. All the core functionality, plus some extras, is supported and compatible with later versions of R (tested on R 3.5+).

You can install the package from here: https://github.com/onlyphantom/verisr2 and feel free to open any issues!

gdbassett commented 4 years ago

Odd, I didn't realize I wasn't getting alerts for issues on verisr. I do maintain it. I had already removed most of the data.table code, though I left it in the import script json2veris(), as a data.table is easier to populate. (We have to populate several-hundred K versions.)

The issue with enumerating 'pattern' can arise because Jay originally had both a 'pattern' column and 'pattern.X' columns. The problem is that 'pattern' is exclusive while the 'pattern.X' columns are not (there are pattern overlaps). I believe I removed the creation of the 'pattern' column from json2veris() to help prevent that.

There is another issue: Jay's getenum() function could take multiple enumerations, but it was really just treating the data as sets (all combinations of enum1 and enum2). The problem is two-fold: this can continue to scale (e.g. all combinations of enums 1, 2, and 3...), and it prevents intelligent calculation of the sample size. (For example, the getenumCI() function counts incidents with only 'Unknown' in the enumeration as 'unmeasured' and doesn't include them in the sample size.) As such, when we need to calculate across sets, we generally do it manually (it's rare, though it does happen). The rest of the time we use getenumCI().
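To make the two points concrete, here is a rough base R sketch of counting across sets by hand. The toy data frame and its column names are hypothetical, and the sample-size rule is only an approximation of the 'Unknown' handling described above, not the actual getenumCI() implementation.

```r
# Toy incident table; hypothetical logical columns in verisr's style.
df <- data.frame(
  action.Hacking = c(TRUE,  TRUE,  FALSE, FALSE),
  action.Malware = c(FALSE, TRUE,  FALSE, TRUE),
  action.Unknown = c(FALSE, FALSE, TRUE,  FALSE),
  actor.External = c(TRUE,  TRUE,  FALSE, TRUE),
  actor.Internal = c(FALSE, FALSE, TRUE,  FALSE)
)

action_cols <- grep("^action[.]", names(df), value = TRUE)
actor_cols  <- grep("^actor[.]",  names(df), value = TRUE)

# Every combination of the two enumerations; the number of rows grows
# multiplicatively with each additional enumeration.
combos <- expand.grid(enum1 = action_cols, enum2 = actor_cols,
                      stringsAsFactors = FALSE)

# Count incidents where both columns of a combination are TRUE.
combos$x <- mapply(function(a, b) sum(df[[a]] & df[[b]]),
                   combos$enum1, combos$enum2, USE.NAMES = FALSE)

# Sample size for 'action' alone: incidents with at least one value other
# than 'Unknown'. Incidents with only 'Unknown' count as unmeasured.
known_action <- setdiff(action_cols, "action.Unknown")
n_action <- sum(rowSums(df[, known_action, drop = FALSE]) > 0)
```

With three action columns and two actor columns there are already six combinations to count, and the single-enumeration sample size (`n_action`) does not carry over to the combined sets in any obvious way.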

If you still want to use Jay's 2-enum function, you should be able to use getenum2(). It is still there and takes enum, primary, and secondary. If you have any questions, hopefully responding to this will ensure I actually see the alerts.