pik-piam / magclass

R package | Data Class and Tools for Handling Spatial-Temporal Data

mselect fails with long list of characters #121

Closed: gabriel-abrahao closed this issue 1 year ago

gabriel-abrahao commented 2 years ago

Apparently mselect cannot subset by a list of elements that is too long. It works up to ~2000 entries but fails with a regex error for longer lists, and coupled REMIND-MAgPIE results have 5272 entries at the time of writing. Below is a reproducible example from the cluster; replace mat with any 3-dimensional magpie object with a large number of entries in the third dimension.

The error happens in .mselectSupport. It creates a regex that matches all the names in the list, which is then used by grep to find which elements match it.
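
As a rough illustration of the mechanism described above, the pattern construction might look like the following minimal sketch. It is not the actual .mselectSupport code: escapeRegex is a hypothetical helper, and the item names are a small subset taken from the error message in this report.

```r
# Hypothetical sketch; NOT the actual .mselectSupport implementation.
# escapeRegex is an illustrative helper, not a magclass function.
escapeRegex <- function(x) {
  # Prefix regex metacharacters with a backslash
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

allNames <- c("Costs (million US$05/yr)",
              "Costs Accounting (million US$05/yr)",
              "Costs Accounting|+|AEI (million US$05/yr)")
wanted <- allNames[c(1, 3)]

# One alternative per requested name, anchored to match whole names only
pattern <- paste0("^(", paste(escapeRegex(wanted), collapse = "|"), ")$")

# Indices of the elements of allNames that match the combined pattern
matched <- grep(pattern, allNames)
```

Because every requested name adds one alternative, the pattern grows linearly with the number and length of the names passed in.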

The likely problem is that this regex can become huge if many different names are passed (or if the names are long, for that matter). By default, grep uses POSIX extended regular expressions. As per the POSIX documentation:

The implementation shall support any regular expression that does not exceed 256 bytes in length.

It apparently works beyond that too, but the pattern in the example has 337741 bytes, which seems to be too much. In the test it first fails at around 2500 variable names, which corresponds to a pattern of about 180 KB. Using PCRE via perl = TRUE in grep unfortunately yields the same result; PCRE apparently also has a limit of around 64 KB in the C library implementation.

library(magclass)
library(magrittr) # provides the %>% pipe used below
mat <- read.magpie("/p/projects/piam/abrahao/scratch/impacts/remind/test_mat.rds")
str(mat)
# ==========================================================
# Formal class 'magpie' [package "magclass"] with 1 slot
#   ..@ .Data: num [1:13, 1:19, 1:5272] 0.842 0.751 0.719 0.709 0.922 ...
#   .. ..- attr(*, "dimnames")=List of 3
#   .. .. ..$ region  : chr [1:13] "CAZ" "CHA" "EUR" "IND" ...
#   .. .. ..$ year    : chr [1:19] "y2005" "y2010" "y2015" "y2020" ...
#   .. .. ..$ variable: chr [1:5272] "Biodiversity|BII (unitless)" "Costs (million US$05/yr)" "Costs Accounting (million US$05/yr)" "Costs Accounting|+|AEI (million US$05/yr)" ...
# ==========================================================

varlist <- getItems(mat, dim = "variable") # Character vector with 5272 variable names
mselect(mat, variable = varlist) %>% str # Fails
# ==========================================================
# Error in grep(search, names) : 
#   invalid regular expression '^(Biodiversity\|BII \(unitless\)|Costs \(million US\$05/yr\)|Costs Accounting \(million US\$05/yr\)|Costs Accounting\|\+\|AEI \(million US\$05/yr\)|Costs Accounting\|\+\|Biodiversity value loss \(million US\$05/yr\)|Costs Accounting\|\+\|Forestry \(million US\$05/yr\)|Costs Accounting\|\+\|GHG Emissions \(million US\$05/yr\)|Costs Accounting\|\+\|Harvesting natural vegetation \(million US\$05/yr\)|Costs Accounting\|\+\|Input Factors \(million US\$05/yr\)|Costs Accounting\|\+\|Land Conversion \(million US\$05/yr\)|Costs Accounting\|\+\|Land transition matrix \(million US\$05/yr\)|Costs Accounting\|\+\|MACCS \(million US\$05/yr\)|Costs Accounting\|\+\|N Fertilizer \(million US\$05/yr\)|Costs Accounting\|\+\|P Fertilizer \(million US\$05/yr\)|Costs Accounting\|\+\|Peatland \(million US\$05/yr\)|Costs Accounting\|\+\|Peatland GHG emisssions \(million US\$05/yr\)|Costs Accounting\|\+\|Processing \(million US\$05/yr\)|Costs Accounting\|\+\|Punishment cos
# ==========================================================

mselect(mat, variable = varlist[1:10]) %>% str # Works
mselect(mat, variable = varlist[1:2000]) %>% str # Works
mselect(mat, variable = varlist[1:2500]) %>% str # Fails again
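
When only exact, complete names need to be matched, one possible way around the pattern-length limit is to compare strings directly instead of compiling one large regex. The following is a minimal sketch under that assumption, using made-up item names (allNames, wanted) in place of a real magpie object; it is not necessarily how the issue was fixed in magclass.

```r
# Hypothetical item names standing in for the third dimension of a large object
allNames <- sprintf("Variable|%04d (unit)", 1:5000)
wanted <- allNames[c(1, 2500, 5000)]

# Exact string comparison: no regular expression is compiled,
# so the number of requested names does not hit a pattern-length limit
idx <- which(allNames %in% wanted)

# For comparison: size in bytes of an alternation pattern covering all names
# (escaping omitted here, so the real pattern would be even longer)
pattern <- paste0("^(", paste(allNames, collapse = "|"), ")$")
patternBytes <- nchar(pattern, type = "bytes")
```

This only covers whole-name matches; any selection that relies on partial or pattern-based matching would still need a different approach, such as splitting the names into smaller chunks.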

tscheypidi commented 1 year ago

Fixed in commit db04afb006e633b3782999941aceefd2da128e72