r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/

Garbage collector not working over SF objects #1972

Closed · latot closed this 1 year ago

latot commented 2 years ago

Hi, I found this while playing with objects; I noticed the following:

library(sf)
# Eats 8 GB of RAM
a <- st_read("tmp.gpkg")
# Erase the data
a <- NULL
# Checking the RAM monitor: nothing changes, the 8 GB are still in use
# Let's run the garbage collector
gc()
# Only reduced by ~800 MB, still several GB in use

The tmp.gpkg is a file I prepared just to make this easy to notice: a small file repeatedly bind-rowed with itself, 1.6 GB on disk and 8 GB in RAM.

I work with a lot of data, and the amount of RAM sf can use is a real problem, especially because R is not able to clean up the objects. I did other tests with the same result, all running gc() too.

The only way I can run my scripts is to split the code into several R files and call them one by one with Rscript, saving the intermediate results; closing R is the only way I have found to actually release the memory of objects that are no longer needed (roughly as sketched below).
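Roughly, the workaround looks like this (the file names and the centroid step are just placeholders, not the real workflow):

# step1.R: read the heavy file, do the heavy work, save a smaller result,
# then quit, so the OS gets all the memory back when the process exits.
library(sf)
a <- st_read("tmp.gpkg")
res <- st_centroid(st_geometry(a))   # placeholder for the heavy processing
saveRDS(res, "step1_result.rds")

# step2.R: continue from the (smaller) intermediate result only.
res <- readRDS("step1_result.rds")
# ... further processing ...

# From a shell: Rscript step1.R && Rscript step2.R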

Linux (Gentoo, 64-bit, systemd), R 4.2, sf 1.0-7

Thx!

rsbivand commented 2 years ago

Which OS platform and version? Is this https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-is-R-apparently-not-releasing-memory_003f

latot commented 2 years ago

Sorry! I forgot that; I have updated the main comment above with the versions.

rsbivand commented 2 years ago

So the answer is in the FAQ; if I read it correctly, there is nothing sf or R can do about it.

latot commented 2 years ago

Mm, very interesting info, a hard problem:

> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  698447 37.4    1347204   72   966237 51.7
Vcells 1253831  9.6    8388608   64  1842100 14.1
> b <- sf::st_read("tmp.gpkg", quiet=TRUE)
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  26134448 1395.8   63700347 3402.0  63700347 3402.0
Vcells 159439181 1216.5  329267996 2512.2 315275414 2405.4
> b<-NULL
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  707412 37.8   50960278 2721.6  63700347 3402.0
Vcells 3303099 25.3  263414397 2009.7 315275414 2405.4

So the memory is still used by the process even though the figures in the last gc() are almost back to those of the first one. If we follow the FAQ's remark that "this whole area is always contiguous" (talking about the RAM allocated to R), the picture would be: after the big data is loaded, other R objects get allocated after it, so even once the big data is cleared that region cannot be returned to the system.

latot commented 2 years ago

It is still weird, though. I tried the following: after the code above, the object had been freed, so internally R should have that RAM available even if the OS cannot reclaim it. So I read the file again; the idea is that, if the memory is free for R but not for the system, I should be able to read it again without the total RAM growing. The result: R started requesting even more RAM, so now I'm not sure this is only the OS.

latot commented 2 years ago

Hi! @rsbivand

Welllllllll!!! I found a solution.

Symptom: the only RAM the system can take back is the contiguous block from the highest used address downwards; anything sitting below a still-used address cannot be returned. Since R does not compact its own memory, the RAM in the middle cannot be released by the system.

Solution: run the heavy code in a separate child process that is not recycled (recycling would reuse an existing child to save the time of creating a new one), using a single worker.

future::plan(future.callr::callr, workers = 1)
options(future.globals.maxSize = 10000 * 1024^2)

i_m_a_heavy_sf <- future::value(future::future({
   data <- sf::st_read("big_data.gpkg", quiet=TRUE)
   data$centroid <- sf::st_centroid(sf::st_geometry(data))  # centroids as a geometry column
   #... more heavy stuff
   return(data)
}, gc=TRUE, seed=TRUE, packages="sf"))

This somewhat tricky setup uses future to start child processes with the future.callr::callr plan. That plan opens a new child for each future and closes it when it finishes, so all the RAM the child used, including the part R itself cannot give back, is released to the system, and only the processed result is returned to the main session.

The other future plans don't always release the RAM, and sometimes their behaviour changes with the context.

There are a few things that affect this.

There are still ways to make this method better, but as a starting point for keeping RAM usage clean and low I think it is great. We can even split the process above into several steps and run them one by one with future, as sketched below.
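For example, splitting it into two steps could look something like this (the make-valid and centroid steps are just placeholders for the real work):

future::plan(future.callr::callr, workers = 1)

# Step 1: read and pre-process in a disposable child process.
prepared <- future::value(future::future({
   x <- sf::st_read("big_data.gpkg", quiet = TRUE)
   sf::st_make_valid(x)   # placeholder heavy step
}, gc = TRUE, seed = TRUE, packages = "sf"))

# Step 2: more heavy work in a fresh child; only 'prepared' is shipped to it
# (the future.globals.maxSize option set above still applies).
centroids <- future::value(future::future({
   sf::st_centroid(sf::st_geometry(prepared))
}, gc = TRUE, seed = TRUE, packages = "sf"))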

I don't think this is only an sf thing, but it does affect how sf can work with a lot of data; maybe it's worth writing about this in the docs, along with some measures to avoid eating RAM.

Thx!

rsbivand commented 2 years ago

Good. I don't see anything obvious in the GPKG driver https://gdal.org/drivers/vector/gpkg.html#vector-gpkg that might be used to reduce memory use, but maybe some of the streaming innovations in the development version of GDAL, associated with work on the Arrow driver, might offer assistance in the future? @paleolimbot do you perhaps have any views on this problem? Could you also try in geopandas?

paleolimbot commented 2 years ago

This is an interesting observation (that R-allocated memory is fragmented and unable to be returned to the system), although are you sure that you will actually run out of memory? If R can't give the memory back to the system it should still be free to use by R, so it may not be a problem.

I haven't investigated it properly so take this with a grain of salt, but I believe that sf's strategy for executing any kind of geospatial operation (e.g., intersection) is to copy the entire vector of geometries into s2/GEOS pointers, do the operation and store the result as an array of s2/GEOS pointers, then convert that result into sf objects. That leaves a plausible memory usage at 4 times the memory required for the input (although only half of that is R-allocated memory).

A "right now" solution might be to use the s2 or geos packages directly: convert once, do some series of operations, then convert back at the end (this is more or less what geopandas in Python does, storing geometries as pointers to GEOS geometries). In the distant future we might be able to use a geometry encoding + data structure ('geoarrow') that we can get directly to/from GDAL (and maybe GEOS someday) without any copying at all.

rsbivand commented 2 years ago

@paleolimbot yes, multiple copies could be monitored more in topological operations. Here the case is reading a large file, and I was wondering whether the streaming read updates to OGR in GDAL based on understanding derived from geoarrow or similar might help. Also whether external conversion to arrow then using that column-wise driver rather than the GPKG driver might reduce memory footprint. Generally as I read the FAQ, the memory is sort-of reserved for R until it exits.

paleolimbot commented 2 years ago

I'm less familiar with how sf reads from OGR, although I think that it's feature-wise (like, the geometry for each feature is created and the OGRGeometry discarded, which is the more memory-efficient way to do it). I don't think that the Arrow driver - column-wise or otherwise - would reduce the amount of memory used although it might reduce the amount of R memory used (if the OGR implementation for R effectively uses ALTREP to avoid materializing all the string attribute columns at once and maybe avoid materializing the geometry all at once).

Arrow or otherwise, the approach for working with bigger data is still to read the file in chunks (using the query argument in read_sf() to subset rows or columns with SQL before they end up in R). For now that has to happen manually...the Arrow driver for GDAL might make it easier to use the 'arrow' R package's facilities for this chunking without explicit user intervention.
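For example, something along these lines (the layer name, chunk size and per-chunk processing are placeholders; for GPKG the query is executed by SQLite, so LIMIT/OFFSET work):

library(sf)

chunk_size <- 100000
offset <- 0
results <- list()

repeat {
  q <- sprintf("SELECT * FROM big_layer LIMIT %d OFFSET %d", chunk_size, offset)
  chunk <- read_sf("big_data.gpkg", query = q)
  if (nrow(chunk) == 0) break
  # Process the chunk and keep only the (small) result.
  results[[length(results) + 1]] <- st_centroid(st_geometry(chunk))
  offset <- offset + chunk_size
}

centroids <- do.call(c, results)   # combine the per-chunk results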

rsbivand commented 2 years ago

@latot could you please provide code showing how you generated the 1.6GB GPKG file? Were the geometries points, lines or polygons?

latot commented 2 years ago

Hi, getting a reprex was really slow and hard; even 25 MB of data really eats my RAM when saving and loading, so let's go step by step.

To get the size of objects I'll use pryr::object_size; to know how much memory the process is using I'll check the system monitor, which shows from the system's perspective how much RAM R is using.

First, construct the file to play with:

library(magrittr)

# Seed object: a single two-point linestring as an sf data frame (EPSG:4326).
data <- c(sf::st_point(c(-72, -37)), sf::st_point(c(-72, -37.1))) %>%
    sf::st_linestring() %>%
    sf::st_sfc() %>%
    sf::st_set_crs(4326) %>%
    sf::st_as_sf()

limit <- 0.025 # target object size in GB

# Convert the pretty-printed pryr::object_size() value (e.g. "25.3 MB") to GB.
pryr2num <- function(x){
    m <- x %>%
        format() %>%
        strsplit(" ")
    m <- m[[1]]
    size <- as.numeric(m[[1]])
    if (length(m) == 2){
      uni <- m[[2]]
    } else {
      uni <- "B"
    }
    if (uni == "B") size <- size/10^9
    if (uni == "kB") size <- size/10^6
    if (uni == "MB") size <- size/10^3
    if (uni == "TB") size <- size/10^-3
    size
}

# Keep doubling the data (bind_rows with itself) until it reaches the size limit.
while (TRUE){

    data <- dplyr::bind_rows(data, data)
    s <- pryr::object_size(data) %>% pryr2num()
    if (s >= limit){
        break
    }

}

sf::st_write(data, "temp2.gpkg", append = FALSE)

Here we construct the test file; it has the following properties:

Tests:

- Load the file normally with sf.
- Load the file with future, using the code in https://github.com/r-spatial/sf/issues/1972#issuecomment-1199968742.

OK, now, following the other comments, we would usually expect that if we remove an object or set it to NULL, then even if the RAM stays held from the system's point of view, R should be able to reuse it internally as free memory for other objects.

I ran tests where I load a file, unload the data, load it again, unload it again, and so on, checking the RAM used by R after each step; I also ran similar tests with big matrices instead of spatial files. The results show curves like this:

RAM used by R ≈ A - B/(C + 1)

where C is the number of load/unload loops completed so far, and A and B are constants.

Here is a graph of the general shape: https://www.wolframalpha.com/input?i=1-1%2F%28x%2B1%29

A "loop" here means loading an object and then releasing it, roughly as in the sketch below.
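For reference, the measurement loop is something like this (ps::ps_memory_info() is just one way to read the process's resident memory from inside R, not necessarily what was used here):

rss_gb <- function() as.numeric(ps::ps_memory_info()[["rss"]]) / 1024^3  # RSS in bytes -> GB

usage <- numeric(0)
for (i in 1:20) {
  x <- sf::st_read("temp2.gpkg", quiet = TRUE)  # load the object
  x <- NULL                                     # release it
  gc()
  usage[i] <- rss_gb()                          # RAM held by the R process, as the OS sees it
}
usage  # tends to rise over the first loops and then level off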

In the first loops the RAM clearly increases even though we unload the data, but there is a point where the RAM stops increasing no matter how many more times you run it.

A and B seem to be... random; I can't tell when the RAM will stabilize or at what level, but it tends to happen after a while.

Most of the time this may not have a big effect, but if you run several functions or long/heavy scripts it can become a problem; it seems every function adds its own ceiling of RAM to the total.

In my experience, parallelizing code with R is hard, not because there are no tools for it, but because every child seems to grow its RAM until there is none left; this happens when you work with a lot of data. I have not tested this point deeply, but running furrr with a plan that does not keep the RAM (such as future.callr::callr) seems to help a lot.

There are two main reasons why I use parallel code now.

The trick is, when you have a lot of data, to split the work into small chunks and use as many threads as we can; it is just hard to do this when R's RAM usage is so hard to manage, but when you can manage it, finding a balance between chunk size and cores helps a lot (roughly as sketched below).
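For example, combining chunked reads with a few callr workers could look like this (the layer name, chunk size and per-chunk work are placeholders, not part of my actual tests):

future::plan(future.callr::callr, workers = 4)

chunk_size <- 50000
offsets <- seq(0, 2000000, by = chunk_size)   # assumes a rough idea of the row count

results <- furrr::future_map(offsets, function(off) {
   q <- sprintf("SELECT * FROM big_layer LIMIT %d OFFSET %d", chunk_size, off)
   x <- sf::st_read("big_data.gpkg", query = q, quiet = TRUE)
   if (nrow(x) == 0) return(NULL)
   sf::st_centroid(sf::st_geometry(x))   # placeholder per-chunk work
}, .options = furrr::furrr_options(seed = TRUE))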

That is what I have tested, hope this helps :)

rsbivand commented 2 years ago

The repetition of the same point is not very informative. I ran:

library(sf)
nc <- st_read(system.file("gpkg/nc.gpkg", package="sf"))
object.size(nc) # roughly 140Kb
st_write(nc, "big_nc.gpkg", append=FALSE)
for (i in 1:20000) st_write(nc, "big_nc.gpkg", append=TRUE) # about 1.3 Gb

In a new session:

> big_nc <- sf::st_read("big_nc.gpkg")
Reading layer `big_nc' from data source `/home/rsb/tmp/big_nc.gpkg' using driver `GPKG'
Simple feature collection with 2000100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
> object.size(big_nc)
2483344520 bytes
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  15349387  819.8   29544647 1577.9  29544647 1577.9
Vcells 147546795 1125.7  282584697 2156.0 282584697 2156.0
> rm(big_nc)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  708350 37.9   23635718 1262.3  29544647 1577.9
Vcells 2258900 17.3  226067758 1724.8 282584697 2156.0

where gc() shows the fall in used memory.

For the sake of interest, I converted the file to Arrow and Parquet with ogr2ogr; the files are a good deal smaller, but the sf object is the same size. I may try the development version of GDAL to see whether streaming reading is more efficient. R seemed to use about 7 GB, maybe three times the object size, suggesting that perhaps a copy remains somewhere.
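For reference, the conversion can be done along these lines, called from R via system2() (assuming a GDAL build >= 3.5 with the Arrow and Parquet drivers available):

# Convert the GPKG to (Geo)Parquet and Arrow IPC with ogr2ogr.
system2("ogr2ogr", c("-f", "Parquet", "big_nc.parquet", "big_nc.gpkg"))
system2("ogr2ogr", c("-f", "Arrow", "big_nc.arrow", "big_nc.gpkg"))

# Reading back with sf gives an object of the same in-memory size as from GPKG.
big_nc_pq <- sf::st_read("big_nc.parquet")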

Curiously:

> big_nc <- terra::vect("big_nc.gpkg")
> object.size(big_nc)
1304 bytes
> big_nc
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 2000100, 14  (geometries, attributes)
 extent      : -84.32385, -75.45698, 33.88199, 36.58965  (xmin, xmax, ymin, ymax)
 source      : big_nc.gpkg
Error in (function (x)  : attempt to apply non-function
Error in x$.self$finalize() : attempt to apply non-function
 coord. ref. : lon/lat NAD27 (EPSG:4267) 
 names       :  AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS    FIPSNO CRESS_ID
 type        : <num>     <num> <num>   <num>     <chr> <chr>     <num>    <int>
 values      : 0.114     1.442  1825    1825      Ashe 37009 3.701e+04        5
               0.061     1.231  1827    1827 Alleghany 37005   3.7e+04        3
               0.143      1.63  1828    1828     Surry 37171 3.717e+04       86
 BIR74 SID74 (and 4 more)
 <num> <num>             
  1091     1             
   487     0             
  3188     5             

which shows that the SpatVector is essentially a pointer to objects in C++, not data in the R workspace. Running summary(big_nc$BIR74) brought a good deal into memory, though, so doing anything with the SpatVector object does read it into the R workspace. So FAQ 7.42 is key to understanding what is going on: creating a new R session to do the reading and then dropping that process frees the memory, but the object is gone too.