r-lib / fastmap

Fast map implementation for R
https://r-lib.github.io/fastmap/
Other
133 stars 7 forks source link

Saving a fastmap and constant fastmap #33

Closed stla closed 1 year ago

stla commented 1 year ago

Hello,

Am I right or wrong to believe that a fastmap works with external pointers, and therefore cannot be saved as a rds file?

I want to save a fastmap for a package. My fastmap represents the number sqrt(2) for example. Moreover it would be better if it were constant, that is to say if it had only the get and keys methods. I'm wondering whether it would be better to create a simple list with two fields get and keys instead of using a fastmap.


Edit

But this fastmap must be cloneable too.

wch commented 1 year ago

The fastmap does use an external pointer to a map data structure in C++. However, it also can be saved as an rds file. When a fastmap object is loaded from an rds, the first time you call a method on it, it will rebuild the C++ data structure from the R data structures which can be restored from the rds file. See https://github.com/r-lib/fastmap#serialization for more info.

I don't understand how your fastmap represents a number like sqrt(2). But generally speaking, if you have a small number of keys, the method you describe should be fine. Just keep in mind that if you are using a named list as the backing store, R will do a linear search for keys. This is fine for relatively small number of keys, but will slow down as the list gets larger. You could also use an environment, which like fastmap has a constant lookup time.

stla commented 1 year ago

However, it also can be saved as an rds file.

Awesome. I love R!

I don't understand how your fastmap represents a number like sqrt(2).

My fastmap represents a cyclotomic number. The cyclotomic numbers are complex numbers. They form a field and they contain all square roots of integers (and more). Mathematically this is a polynomial, encoded as a fastmap. The interest is that you can do exact calculations, e.g. sqrt(2) * sqrt(3) == sqrt(6) is TRUE, there's no conversion to float.

Thanks!

stla commented 1 year ago

Hi @wch

I finished my R package (cyclotomic, soon on CRAN). Finally I don't use fastmap. I needed an ordered container with integer keys for cyclotomic, then I did another package, intmap (on CRAN now), which wraps the Boost flat_map. The intmap is slower than a list in general, but it is implemented as a R6 class and it can be more convenient than a list in some situations. On the Rcpp side, the intmap is implemented as a Rcpp module, I'm wondering whether this is why it is slower than a list.

In the description of fastmap, you say that a flatmap is better than a list from a memory perspective. Could you please tell me how do you assess that? Is there a R package to measure the "memory performance"? (I'm totally ignorant in this domain, I don't know what is a "memory leak").

wch commented 1 year ago

In the description of fastmap, you say that a flatmap is better than a list from a memory perspective.

Do mean fastmap instead of flatmap?

In the README, I say that fastmap does not leak memory, as opposed to an environment, not a list.

Environments are the R data structure that is best suited for use as a key-value store. Named lists are generally not as good as environments, because with lists, (A) it does a linear search to find the key, and (B) adding or removing elements can result R making a copy of the list, which can be slow.

However, for small sets of items, named lists might be as fast or faster than environments because there is less overhead in these operations. If you want to understand this stuff, you should learn about big O notation and linear search vs. hash tables.

This example demonstrates the memory leak: https://github.com/r-lib/fastmap#memory-leak-examples

If you need to learn what a memory leak is, there are resources on the internet that do a better job explaining it than I can.

stla commented 1 year ago

Ok, thanks. Yes I meant fastmap. The flat_map is the map of my package, from Boost. I will take a look to hash tables and to your link.

stla commented 1 year ago

On my Linux, with your first code (the one with a regular environment), I get something different than the results you show: the mem_used is constant, and the memory leaked is low (112 bytes). With my intmap, the elapsed time is indeed higher, and the memory leaked is 0 byte.

wch commented 1 year ago

On my Mac and on Linux (with the latest version of R-devel) I still get the memory leak.

On Linux:

> sessionInfo()
R Under development (unstable) (2023-04-16 r84269)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/local/RD/lib/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fastmap_1.1.1 pryr_0.1.6   

loaded via a namespace (and not attached):
 [1] compiler_4.4.0   magrittr_2.0.3   cli_3.6.1        tools_4.4.0     
 [5] parallel_4.4.0   glue_1.6.2       Rcpp_1.0.10      stringi_1.7.12  
 [9] codetools_0.2-19 stringr_1.5.0    lifecycle_1.0.3  rlang_1.1.0     

> library(pryr)
gc()
start_mem <- mem_used()
start_time <- as.numeric(Sys.time())
for (i in 1:8) {
  cat(i, ": ", sep = "")
  print(mem_used())
  e <- new.env(parent = emptyenv())
  for (j in 1:10000) {
    # Generate random key
    x <- as.character(runif(1))
    exists(x, envir = e, inherits = FALSE)
  }
  rm(e, x)
}
end_time <- as.numeric(Sys.time())
gc()
end_mem <- mem_used()
cat("Elapsed time:", round(end_time - start_time, 1), "seconds\n")
cat("Memory leaked:", end_mem - start_mem, "bytes\n")
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 459604 24.6     977744 52.3   670820 35.9
Vcells 829506  6.4    8388608 64.0  1869433 14.3
1: 36.8 MB
2: 39 MB
3: 41 MB
4: 43 MB
5: 45 MB
6: 47 MB
7: 49 MB
8: 51 MB
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  761752 40.7    1213292 64.8  1084541   58
Vcells 1288630  9.9    8388608 64.0  4449608   34
Elapsed time: 2 seconds
Memory leaked: 20589896 bytes

I would be surprised if you are not seeing a memory leak. What is your sessionInfo, and are you sure you're running that exact same code?

stla commented 1 year ago

Indeed.

> library(pryr)
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  808656 43.2    1548232 82.7  1193947 63.8
Vcells 1927909 14.8    8388608 64.0  4172681 31.9
> start_mem <- mem_used()
> start_time <- as.numeric(Sys.time())
> for (i in 1:8) {
+     cat(i, ": ", sep = "")
+     print(mem_used())
+     e <- new.env(parent = emptyenv())
+     for (j in 1:10000) {
+         # Generate random key
+         x <- as.character(runif(1))
+         exists(x, envir = e, inherits = FALSE)
+     }
+     rm(e, x)
+ }
1: 60.8 MB
2: 63 MB
3: 65 MB
4: 67 MB
5: 69.5 MB
6: 71.5 MB
7: 73.5 MB
8: 75.5 MB
> end_time <- as.numeric(Sys.time())
> gc()
          used (Mb) gc trigger  (Mb) max used (Mb)
Ncells 1052179 56.2    1897878 101.4  1373544 73.4
Vcells 2319993 17.8    8388608  64.0  5480731 41.9
> end_mem <- mem_used()
> cat("Elapsed time:", round(end_time - start_time, 1), "seconds\n")
Elapsed time: 1.1 seconds
> cat("Memory leaked:", end_mem - start_mem, "bytes\n")
Memory leaked: 16774168 bytes

Yesterday I changed runif(1) to rpois(1, 100), that's the only difference.

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.3

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=fr_BE.UTF-8       LC_NUMERIC=C               LC_TIME=fr_BE.UTF-8       
 [4] LC_COLLATE=fr_BE.UTF-8     LC_MONETARY=fr_BE.UTF-8    LC_MESSAGES=fr_BE.UTF-8   
 [7] LC_PAPER=fr_BE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_BE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pryr_0.1.6

loaded via a namespace (and not attached):
 [1] compiler_4.2.3   magrittr_2.0.3   cli_3.6.1        tools_4.2.3      glue_1.6.2      
 [6] rstudioapi_0.14  Rcpp_1.0.10      stringi_1.7.12   codetools_0.2-19 stringr_1.5.0   
[11] lifecycle_1.0.3  rlang_1.1.0