greta-dev / greta

simple and scalable statistical modelling in R
https://greta-stats.org

get_unique_name() in node_class.R might not be unique #366

Open njtierney opened 3 years ago

njtierney commented 3 years ago

This may be unlikely to happen, or to cause problems in practice, but the rhex() function as defined isn't guaranteed to create a unique name when there are very many nodes (say, 1 million).

This is used in node_class.R.

See example below.

n_rhex <- 1e6

# generate a random 8-digit hexadecimal string
rhex <- function() paste(as.raw(sample.int(256L, 4, TRUE) - 1L), collapse = "")

many_rhex <- replicate(n = n_rhex, expr = rhex(), simplify = "vector")

dplyr::n_distinct(many_rhex)
#> [1] 999874

dplyr::n_distinct(many_rhex) == n_rhex
#> [1] FALSE

Created on 2021-04-08 by the reprex package (v2.0.0)
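
For a rough sense of scale, the duplicates above are about what a birthday-problem estimate predicts. rhex() draws 4 random bytes, so there are 2^32 possible names. The sketch below assumes the names are drawn uniformly and independently; the helper names `expected_collisions()` and `p_any_collision()` are made up for illustration and are not part of greta.

```r
# Expected number of colliding pairs among n names drawn uniformly
# from n_values possibilities (2^32 for an 8-digit hexadecimal string)
expected_collisions <- function(n, n_values = 2^32) {
  choose(n, 2) / n_values
}

# Approximate probability of at least one collision (birthday approximation)
p_any_collision <- function(n, n_values = 2^32) {
  -expm1(-choose(n, 2) / n_values)
}

# With 1 million names we expect roughly 116 colliding pairs
expected_collisions(1e6)

# Even 10,000 names give a chance of a clash on the order of 1%
p_any_collision(1e4)
```

The observed 126 duplicates in the reprex are in line with this estimate, so the collisions look like a property of the 32-bit name space rather than a quirk of one run.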

Perhaps digest or something like https://github.com/coolbutuseless/xxhashlite could be used to give nodes unique IDs

njtierney commented 3 years ago

Or https://github.com/reside-ic/ids, or do something like what reprex did (https://github.com/tidyverse/reprex/blob/d2996e01f045b04cd537653a39deece1025dbf35/R/aaa.R), but this might not be unique enough.

njtierney commented 3 years ago

This is currently being worked on here: https://github.com/njtierney/greta/tree/unique-names

njtierney commented 2 years ago

There is an issue where an error appears:

Error in distrib_constructor(tf_parameter_list, dag = self) : could not find function "distrib_constructor"

Which means it is not finding

https://github.com/greta-dev/greta/blob/112a96804170d7cccdb76b1f413cdbbb23f0738d/R/dag_class.R#L410

It makes me wonder whether this error is related to the non-unique names described in this issue. We have not yet been able to develop a small, reliable reprex for the error, so the two might be unrelated.

njtierney commented 4 months ago

A note on using hashing, for example with secretbase, which is what targets uses internally. Hashing is deterministic: as long as no two nodes are identical this will work, but if two nodes/R6 objects are identical, their hashes will be identical too. So I guess the idea is that as long as the inputs aren't identical, it should be OK.

n_rhex <- 1e6

# generate a random 8-digit hexadecimal string
rhex <- function() paste(as.raw(sample.int(256L, 4, TRUE) - 1L), collapse = "")

many_rhex <- function(x) replicate(n = x, expr = rhex(), simplify = "vector")

rhexes <- many_rhex(n_rhex)

dplyr::n_distinct(rhexes)
#> [1] 999883

dplyr::n_distinct(rhexes) == n_rhex
#> [1] FALSE

many_siphash <- function(n) {
  vapply(
    X = seq_len(n),
    FUN = secretbase::siphash13,
    FUN.VALUE = ""
  )
}

many_siphashes <- many_siphash(n_rhex)

dplyr::n_distinct(many_siphashes)
#> [1] 1000000

dplyr::n_distinct(many_siphashes) == n_rhex
#> [1] TRUE

Created on 2024-05-28 with reprex v2.1.0

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24)
#>  os       macOS Sonoma 14.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Australia/Hobart
#>  date     2024-05-28
#>  pandoc   3.1.13 @ /opt/homebrew/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
#>  digest        0.6.35  2024-03-11 [1] CRAN (R 4.4.0)
#>  dplyr         1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.4.0)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  knitr         1.46    2024-04-06 [1] CRAN (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.4.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.4.0)
#>  R.oo          1.26.0  2024-01-24 [1] CRAN (R 4.4.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>  reprex        2.1.0   2024-01-11 [1] CRAN (R 4.4.0)
#>  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
#>  rmarkdown     2.26    2024-03-05 [1] CRAN (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
#>  secretbase    0.5.0   2024-04-25 [1] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  styler        1.10.3  2024-04-07 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>  xfun          0.44    2024-05-15 [1] CRAN (R 4.4.0)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#>
#>  [1] /Users/nick/Library/R/arm64/4.4/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

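
To make the caveat about identical inputs concrete, here is a minimal sketch. The two lists are stand-ins for identical nodes/R6 objects, and hashing a creation index alongside the contents is only one possible workaround, shown as an assumption rather than anything greta does. It relies on secretbase::siphash13() serialising non-character objects before hashing them.

```r
# Two nodes with identical contents (stand-ins for identical R6 objects)
node_a <- list(value = 1, dim = c(1L, 1L))
node_b <- list(value = 1, dim = c(1L, 1L))

# Hashing is deterministic, so identical inputs give identical hashes
hash_a <- secretbase::siphash13(node_a)
hash_b <- secretbase::siphash13(node_b)
identical(hash_a, hash_b)
#> [1] TRUE

# Hypothetical workaround: hash the contents together with a creation index,
# so that otherwise identical nodes no longer share the same input
hash_a_indexed <- secretbase::siphash13(list(node_a, index = 1L))
hash_b_indexed <- secretbase::siphash13(list(node_b, index = 2L))
identical(hash_a_indexed, hash_b_indexed)
```

A 64-bit hash can still collide in principle, so this makes clashes far less likely than with the 32-bit rhex() names rather than impossible.
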
njtierney commented 4 months ago

Other alternatives:

https://github.com/coolbutuseless/cryptorng, or {digest}?

njtierney commented 2 months ago

Some ideas on debugging this.

greta_stash$object_counter <- 0L

# generate a unique name for each node.
rhex <- function() {
  count <- greta_stash$object_counter + 1L
  greta_stash$object_counter <- count
  count
  # paste(as.raw(sample.int(256L, 4, TRUE) - 1L), collapse = "")
}

So we get a sense of how many objects are created?
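
If the counter turns out to be useful beyond debugging, the same idea could also produce the names themselves. This is a minimal sketch, assuming a stash environment like the one above; `name_stash` and `next_name()` are hypothetical names, not greta's own code. Formatting the count as 8 hexadecimal digits keeps the shape of the current names while guaranteeing uniqueness within a session, up to the integer limit.

```r
# Hypothetical stand-in for greta's internal stash environment
name_stash <- new.env()
name_stash$object_counter <- 0L

# Return the next counter value as an 8-digit hexadecimal string
next_name <- function() {
  count <- name_stash$object_counter + 1L
  name_stash$object_counter <- count
  sprintf("%08x", count)
}

first_names <- vapply(1:3, function(i) next_name(), FUN.VALUE = "")
first_names
#> [1] "00000001" "00000002" "00000003"
```

The trade-off is that names become predictable and depend on creation order, which may or may not matter for how nodes are looked up in the DAG.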