mhahsler / seriation

Infrastructure for Ordering using Seriation - R Package
GNU General Public License v3.0

seriate preserves names/labels of input data and get_order returns named integers #18

Closed david-barnett closed 1 year ago

david-barnett commented 1 year ago

Hi again,

I would like the integers returned by get_order() to be named according to the names or dist labels of the original data. I believe this may already have been the intended behaviour: there is code inside seriate and its subroutines aimed at preserving the input names, but the names did not always reach the output object.

This pull request makes the preservation of names more consistent, and get_order now always returns named integers: either directly from the named integers, when the ser_permutation_vector has class integer, or from the reordered hclust labels, when the ser_permutation_vector has class hclust.
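The two cases described above can be sketched in base R, without seriation itself. This is only an illustration of the idea, not the code in the pull request: `named_order()` is a hypothetical helper, and `stats::hclust()` and a plain integer vector from `stats::cmdscale()` are stand-ins for the two kinds of ser_permutation_vector.

``` r
# Hypothetical helper illustrating the idea; not the package's implementation.
named_order <- function(x, labels = NULL) {
  if (inherits(x, "hclust")) {
    # hclust objects carry their own labels: reorder them by the leaf order
    ord <- x$order
    names(ord) <- x$labels[x$order]
    return(ord)
  }
  # plain integer permutation: attach the original labels, reordered
  ord <- as.integer(x)
  if (!is.null(labels)) names(ord) <- labels[ord]
  ord
}

# hclust case
hc <- hclust(eurodist, method = "average")
ord_hc <- named_order(hc)

# integer case: order along a one-dimensional classical MDS configuration
mds1 <- cmdscale(eurodist, k = 1)[, 1]
ord_int <- named_order(order(mds1), labels = labels(eurodist))

# in both cases, each name is the label found at the position given by its integer
identical(names(ord_hc), labels(eurodist)[ord_hc])
identical(names(ord_int), labels(eurodist)[ord_int])
```

The final two checks express the alignment property that matters here: a name must always be the label of the object its integer points to.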

I added tests to test-seriate.R to check that this doesn't break anything and that the returned names are correctly aligned with the integer order. All tests pass on my M1 Mac and on Windows (via https://win-builder.r-project.org/), and I can't see any obvious reason why adding names to the order integers would break existing code that uses seriation.

Happy to hear any feedback, and thanks again for your work on this package.

David

Demonstration of new behaviour (see below for old behaviour):

``` r
# remotes::install_github("david-barnett/seriation@master")
library(seriation)
packageVersion("seriation")
#> [1] '1.4.0.9000'

class(eurodist)
#> [1] "dist"
pimage(eurodist)
```

``` r
# methods like MDS produce integer ser_permutation_vector objects, now with names!
mds <- seriate(eurodist, method = "MDS")
class(mds[[1]])
#> [1] "ser_permutation_vector" "integer"
get_order(mds) # now with names
#>       Gibraltar          Lisbon          Madrid       Barcelona       Cherbourg 
#>               9              12              14               2               5 
#>      Marseilles           Lyons           Paris          Calais          Geneva 
#>              15              13              18               4               8 
#>        Brussels Hook of Holland           Milan         Cologne         Hamburg 
#>               3              11              16               6              10 
#>          Munich      Copenhagen            Rome       Stockholm          Vienna 
#>              17               7              19              20              21 
#>          Athens 
#>               1

# methods like OLO produce hclust ser_permutation_vector objects, get_order now returns names!
olo <- seriate(eurodist, method = "olo")
class(olo[[1]])
#> [1] "ser_permutation_vector" "hclust"
get_order(olo) # names of integer order are obtained by ordering hclust labels
#>          Athens            Rome       Gibraltar          Lisbon          Madrid 
#>               1              19               9              12              14 
#>       Barcelona      Marseilles           Lyons          Geneva           Milan 
#>               2              15              13               8              16 
#>          Munich          Vienna         Cologne Hook of Holland        Brussels 
#>              17              21               6              11               3 
#>       Cherbourg           Paris          Calais         Hamburg      Copenhagen 
#>               5              18               4              10               7 
#>       Stockholm 
#>              20
identical(x = names(get_order(olo)), y = olo[[1]]$labels[olo[[1]]$order])
#> [1] TRUE

plot(olo[[1]])
```

``` r
# can permute eurodist after another small bug fix (not previously possible due to missing Diag attribute)
class(permute(eurodist, mds))
#> [1] "dist"
pimage(x = eurodist, order = mds)
```

Created on 2022-10-30 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Ventura 13.0
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_GB.UTF-8
#>  ctype    en_GB.UTF-8
#>  tz       Europe/Amsterdam
#>  date     2022-10-30
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  ca            0.71.1     2020-01-24 [1] CRAN (R 4.2.0)
#>  cli           3.4.1      2022-09-23 [1] CRAN (R 4.2.0)
#>  codetools     0.2-18     2020-11-04 [1] CRAN (R 4.2.1)
#>  colorspace    2.0-3      2022-02-21 [1] CRAN (R 4.2.0)
#>  curl          4.3.2      2021-06-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.29     2021-12-01 [1] CRAN (R 4.2.0)
#>  evaluate      0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach       1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs            1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3      2022-07-18 [1] CRAN (R 4.2.0)
#>  httr          1.4.3      2022-05-04 [1] CRAN (R 4.2.0)
#>  iterators     1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  knitr         1.40       2022-08-24 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  mime          0.12       2021-09-28 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1      2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.5      2022-10-06 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0     2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1      2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0     2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0     2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  registry      0.5-1      2019-03-05 [1] CRAN (R 4.2.0)
#>  rematch2      2.1.2      2020-05-01 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1      2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.16       2022-08-24 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  seriation   * 1.4.0.9000 2022-10-30 [1] Github (david-barnett/seriation@6cdffee)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8      2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       1.4.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  styler        1.7.0      2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.8      2022-07-22 [1] CRAN (R 4.2.0)
#>  TSP           1.2-1      2022-07-14 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.2      2022-09-29 [1] CRAN (R 4.2.0)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3      2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5      2022-02-21 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Old behaviour demonstrated

``` r
# install.packages("seriation") # get cran version
library(seriation)
packageVersion("seriation")
#> [1] '1.4.0'

# examples use built-in R distance dataset eurodist
class(eurodist)
#> [1] "dist"
pimage(eurodist)
```

``` r
# methods like MDS produce integer ser_permutation_vector objects, with no names
mds <- seriate(eurodist, method = "MDS")
class(mds[[1]])
#> [1] "ser_permutation_vector" "integer"
get_order(mds) # no names
#>  [1]  9 12 14  2  5 15 13 18  4  8  3 11 16  6 10 17  7 19 20 21  1
names(get_order(mds))
#> NULL

# methods like OLO produce hclust ser_permutation_vector objects, with no names
olo <- seriate(eurodist, method = "olo")
class(olo[[1]])
#> [1] "ser_permutation_vector" "hclust"
get_order(olo) # no names
#>  [1]  1 19  9 12 14  2 15 13  8 16 17 21  6 11  3  5 18  4 10  7 20
names(get_order(olo))
#> NULL
olo[[1]]$labels[olo[[1]]$order] # labels and order can be used
#>  [1] "Athens"          "Rome"            "Gibraltar"       "Lisbon"         
#>  [5] "Madrid"          "Barcelona"       "Marseilles"      "Lyons"          
#>  [9] "Geneva"          "Milan"           "Munich"          "Vienna"         
#> [13] "Cologne"         "Hook of Holland" "Brussels"        "Cherbourg"      
#> [17] "Paris"           "Calais"          "Hamburg"         "Copenhagen"     
#> [21] "Stockholm"

# note: a different small bug
permute(eurodist, mds) # can't permute eurodist as it lacks Diag and Upper attributes
#> Error in attr(x, "Diag") || attr(x, "Upper"): invalid 'x' type in 'x || y'
pimage(x = eurodist, order = mds)
#> Error in attr(x, "Diag") || attr(x, "Upper"): invalid 'x' type in 'x || y'
```

Created on 2022-10-30 by the reprex package (v2.0.1)
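The `x || y` error in the old-behaviour demonstration can be reproduced in base R on any dist-like object that lacks the optional attributes. The sketch below uses a hand-built object as a stand-in, and the `isTRUE()` guard is only one possible way to avoid the error; it is an assumption and not necessarily the fix made in the pull request.

``` r
# A minimal dist-like object without Diag/Upper attributes (stand-in for eurodist).
d <- structure(c(1, 2, 3), Size = 3L, Labels = c("a", "b", "c"), class = "dist")

# attr() returns NULL for a missing attribute, and NULL is not a valid operand of ||
res <- tryCatch(
  attr(d, "Diag") || attr(d, "Upper"),
  error = function(e) "error"
)
res

# Possible guard (assumption): isTRUE() maps NULL to FALSE,
# so the condition is always a single logical value.
has_diag_or_upper <- isTRUE(attr(d, "Diag")) || isTRUE(attr(d, "Upper"))
has_diag_or_upper
```

With the guard, a missing attribute is simply treated as `FALSE`, which matches how `dist` objects are usually printed when `Diag` and `Upper` are not set.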

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Ventura 13.0
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_GB.UTF-8
#>  ctype    en_GB.UTF-8
#>  tz       Europe/Amsterdam
#>  date     2022-10-30
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  ca            0.71.1     2020-01-24 [1] CRAN (R 4.2.0)
#>  cli           3.4.1      2022-09-23 [1] CRAN (R 4.2.0)
#>  codetools     0.2-18     2020-11-04 [1] CRAN (R 4.2.1)
#>  colorspace    2.0-3      2022-02-21 [1] CRAN (R 4.2.0)
#>  curl          4.3.2      2021-06-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.29     2021-12-01 [1] CRAN (R 4.2.0)
#>  evaluate      0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach       1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs            1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3      2022-07-18 [1] CRAN (R 4.2.0)
#>  httr          1.4.3      2022-05-04 [1] CRAN (R 4.2.0)
#>  iterators     1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  knitr         1.40       2022-08-24 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  mime          0.12       2021-09-28 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1      2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.5      2022-10-06 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0     2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1      2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0     2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0     2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  registry      0.5-1      2019-03-05 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1      2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.16       2022-08-24 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  seriation   * 1.4.0      2022-10-21 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8      2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       1.4.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  styler        1.7.0      2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.8      2022-07-22 [1] CRAN (R 4.2.0)
#>  TSP           1.2-1      2022-07-14 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.2      2022-09-29 [1] CRAN (R 4.2.0)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3      2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5      2022-02-21 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

mhahsler commented 1 year ago

Hi David,

I considered adding names initially, but the permutation vectors took so much space to print with the names that I left them out. I think providing them consistently is a good idea, and I will add your commits to the code base. I had to disable snapshot testing, since whether a method returns an order or its reverse is not defined and depends on many things, including the random number generator.

Thank you for your contribution.
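Since an order and its reverse are equally valid results, one way to test without snapshots is to compare orders up to reversal. The following is a minimal base-R sketch of that idea, not the test code used in the package; `same_order_up_to_reversal()` is a hypothetical helper, and it does not help with methods whose result also depends on the random number generator.

``` r
# Hypothetical helper: TRUE if `actual` equals `expected` or its reverse.
# Names are dropped so that only the integer order is compared.
same_order_up_to_reversal <- function(actual, expected) {
  actual <- unname(actual)
  expected <- unname(expected)
  identical(actual, expected) || identical(actual, rev(expected))
}

expected <- c(3L, 1L, 4L, 2L)

same_order_up_to_reversal(c(3L, 1L, 4L, 2L), expected) # same order
same_order_up_to_reversal(c(2L, 4L, 1L, 3L), expected) # reversed order
same_order_up_to_reversal(c(1L, 2L, 3L, 4L), expected) # a different order
```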

stephenturner commented 10 months ago

I wonder if this is causing a test failure.

https://cran.r-project.org/web/checks/check_results_seriation.html

I got a note from CRAN today that my package will be archived because seriation is failing a test:

```
  ══ Skipped tests (1) ═══════════════════════════════════════════════════════════
  • On CRAN (1): 'test-zzz_seriate_extra.R:38:1'

  ══ Failed tests ════════════════════════════════════════════════════════════════
  ── Failure ('test-seriate.R:267:3'): test if seriate.dist returns expected results ──
  Seriation method MDS_angle does not return the correct order!
   is not TRUE

  `actual`:   FALSE
  `expected`: TRUE 

  [ FAIL 1 | WARN 0 | SKIP 1 | PASS 327 ]
  Error: Test failures
  Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking re-building of vignette outputs ... [45s/176s] OK
* checking PDF version of manual ... [12s/49s] OK
* checking HTML version of manual ... [8s/29s] OK
* checking for non-standard things in the check directory ... OK
* checking for detritus in the temp directory ... OK
* DONE

Status: 1 ERROR
See
  ‘/data/gannet/ripley/R/packages/tests-OpenBLAS/seriation.Rcheck/00check.log’
for details.
```

mhahsler commented 10 months ago

Hi, sorry about that. The source is indeed a test in seriation. I am working on figuring out why eigen() produces such different results on OpenBLAS that the resulting seriation order is different. Maybe I need to change the example so that a small numerical change does not change the permutation. I will resolve this in the next few days.

Regards, Michael
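One generic reason an eigenvector-based order can differ between linear algebra libraries is that an eigenvector is only defined up to sign, so `v` and `-v` are both valid results. Whether this is what happens with MDS_angle on OpenBLAS is not established in this thread; the base-R sketch below only illustrates the general effect on a small made-up matrix.

``` r
# Small symmetric matrix, a made-up stand-in for the matrix an MDS-type method decomposes.
m <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 1), nrow = 3, byrow = TRUE)

e <- eigen(m, symmetric = TRUE)
v <- e$vectors[, 1]      # eigenvector of the largest eigenvalue
lambda <- e$values[1]

# v and -v both satisfy m %*% v = lambda * v ...
all.equal(drop(m %*% v), lambda * v)
all.equal(drop(m %*% -v), lambda * -v)

# ... but ordering the objects by -v gives the reverse of ordering by v
ord_pos <- order(v)
ord_neg <- order(-v)
identical(ord_neg, rev(ord_pos))
```

A test that fixes the sign convention first, or that accepts an order and its reverse, is not affected by this particular ambiguity.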