Offer new format to support caching properly serialized torch objects

mattwarkentin commented 4 years ago

Prework

[x] Read and agree to the code of conduct and contributing guidelines.
[x] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
[x] New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
[x] Format your code according to the tidyverse style guide.

Proposal

Hi @wlandau,

With the recent release of {torch} for R, I thought it might be useful to offer a special storage format for torch models and objects, similar to what is offered for keras. I expect torch to see quite a bit of use for deep learning going forward, since it provides native bindings to the libtorch C++ library without requiring any Python wrappers, so I think supporting it would enhance targets.

Torch objects are really just pointers to C++ objects in memory, so they can't be serialized normally: you basically just serialize the pointer, and the actual object probably won't persist. Existing serialization methods for standard R objects in targets, such as saveRDS() and qs::qsave(), won't work.

torch::torch_save() and torch::torch_load() seem to do the work of properly serializing and loading torch objects and would probably give you what you need to offer a torch format.
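A minimal sketch of the file-based round trip described above, assuming the torch package is installed. The tensor values and the temporary file are placeholders chosen only for illustration.

```r
library(torch)

# A small tensor to save (illustrative values).
x <- torch_tensor(c(1, 2, 3))

# Write the tensor to disk with torch's own serializer.
path <- tempfile()
torch_save(x, path)

# Read it back and compare the underlying R arrays.
y <- torch_load(path)
identical(as_array(x), as_array(y))
```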

wlandau commented 4 years ago

Definitely a great idea and something I would love to support. Reading and writing objects to storage does get us most of the way there. However, there is one more requirement: an injective transformation to serialize exportable objects in memory. keras does this with serialize_model() and unserialize_model(), and it ensures we can send data to and from distributed/parallel workers over a network.

https://github.com/wlandau/targets/blob/cb95fbdaa646d9c900e810a491d691db482f10d3/R/class_keras.R#L27-L35
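For ordinary R objects, the in-memory requirement described above is already met by base R: serialize() with connection = NULL returns a raw vector, and unserialize() recovers the original object. The sketch below only illustrates that property for a plain list (a made-up example object); it is not the targets implementation.

```r
# An ordinary R object that base R can serialize on its own.
x <- list(a = 1:3, b = "text")

# Serialize to a raw vector in memory (no file involved).
bytes <- serialize(x, connection = NULL)
is.raw(bytes)

# Unserialize and confirm nothing was lost in the round trip.
restored <- unserialize(bytes)
identical(x, restored)
```

A torch format needs an equivalent pair of functions that works for objects backed by C++ memory.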

Predictably, I cannot simply serialize torch objects to raw vectors in base R, and I fear torch::as_array() is not injective. (We lose information when we try to transform back.)

library(torch)
x <- array(runif(8), dim = c(2, 2, 2))
original <- torch_tensor(x, dtype = torch_float64())
# Try to serialize to raw.
tmp <- tempfile()
torch_save(original, tmp)
raw <- readBin(tmp, what = "raw")
print(raw)
#> [1] 1f
# Try to unserialize from raw.
tmp <- tempfile()
writeBin(raw, tmp)
out <- torch_load(tmp)
#> Error in readRDS(path): error reading from connection

Created on 2020-09-30 by the reprex package (v0.3.0)

So I will definitely keep an eye on this, but I am closing the issue until we have https://github.com/mlverse/torch/issues/270 or a workaround.

mattwarkentin commented 4 years ago

Hmm, interesting. Well, I'm glad I could put this on your radar and hopefully in the future there will be a solution on the torch side of things that will allow targets to offer support for this feature.

wlandau commented 4 years ago

Wait a minute: I'm totally wrong about https://github.com/wlandau/targets/issues/179#issuecomment-701703681. I forgot that the n argument of readBin() weirdly defaults to 1 instead of the size of the file. Reopening.

library(torch)
x <- array(runif(8), dim = c(2, 2, 2))
original <- torch_tensor(x, dtype = torch_float64())
# Serialize to raw.
tmp <- tempfile()
torch_save(original, tmp)
raw <- readBin(tmp, what = "raw", n = file.size(tmp))
# Unserialize from raw.
tmp <- tempfile()
writeBin(raw, tmp)
torch_load(tmp)
#> torch_tensor 
#> (1,.,.) = 
#>   0.8850  0.2682
#>   0.9796  0.9439
#> 
#> (2,.,.) = 
#>   0.2353  0.4076
#>   0.1360  0.8992
#> [ CPUDoubleType{2,2,2} ]

Created on 2020-09-30 by the reprex package (v0.3.0)
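The readBin() default mentioned above can be reproduced without torch. This sketch writes ten arbitrary bytes to a temporary file and reads them back with and without the n argument.

```r
# Write ten bytes to a temporary file.
tmp <- tempfile()
writeBin(as.raw(1:10), tmp)

# Without n, readBin() reads a single element.
one <- readBin(tmp, what = "raw")
length(one)
#> [1] 1

# With n set to the file size, the whole file is read.
all_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
length(all_bytes)
#> [1] 10
```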

dfalbel commented 4 years ago

For reference, you can skip the 'save to temp file step' with something like:

library(torch)
x <- array(runif(8), dim = c(2, 2, 2))
original <- torch_tensor(x, dtype = torch_float64())

# Serialize to a raw vector through an in-memory connection.
con <- rawConnection(raw(), open = "wr")
torch_save(original, con)
r <- rawConnectionValue(con)
close(con)

# Unserialize straight from the raw vector.
torch_load(rawConnection(r, open = "r"))
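The in-memory approach above could be wrapped into a pair of helpers, similar in spirit to the keras serialize_model() and unserialize_model() pair mentioned earlier. This is only a sketch: torch_to_raw() and torch_from_raw() are hypothetical names, not functions from torch or targets, and it assumes torch_save() and torch_load() accept connections as in the snippet above.

```r
library(torch)

# Hypothetical helper: turn a torch object into a raw vector in memory.
torch_to_raw <- function(object) {
  con <- rawConnection(raw(), open = "wr")
  on.exit(close(con))
  torch_save(object, con)
  rawConnectionValue(con)
}

# Hypothetical helper: rebuild a torch object from a raw vector.
torch_from_raw <- function(bytes) {
  con <- rawConnection(bytes, open = "r")
  on.exit(close(con))
  torch_load(con)
}

# Round trip: the raw vector can be sent to a worker and restored there.
original <- torch_tensor(array(runif(8), dim = c(2, 2, 2)), dtype = torch_float64())
bytes <- torch_to_raw(original)
restored <- torch_from_raw(bytes)
identical(as_array(original), as_array(restored))
```

Closing each connection with on.exit() avoids leaking connections when the helpers are called many times.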

wlandau commented 4 years ago

Even better, thanks!

wlandau commented 4 years ago

Implemented 2 new formats in https://github.com/wlandau/targets/commit/1d021cdb90cacb47dc0fe452b0e3e77d69f6adae and https://github.com/wlandau/targets/commit/549b21e1bb3c4d049f6a47b15fefc06c93b4163a:

"torch": local storage in _targets/objects/
"aws_torch": cloud storage to Amazon S3

Loving how I can implement and test without a Python env!
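A minimal sketch of how the "torch" format might be selected in a pipeline, assuming tar_target() takes the format name announced above through its format argument. The target name weights and its command are made up for illustration, and later versions of targets may handle these formats differently.

```r
# _targets.R (illustrative pipeline script)
library(targets)

# Make torch available when targets are built.
tar_option_set(packages = "torch")

list(
  # Hypothetical target: store a tensor using the "torch" format
  # so it is saved and loaded with torch's own serializer.
  tar_target(
    weights,
    torch_tensor(array(runif(8), dim = c(2, 2, 2))),
    format = "torch"
  )
)
```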

mattwarkentin commented 4 years ago

Amazing! Glad it all worked out.

> Loving how I can implement and test without a Python env!

Haha, agreed. This is exactly why I think torch is going to gain favour in the R community compared to tensorflow/keras implementations.
