torch fails on new Mac M3 architecture

gilbertocamara commented 6 months ago

Dear @dfalbel, I have bought a new MacBook Air with the M3 chip, which has 8 CPU cores, 10 GPU cores and 16 GB of integrated memory. My R torch apps are crashing. I have put together an MWE which works on all other architectures, including a MacBook Air M1 and a Mac mini. The OS is the same (Sonoma 14.5). The MWE follows:

# ==== MWE

# Download the training samples
rds_file <- "https://raw.githubusercontent.com/e-sensing/sitsdata/master/inst/extdata/torch/train_samples.rds?raw=true"
dest_file <- paste0(tempdir(),"/train_samples.rds")
download.file(rds_file,
              destfile = dest_file,
              method = "curl")
train_samples <- readRDS(dest_file)

# Sample labels
labels <- c("Cerrado", "Forest", "Pasture", "Soy_Corn")

# Create numeric labels vector
code_labels <- seq_along(labels)
names(code_labels) <- labels

# Split the data into training and validation data sets
# Create different partitions of the input data
frac <- 0.2
train_samples <- dplyr::group_by(train_samples, .data[["label"]])
test_samples <- train_samples |>
    dplyr::slice_sample(prop = frac) |>
    dplyr::ungroup()

# Remove the lines used for validation
sel <- !train_samples[["sample_id"]] %in% test_samples[["sample_id"]]
train_samples <- train_samples[sel, ]

# Shuffle the data
train_samples <- train_samples[sample(nrow(train_samples), nrow(train_samples)), ]
test_samples <- test_samples[sample(nrow(test_samples), nrow(test_samples)), ]

# Organize data for model training
train_x <- as.matrix(train_samples[, -2:0])
train_y <- unname(code_labels[train_samples[["label"]]])

# Create the test data
test_x <- as.matrix(test_samples[, -2:0])
test_y <- unname(code_labels[test_samples[["label"]]])

# Set torch seed
torch::torch_manual_seed(sample.int(10^5, 1))

# Avoid a global variable for 'self'
self <- NULL

# function to create a simple sequential NN module
.torch_linear_relu_dropout <- torch::nn_module(
    classname = "torch_linear_relu_dropout",
    initialize = function(input_dim,
                          output_dim,
                          dropout_rate) {
        self$block <- torch::nn_sequential(
            torch::nn_linear(input_dim, output_dim),
            torch::nn_relu(),
            torch::nn_dropout(dropout_rate)
        )
    },
    forward = function(x) {
        self$block(x)
    }
)

# Define the MLP architecture
mlp_model <- torch::nn_module(
    initialize = function(num_pred, layers, dropout_rates, y_dim) {
        tensors <- list()
        # input layer
        tensors[[1]] <- .torch_linear_relu_dropout(
            input_dim = num_pred,
            output_dim = 512,
            dropout_rate = 0.40
        )
        # output layer
        tensors[[length(tensors) + 1]] <-
            torch::nn_linear(layers[length(layers)], y_dim)
        # add softmax tensor
        tensors[[length(tensors) + 1]] <- torch::nn_softmax(dim = 2)
        # create a sequential module that calls the layers in the same
        # order.
        self$model <- torch::nn_sequential(!!!tensors)
    },
    forward = function(x) {
        self$model(x)
    }
)
# Train the model using luz

torch_model <- luz::setup(
    module = mlp_model,
    loss = torch::nn_cross_entropy_loss(),
    metrics = list(luz::luz_metric_accuracy()),
    optimizer = torch::optim_adamw
)
torch_model <- luz::set_hparams(
    torch_model,
    num_pred = ncol(train_x),
    layers = 512,
    dropout_rates = 0.3,
    y_dim = length(code_labels)
)
torch_model <- luz::set_opt_hparams(
    torch_model,
    lr = 0.001,
    eps = 1e-08,
    weight_decay = 1.0e-06
)
torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
    verbose = TRUE
)

The error occurs in the luz::fit function. Inside RStudio, the code gets stuck and then RStudio asks to restart R. When running R from the terminal, the output is:

 *** caught bus error ***
address 0x16daa0000, cause 'invalid alignment'

 *** caught segfault ***
address 0x9, cause 'invalid permissions'
zsh: segmentation fault  R

The sessionInfo() output is as follows:


R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         zeallot_0.1.0    
 [5] rlang_1.1.3       processx_3.8.4    generics_0.1.3    torch_0.12.0.9000
 [9] coro_1.0.4        glue_1.7.0        bit_4.0.5         prettyunits_1.2.0
[13] luz_0.4.0         ps_1.7.6          hms_1.1.3         fansi_1.0.6      
[17] tibble_3.2.1      progress_1.2.3    lifecycle_1.0.4   compiler_4.4.0   
[21] dplyr_1.1.4       fs_1.6.4          Rcpp_1.0.12       pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
[29] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    tools_4.4.0      
[33] bit64_4.0.5

dfalbel commented 6 months ago

Can you show me the output of torch::install_torch(reinstall = TRUE)? Also, I'm assuming it doesn't fail if you run e.g. torch_randn(10)?

gilbertocamara commented 6 months ago

Sure!

torch::install_torch(reinstall = TRUE)
trying URL 'https://github.com/mlverse/libtorch-mac-m1/releases/download/LibTorch-for-R/libtorch-v2.0.1.zip'
Content type 'application/octet-stream' length 49631992 bytes (47.3 MB)
==================================================
downloaded 47.3 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.12.0.9000+cpu+arm64-Darwin.zip'
Content type 'application/zip' length 3602457 bytes (3.4 MB)
==================================================
downloaded 3.4 MB

✔ torch dependencies have been installed.
ℹ You must restart your session to use torch correctly.

Running a simple command such as torch_randn(10) works.

torch::torch_randn(10)
torch_tensor
 0.8753
 0.9061
-1.8905
-0.2683
-0.4204
-0.3306
 1.1119
 0.0052
 0.3246
-0.2530
[ CPUFloatType{10} ]

torch can also access MPS on the M3. The following works.

x <- torch::torch_randn(10, 10)$to(device="mps")
y <- torch::torch_randn(10, 10)$to(device="mps")

torch::torch_mm(x, y)

The problems appear in the luz::fit() function. We compiled the lantern library from source and tried to install it as follows.

# compiled lantern from source and configured env variables as follows
devtools::install(build = FALSE)
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /Users/gilberto/torch --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library’
* installing *source* package ‘torch’ ...
** using staged installation
CMAKE_FLAGS: 
** libs
using C++ compiler: ‘Apple clang version 15.0.0 (clang-1500.3.9.4)’
using SDK: ‘MacOSX14.4.sdk’
*** Building lantern!
mkdir -p ../build-lantern
cd ../build-lantern && cmake ../src/lantern -DCMAKE_INSTALL_PREFIX=/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/00LOCK-torch/00new/torch -DCMAKE_INSTALL_MESSAGE="LAZY"  && cmake --build . --target install --config Release
### Lots of output...
-- Build files have been written to: /Users/gilberto/torch/build-lantern

## We then configured the env variables
Sys.setenv(LANTERN_URL="/Users/gilberto/torch/build-lantern")
Sys.setenv(TORCH_URL="/Users/gilberto/torch/build-lantern/libtorch")
## We then tried to install torch after this, but it fails

Either there is a problem with the lantern code when using M3, or we have failed to install correctly after compiling from source.

dfalbel commented 6 months ago

You might want to try setting the env var BUILD_LANTERN=1 then running remotes::install_github("mlverse/torch") to build lantern from source. Although, I don't think lantern is the culprit here, as it's just a relatively thin wrapper around LibTorch. You might also need to build LibTorch from source.
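
A rough sketch of that suggestion, assuming the environment variable only needs to be set in the same R session before the install starts. The wrapper name build_torch_from_source and its install flag are made up for illustration; only BUILD_LANTERN and remotes::install_github("mlverse/torch") come from the comment above.

```r
# Hypothetical helper (not part of torch): set BUILD_LANTERN so that the
# install compiles lantern from source, then install torch from GitHub.
build_torch_from_source <- function(install = TRUE) {
  Sys.setenv(BUILD_LANTERN = "1")
  if (install) {
    # Needs the 'remotes' package and a working C++ toolchain with cmake.
    remotes::install_github("mlverse/torch")
  }
  # Return the value now visible to child processes of this session.
  invisible(Sys.getenv("BUILD_LANTERN"))
}
```

Calling build_torch_from_source() runs the full build; passing install = FALSE only sets the variable, which is handy for checking the session state first.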

dfalbel commented 6 months ago

Also, have you tried installing the pre-built binaries, e.g. with:

kind <- "cpu"
version <- "0.12.0.9000"
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other from which you want to install the other R dependencies.
))
install.packages("torch", type = "binary")

gilbertocamara commented 6 months ago

Thanks! I have tried, but it failed.

dfalbel commented 6 months ago

Can you also try disabling MPS in luz, just so we can narrow down the problem a little more?

You can do something like:

torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
    verbose = TRUE,
    # luz is not attached in the MWE, so qualify the call
    accelerator = luz::accelerator(cpu = TRUE)
)
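
If the same script has to run on several machines, the CPU fallback in the call above could be made conditional. The following is only a sketch: the helper name is_apple_silicon is hypothetical, and it assumes that Sys.info() reports "Darwin" and "arm64" on M-series Macs. It is deliberately conservative, since it also forces the CPU on M1 machines where MPS reportedly works.

```r
# Hypothetical helper: TRUE when running on an Apple-silicon Mac.
# 'info' defaults to Sys.info() and can be overridden for testing.
is_apple_silicon <- function(info = Sys.info()) {
  identical(info[["sysname"]], "Darwin") &&
    identical(info[["machine"]], "arm64")
}

# Value to pass as 'cpu' when building the luz accelerator.
force_cpu <- is_apple_silicon()
```

The flag can then be used as accelerator = luz::accelerator(cpu = force_cpu) in the luz::fit() call.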

gilbertocamara commented 6 months ago

Works!!! Can we now make luz work on MPS?

dfalbel commented 6 months ago

I think we will need to figure out why torch fails on M3 + MPS for that model. I believe it's possible that you will need to build LibTorch from source to fix this issue.
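
One way to check whether the crash comes from torch itself rather than from luz is to run a few optimisation steps by hand on each device. This is a minimal sketch under stated assumptions: the function name run_steps, the synthetic data and the sizes (64 observations, 20 predictors) are invented stand-ins for train_x and train_y, and the layers only mirror the MWE's MLP.

```r
library(torch)

# Run a few manual training steps of a small MLP on the given device
# and return the last loss value. Synthetic data replaces the MWE samples.
run_steps <- function(device, n_steps = 5, n_obs = 64,
                      n_pred = 20, n_class = 4) {
  # Create the tensors on the CPU, then move them to the target device.
  x <- torch_randn(n_obs, n_pred)$to(device = device)
  y <- torch_tensor(
    sample.int(n_class, n_obs, replace = TRUE),
    dtype = torch_long()
  )$to(device = device)

  # Same layer sequence as the MWE: linear, relu, dropout, linear, softmax.
  model <- nn_sequential(
    nn_linear(n_pred, 512),
    nn_relu(),
    nn_dropout(0.4),
    nn_linear(512, n_class),
    nn_softmax(dim = 2)
  )
  model$to(device = device)

  opt <- optim_adamw(model$parameters, lr = 0.001)
  for (i in seq_len(n_steps)) {
    opt$zero_grad()
    loss <- nnf_cross_entropy(model(x), y)
    loss$backward()
    opt$step()
  }
  loss$item()
}
```

Running run_steps("cpu") first and then, if torch::backends_mps_is_available() returns TRUE, run_steps("mps") from a terminal session would show whether the bus error also appears without luz. If it does, the MPS call may take the R session down with it, so unsaved work should be saved beforehand.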

gilbertocamara commented 6 months ago

How do I build libtorch and liblantern from source?

dfalbel commented 6 months ago

To build LibTorch from source, you can follow the steps in this workflow file:

https://github.com/mlverse/libtorch-mac-m1/blob/main/.github/workflows/libtorch.yaml

Then copy the libtorch files into src/lantern/build and run devtools::load_all() or devtools::install() with BUILD_LANTERN=1 set.
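
The copy step could look roughly like the sketch below. The helper name stage_libtorch and both of its arguments are hypothetical, and it assumes that the built libtorch directory should end up as a subdirectory of src/lantern/build. The comment above does not spell out the exact layout, so this should be checked against the lantern CMake files before relying on it.

```r
# Hypothetical helper: copy a locally built libtorch directory into the
# lantern build folder of a torch source checkout.
#   libtorch_dir: path to the built libtorch directory (assumed)
#   torch_src:    path to the cloned mlverse/torch repository (assumed)
stage_libtorch <- function(libtorch_dir, torch_src) {
  dest <- file.path(torch_src, "src", "lantern", "build")
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  # Copies the directory itself, giving <dest>/<basename(libtorch_dir)>.
  ok <- file.copy(libtorch_dir, dest, recursive = TRUE)
  stopifnot(all(ok))
  invisible(dest)
}
```

After staging, Sys.setenv(BUILD_LANTERN = "1") followed by devtools::install() on the checkout would correspond to the second half of the instruction.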

gilbertocamara commented 6 months ago

Thanks!! I will try

gilbertocamara commented 6 months ago

Dear @dfalbel, we tried to build torch from source, but it did not work on the Mac M3 chip. Looking at the PyTorch GitHub repository, other developers are having similar problems with the new M3 chip. Please see the following issue:

https://github.com/pytorch/pytorch/issues/125803

DenaJGibbon commented 6 months ago

Hello. I had a similar issue, but in my case it appeared after I upgraded to macOS Sonoma 14.4.1 on a Mac M2. I posted on the luz GitHub, but was happy to see some discussion here.

https://github.com/mlverse/luz/issues/143

mlverse / torch

torch fails on new Mac M3 architecture #1167