janestreet / torch

`v0.17~preview.129.36+325` fails to build on M2 Chip Mac w/ `libtorch` 2.3.0, OCaml 5.2.0, `opam` 2.2.1 #14

Open ShunchiZhang opened 2 weeks ago

ShunchiZhang commented 2 weeks ago

I followed the instructions in #2:

  • Download libtorch binaries (or build libtorch on your Mac). At the moment there are no official pre-built binaries. I downloaded my (unofficial) binaries from https://github.com/mlverse/libtorch-mac-m1/releases .
  • Install OCaml >= 4.14 (see here: https://opam.ocaml.org/packages/torch/)
  • Double check what libtorch version is compatible with the current version of OCaml torch. Version 1.13.1 is the one you want with v0.16.0 version of OCaml torch.
  • Set the LIBTORCH environment variable to the directory that includes the include and lib directories.
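The last step is easy to get wrong, since LIBTORCH has to point at the directory that directly contains include and lib. A minimal sketch of a sanity check follows; `check_libtorch_dir` is a hypothetical helper, not part of opam or ocaml-torch, and it only checks that the two subdirectories exist, not that the libtorch version is compatible.

```shell
#!/bin/sh
# Hypothetical helper (not part of opam or ocaml-torch): check that a
# directory has the include/ and lib/ subdirectories expected of LIBTORCH.
check_libtorch_dir() {
  dir="${1%/}"   # drop one trailing slash, if any
  if [ -d "$dir/include" ] && [ -d "$dir/lib" ]; then
    echo "ok"
  else
    echo "missing include/ or lib/ under $dir" >&2
    return 1
  fi
}

# Usage:
#   check_libtorch_dir "$LIBTORCH"
```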

To install with opam:

opam install torch.v0.16.0 --ignore-constraints-on libtorch

I ran:

cd /opt/libtorch
wget https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.3.0.zip
unzip libtorch-macos-arm64-2.3.0.zip -d v2.3.0
cat /opt/libtorch/v2.3.0/libtorch/build-version # 2.3.0

LIBTORCH=/opt/libtorch/v2.3.0/libtorch/ opam install torch.v0.17.0 --ignore-constraints-on libtorch

and got the following error:

```
➜ LIBTORCH=/opt/libtorch/v2.3.0/libtorch/ opam install torch.v0.17.0 --ignore-constraints-on libtorch
The following actions will be performed:
=== install 1 package
  ∗ torch v0.17.0

<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><>  🐫
⬇ retrieved torch.v0.17.0  (cached)
[ERROR] The compilation of torch.v0.17.0 failed at "dune build -p torch -j 7".

#=== ERROR while compiling torch.v0.17.0 ======================================#
# context     2.2.1 | macos/arm64 | ocaml-base-compiler.5.2.0 | https://opam.ocaml.org#6383bc5431ca714c10b4e29dbf7eda9572a4ac07
# path        ~/.opam/5.2.0/.opam-switch/build/torch.v0.17.0
# command     ~/.opam/opam-init/hooks/sandbox.sh build dune build -p torch -j 7
# exit-code   1
# env-file    ~/.opam/log/torch-12746-231689.env
# output-file ~/.opam/log/torch-12746-231689.out
### output ###
# torch_stubs_generated.c:31309:43: warning: passing 'const char *' to parameter of type 'char *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
# [...]
# ^~~~~~
# ./torch_api_generated.h:2014:73: note: passing argument to parameter 'reduce' here
# raw_tensor atg_segment_reduce_out(gc_tensor out, gc_tensor data, char * reduce, gc_tensor lengths, gc_tensor indices, gc_tensor offsets, int64_t axis, int unsafe, scalar initial);
# ^
# torch_stubs_generated.c:35007:28: warning: passing 'const char *' to parameter of type 'char *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
# x33586, x33589, x33590, x33593, x33596);
# ^~~~~~
# ./torch_api_generated.h:2336:182: note: passing argument to parameter 'pad_mode' here
# raw_tensor atg_stft_center(gc_tensor self, int64_t n_fft, int64_t hop_length_v, int hop_length_null, int64_t win_length_v, int win_length_null, gc_tensor window, int center, char * pad_mode, int normalized, int onesided, int return_complex);
# ^
# 125 warnings and 1 error generated.

<><> Error report <><><><><><><><><><><><><><><><><><><><><><><><><><><><><>  🐫
┌─ The following actions failed
│ λ build torch v0.17.0
└─
╶─ No changes have been performed

<><> torch.v0.17.0 troubleshooting ><><><><><><><><><><><><><><><><><><><><>  🐫
=> Installation of ocaml-torch failed. This likely happened because there is no system installation of libtorch to compile OCaml bindings against. Please instal the CPU version of libtorch through opam, or the appropriate version of libtorch for your GPU through the official distribution.
```

All of this was done on my MacBook Air with an M2 chip. The opam version and switch list are as follows:

➜ opam --version
2.2.1
➜ opam switch list  
#  switch   compiler                                           description
→  5.2.0    ocaml-base-compiler.5.2.0,ocaml-options-vanilla.1  ocaml-base-compiler = 5.2.0 | ocaml-system = 5.2.0
   default  ocaml-base-compiler.5.2.0,ocaml-options-vanilla.1  ocaml >= 4.05.0

Thank you for helping me to resolve this issue :)

ShunchiZhang commented 2 weeks ago

TL;DR: same error with libtorch v2.1.0

Although the current README indicates the compatible version of torch is v2.3:

https://github.com/janestreet/torch/blob/e4d20dea8df4fedeabcf22fd32149ff58108a652/README.md?plain=1#L8

I notice that on the opam package page for torch.v0.17.0, libtorch <2.1.0 | >=2.2.0 is listed as a conflict. But my attempt with libtorch v2.1.0 (from mlverse/libtorch-mac-m1, as there is no official build until v2.2.0) also fails with the same error as above.
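Read literally, that conflict rule leaves only the 2.1.x series acceptable to the opam metadata, which is why the README's 2.3 needs --ignore-constraints-on. A minimal sketch of the rule as a version check follows; the function name is hypothetical, and it encodes only the constraint as quoted here, not anything the build itself verifies.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the given libtorch version avoids
# the conflict "libtorch < 2.1.0 | >= 2.2.0", i.e. when it is 2.1.x.
libtorch_in_opam_range() {
  ver="$1"
  major="${ver%%.*}"     # text before the first dot
  rest="${ver#*.}"       # text after the first dot
  minor="${rest%%.*}"    # second component
  [ "$major" -eq 2 ] && [ "$minor" -eq 1 ]
}

# Usage:
#   libtorch_in_opam_range 2.3.0 || echo "needs --ignore-constraints-on libtorch"
```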

ShunchiZhang commented 2 weeks ago

TL;DR: same error installing v0.16.0 with libtorch v1.13.1

I just tried to build v0.16.0 with libtorch v1.13.1, but hit the same error again:

```
[ERROR] The compilation of torch.v0.16.0 failed at "dune build -p torch -j 7".

#=== ERROR while compiling torch.v0.16.0 ======================================#
# context     2.2.1 | macos/arm64 | ocaml-base-compiler.5.2.0 | https://opam.ocaml.org#6383bc5431ca714c10b4e29dbf7eda9572a4ac07
# path        ~/.opam/5.2.0/.opam-switch/build/torch.v0.16.0
# command     ~/.opam/opam-init/hooks/sandbox.sh build dune build -p torch -j 7
# exit-code   1
# env-file    ~/.opam/log/torch-32153-10e3e4.env
# output-file ~/.opam/log/torch-32153-10e3e4.out
### output ###
# torch_stubs.c:34650:51: warning: passing 'const char *' to parameter of type 'char *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
# [...]
# ^~~~~~
# ./torch_api_generated.h:1982:71: note: passing argument to parameter 'reduce' here
# void atg_segment_reduce_out(tensor *, tensor out, tensor data, char * reduce, tensor lengths, tensor indices, tensor offsets, int64_t axis, int unsafe, scalar initial);
# ^
# torch_stubs.c:38971:36: warning: passing 'const char *' to parameter of type 'char *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
# x37501, x37502, x37505, x37506, x37509, x37512);
# ^~~~~~
# ./torch_api_generated.h:2300:180: note: passing argument to parameter 'pad_mode' here
# void atg_stft_center(tensor *, tensor self, int64_t n_fft, int64_t hop_length_v, int hop_length_null, int64_t win_length_v, int win_length_null, tensor window, int center, char * pad_mode, int normalized, int onesided, int return_complex);
# ^
# 121 warnings and 1 error generated.

<><> Error report <><><><><><><><><><><><><><><><><><><><><><><><><><><><><>  🐫
┌─ The following actions failed
│ λ build torch v0.16.0
```

mwlon commented 2 weeks ago

I think you'll need 0.17.0 + libtorch 2.3 (see README for current libtorch version).

Since we don't publish mac releases of the libtorch opam package anymore, you'll need to uninstall opam torch (and opam libtorch if you have it), download the binaries manually (options 2-4 in README), set the corresponding environment variable, and reinstall opam torch. Do not install opam libtorch.

ShunchiZhang commented 2 weeks ago

> I think you'll need 0.17.0 + libtorch 2.3 (see README for current libtorch version).
>
> Since we don't publish mac releases of the libtorch opam package anymore, you'll need to uninstall opam torch (and opam libtorch if you have it), download the binaries manually (options 2-4 in README), set the corresponding environment variable, and reinstall opam torch. Do not install opam libtorch.

As I stated above, I have tried the three combinations below with option 4 and hit the same error each time.

| ocaml-torch | libtorch | Reference |
| --- | --- | --- |
| v0.17.0 | v2.3.0 | Current README (e4d20de) |
| v0.17.0 | v2.1.0 | opam Package Page |
| v0.16.0 | v1.13.1 | Solution in Issue #2 |

Besides, there seems to be no `all` target in the Makefile.

arbipher commented 2 weeks ago

Here is the error on my machine (Apple M3) with ocaml-torch v0.17, OCaml 5.2.0, and opam 2.2.1. LIBTORCH is set to /Users/<me>/Library/Python/3.12/lib/python/site-packages/torch, which is version 2.3.1.

The following actions will be performed:
=== install 1 package
  ∗ torch v0.17.0 (pinned)

Proceed with ∗ 1 installation? [y/n] y

<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><>  🐫
⬇ retrieved torch.v0.17.0  (no changes)
[ERROR] The compilation of torch.v0.17.0 failed at "dune build -p torch -j 15".

#=== ERROR while compiling torch.v0.17.0 ======================================#
# context     2.2.1 | macos/arm64 | ocaml.5.2.0 | pinned(git+https://github.com/janestreet/torch.git#e4d20dea8df4fedeabcf22fd32149ff58108a652)
# path        ~/.opam/default/.opam-switch/build/torch.v0.17.0
# command     ~/.opam/opam-init/hooks/sandbox.sh build dune build -p torch -j 15
# exit-code   1
# env-file    ~/.opam/log/torch-25252-f5afb6.env
# output-file ~/.opam/log/torch-25252-f5afb6.out
### output ###
# ./torch_api_generated.h:2385:182: note: passing argument to parameter 'pad_mode' here
# [...]
# 129 warnings and 1 error generated.
# (cd _build/default && /Users/ex/.opam/default/bin/ocamlc.opt -w -40 -g -bin-annot -bin-annot-occurrences -I src/torch/.torch.objs/byte -I /Users/ex/.opam/default/lib/base -I /Users/ex/.opam/default/lib/base/base_internalhash_types -I /Users/ex/.opam/default/lib/base/md5 -I /Users/ex/.opam/default/lib/base/shadow_stdlib -I /Users/ex/.opam/default/lib/base_bigstring -I /Users/ex/.opam/default/l[...]
# File "src/torch/optimizer.ml", line 148, characters 18-40:
# 148 |       let index = Option.value_local_exn index in
#                         ^^^^^^^^^^^^^^^^^^^^^^
# Error: Unbound value "Option.value_local_exn"
# (cd _build/default && /Users/ex/.opam/default/bin/ocamlopt.opt -w -40 -g -I src/torch/.torch.objs/byte -I src/torch/.torch.objs/native -I /Users/ex/.opam/default/lib/base -I /Users/ex/.opam/default/lib/base/base_internalhash_types -I /Users/ex/.opam/default/lib/base/md5 -I /Users/ex/.opam/default/lib/base/shadow_stdlib -I /Users/ex/.opam/default/lib/base_bigstring -I /Users/ex/.opam/default/l[...]
# File "src/torch/optimizer.ml", line 148, characters 18-40:
# 148 |       let index = Option.value_local_exn index in
#                         ^^^^^^^^^^^^^^^^^^^^^^
# Error: Unbound value "Option.value_local_exn"

I see the same error with both opam install torch and opam pin torch https://github.com/janestreet/torch.git.

arbipher commented 2 weeks ago

OK, after refreshing my memory of this code from February and some trial and error, it now builds and runs on my Mac. It needs OCaml 5.1.1 because PyML needs 5.1.1 (due to stdcompat). My PR should be agnostic of this.

> dune exec examples/basics/basics.exe
cuda available: false                  
cudnn available: false
42
[ CPUFloatType{} ]

mwlon commented 2 weeks ago

@ShunchiZhang ah I missed that you had tried that combination already. I've been able to replicate the error now, will try @arbipher's fix

arbipher commented 2 weeks ago

Hi @mwlon

This post is obsolete. See my newer reply.

I found another problem, present with both my fix and the original code: dune build always works, but dune build -p torch (which opam install uses) fails with fatal error: 'torch_api_generated.cpp' file not found.

dune build src/wrapper also works.

It seems torch_api is not specified in any dune file that the torch library can refer to. That is not a problem for plain dune build, presumably because it tries all targets. I have no ideas for fixing this yet.

With dune build -p torch, torch.install never triggers the gen_{bindings,stubs} aliases, so the build fails fast on the 4th subtask.

File "src/wrapper/dune", line 4, characters 9-18:
4 |   (names torch_api)
             ^^^^^^^^^
(cd _build/default/src/wrapper && /usr/bin/cc -std=c++17 -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /Users/ex/Library/Python/3.12/lib/python/site-packages/torch/include -isystem /Users/ex/Library/Python/3.12/lib/python/site-packages/torch/include/torch/csrc/api/include -g -I /Users/ex/.opam/5.1.1/lib/ocaml -I /Users/ex/.opam/5.1.1/lib/base -I /Users/ex/.opam/5.1.1/lib/base/base_internalhash_types -I /Users/ex/.opam/5.1.1/lib/base/md5 -I /Users/ex/.opam/5.1.1/lib/base/shadow_stdlib -I /Users/ex/.opam/5.1.1/lib/base_quickcheck -I /Users/ex/.opam/5.1.1/lib/base_quickcheck/ppx_quickcheck/runtime -I /Users/ex/.opam/5.1.1/lib/bigarray-compat -I /Users/ex/.opam/5.1.1/lib/bin_prot -I /Users/ex/.opam/5.1.1/lib/bin_prot/shape -I /Users/ex/.opam/5.1.1/lib/ctypes -I /Users/ex/.opam/5.1.1/lib/ctypes-foreign -I /Users/ex/.opam/5.1.1/lib/ctypes/stubs -I /Users/ex/.opam/5.1.1/lib/fieldslib -I /Users/ex/.opam/5.1.1/lib/integers -I /Users/ex/.opam/5.1.1/lib/jane-street-headers -I /Users/ex/.opam/5.1.1/lib/ocaml/str -I /Users/ex/.opam/5.1.1/lib/ocaml/threads -I /Users/ex/.opam/5.1.1/lib/ocaml/unix -I /Users/ex/.opam/5.1.1/lib/ocaml_intrinsics_kernel -I /Users/ex/.opam/5.1.1/lib/parsexp -I /Users/ex/.opam/5.1.1/lib/ppx_assert/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_bench/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_compare/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_enumerate/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_expect/config -I /Users/ex/.opam/5.1.1/lib/ppx_expect/config_types -I /Users/ex/.opam/5.1.1/lib/ppx_expect/make_corrected_file -I /Users/ex/.opam/5.1.1/lib/ppx_expect/runtime -I /Users/ex/.opam/5.1.1/lib/ppx_hash/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_here/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_inline_test/config -I /Users/ex/.opam/5.1.1/lib/ppx_inline_test/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_log/syntax -I /Users/ex/.opam/5.1.1/lib/ppx_log/types -I /Users/ex/.opam/5.1.1/lib/ppx_module_timer/runtime -I 
/Users/ex/.opam/5.1.1/lib/ppx_sexp_conv/runtime-lib -I /Users/ex/.opam/5.1.1/lib/ppx_stable_witness/runtime -I /Users/ex/.opam/5.1.1/lib/ppx_stable_witness/stable_witness -I /Users/ex/.opam/5.1.1/lib/ppx_string/runtime -I /Users/ex/.opam/5.1.1/lib/ppxlib/print_diff -I /Users/ex/.opam/5.1.1/lib/sexplib -I /Users/ex/.opam/5.1.1/lib/sexplib0 -I /Users/ex/.opam/5.1.1/lib/splittable_random -I /Users/ex/.opam/5.1.1/lib/stdio -I /Users/ex/.opam/5.1.1/lib/stdlib-shims -I /Users/ex/.opam/5.1.1/lib/time_now -I /Users/ex/.opam/5.1.1/lib/typerep -I /Users/ex/.opam/5.1.1/lib/variantslib -I ../bindings -o torch_api.o -c torch_api.cpp)
torch_api.cpp:903:10: fatal error: 'torch_api_generated.cpp' file not found
  903 | #include "torch_api_generated.cpp"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
-> required by _build/default/src/wrapper/torch_api.o
-> required by _build/default/src/wrapper/dlltorch_core_stubs.so
-> required by _build/install/default/lib/stublibs/dlltorch_core_stubs.so
-> required by _build/default/torch.install
-> required by alias install

arbipher commented 2 weeks ago

I cannot use my M3 Mac over the weekend, but I tested on WSL. Now both dune build -p torch and dune build compile without problems.

There is one subtle concern with my edit to src/wrapper/dune:

  (flags
   ;-Wincompatible-pointer-types ; if using gcc
   -Wno-error=incompatible-function-pointer-types ; if using clang
   )

However, I cannot figure out how to write a stanza that selects between these flags conditionally. As written, it will only bother gcc users.
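One possible direction is to pick the flag from the compiler's version banner. The sketch below is only an illustration: `pick_pointer_warning_flag` is a hypothetical name, it assumes clang identifies itself with the word "clang" in the output of `cc --version` and treats anything else as gcc, and how its output would be fed back into the dune stanza is left open.

```shell
#!/bin/sh
# Hypothetical helper: map a compiler version banner to one of the two
# flags from the stanza above. Assumes clang banners contain "clang".
pick_pointer_warning_flag() {
  # $1: the text printed by `cc --version`
  case "$1" in
    *clang*) echo "-Wno-error=incompatible-function-pointer-types" ;;
    *)       echo "-Wincompatible-pointer-types" ;;
  esac
}

# Usage:
#   pick_pointer_warning_flag "$(cc --version 2>&1)"
```

Matching on the banner text is crude, but it avoids compiling a probe program; a configurator-style probe would be a sturdier choice if this ever went into the build.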

arbipher commented 1 week ago

It also works with my OCaml 5.2.0. PyML is only used in some examples so it's not required if users just install this package (or dune build -p torch).

mwlon commented 1 week ago

I've released a fix internally, borrowing from @arbipher's PR. It should propagate out later. I'll try to get a corrected version 0.17.1 released later as well.