Closed ashok-arora closed 1 year ago
Yes, we could give that a try. Would you like to make a pull request?
Sure! I'll start working on it.
Could you start by making the precommit tests work on M1?
@tkoeppe So for precommit tests to work, the following commands should run on M1 right?
sudo -H pip3 install numpy packaging
bazel --bazelrc=.bazelrc build --config=lua5_1 --config=libc++ //...
bazel --bazelrc=.bazelrc test --config=lua5_1 --config=libc++ --test_output=errors //...
bazel --bazelrc=.bazelrc build --config=lua5_2 --config=libc++ //...
bazel --bazelrc=.bazelrc test --config=lua5_2 --config=libc++ --test_output=errors //...
Yes, that's right, the same as in the existing precommit script, basically.
I don't have experience with clang and linkers, but I tried to fix a few issues, the first I received was:
clang: error: invalid linker name in argument '-fuse-ld=lld'
which I resolved through
brew install llvm
fish_add_path /opt/homebrew/opt/llvm/bin
but now I am getting this error and not sure what this one means:
ld64.lld: error: library not found for -lenv_c_api_example
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Hi! You don't need to use LLD if you don't have it, you can just drop the -fuse-ld=lld
flag. Any linker should do the job I think.
That said, what appears to be happening here is that the tests are somehow built in dynamic mode, which somehow goes wrong. I don't have much experience with Bazel's --dynamic_mode=fully
on MacOS, but just for a quick check, could you run all bazel commands with --dynamic_mode=off
? (You can stick that in the .bazelrc
file if you like, or just modify the individual commands.)
@tkoeppe So I dropped out the linker flag and added the --dynamic_mode=off
option to .bazelrc
but it still gave the same error.
↪ bazel --bazelrc=.bazelrc build --config=lua5_1 --config=libc++ //... --sandbox_debug (base)
INFO: Analyzed 176 targets (1 packages loaded, 364 targets configured).
INFO: Found 176 targets...
ERROR: /Users/ashok/Desktop/lab2d/third_party/rl_api/BUILD:52:8: Linking third_party/rl_api/env_c_api_example_test failed: (Exit 1): sandbox-exec failed: error executing command
(cd /private/var/tmp/_bazel_ashok/93028e41a3bbf195060aa75d4a59b30f/sandbox/darwin-sandbox/1408/execroot/org_deepmind_lab2d && \
exec env - \
PATH='/Users/ashok/miniconda/bin:/Users/ashok/miniconda/condabin:/opt/homebrew/opt/llvm/bin:/opt/homebrew/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Applications/VMware Fusion Tech Preview.app/Contents/Public:/Library/Frameworks/Mono.framework/Versions/Current/Commands' \
PWD=/proc/self/cwd \
TMPDIR=/var/folders/jz/02f7nm712956gy2krdl0wlzm0000gn/T/ \
/usr/bin/sandbox-exec -f /private/var/tmp/_bazel_ashok/93028e41a3bbf195060aa75d4a59b30f/sandbox/darwin-sandbox/1408/sandbox.sb /var/tmp/_bazel_ashok/install/492267a38d620c777a8b281ba7e366a6/process-wrapper '--timeout=0' '--kill_delay=15' '--stats=/private/var/tmp/_bazel_ashok/93028e41a3bbf195060aa75d4a59b30f/sandbox/darwin-sandbox/1408/stats.out' external/local_config_cc/cc_wrapper.sh @bazel-out/darwin_arm64-fastbuild/bin/third_party/rl_api/env_c_api_example_test-2.params)
ld64.lld: error: library not found for -lenv_c_api_example
clang: error: linker command failed with exit code 1 (use -v to see invocation)
INFO: Elapsed time: 44.707s, Critical Path: 13.12s
INFO: 355 processes: 60 internal, 295 darwin-sandbox.
FAILED: Build did NOT complete successfully
Oh, wait, I see: this is a special test that explicitly as a precompiled DSO in its srcs
.
Does everything else work? This one test isn't so important.
I think what might be happening here is that somehow bazel turns the dependency on libenv_c_api_example.so
into a flag -lenv_c_api_example
(instead of something like -l:libenv_c_api_example.so
, which anyway may be a GNU-specific syntax), but on MacOS, -lfoo
means "search for libfoo.dylib
". Two very quick things you could try are to rename the target libenv_c_api_example.so
into something else, e.g. a) env_c_api_example.so
, or b) libenv_c_api_example.dylib
(and then update the srcs
in https://github.com/deepmind/lab2d/blob/94e37d189aee8d309bc1dfae7227676c77636e88/third_party/rl_api/BUILD#L58 accordingly).
Hi @tkoeppe, thank you very much for the help, renaming it to libenv_c_api_example.dylib
led to a successful built! 🎉
Although, two of the tests from lua5_1
failed due to a timeout error. Here are the logs:
Aha, great! Can you also try renaming it to something that doesn't start with "lib"? I wonder if we can find some solution that works on both platforms. I also wonder how the current x86-MacOS tests are passing, which don't seem to run into this problem.
For the tests, can you please run them with --logtostderr
, e.g. add blaze test [...] --test_arg=--logtostderr
? And increase the timeout, --test_timeout=3000
? Those tests shouldn't really be taking long; I suspect that something is actually broken inside, but without more output it's impossible to tell.
Hi @tkoeppe, I tried to name it without the preceding lib
but it keeps failing. I ran into a different error this time with running the tests with the suggested args, here's the output:
Ah OK, never mind, that's apparently not using Abseil's logging framework in the tests.
There isn't any output in the logs. I expect this is some DSO loading failure. Can you run the test directly?
bazel --bazelrc=.bazelrc build --config=lua5_1 --config=libc++ //dmlab2d/lib/system/image:image_test
This tells you where the binary is, then run that directly. It should be something like:
bazel-bin/dmlab2d/lib/system/image_image_test
See if that prints anything useful at all. Then also check its return value ($?
).
On Linux I would recommend running with LD_DEBUG=all
to show loader issues, but I don't know if there is some corresponding debug tool for the MacOS loader I'm afraid.
Running the binary directly returned:
↪ ./bazel-bin/dmlab2d/lib/system/image/image_test
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LuaUnitTest
[ RUN ] LuaUnitTest.RunsTest
dmlab2d/lib/testing/lua_unit_test.cc:34: Failure
Expected: (test_script) != (nullptr), actual: NULL vs (nullptr)
Must pass test_script as first arg!
[ FAILED ] LuaUnitTest.RunsTest (0 ms)
[----------] 1 test from LuaUnitTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] LuaUnitTest.RunsTest
1 FAILED TEST
with a return value of 1
.
Oh wait, do I have to pass some config file for test suite to image_test
binary?
Oh, yes, sorry, give me a second. I can test this on the Linux version :-)
Hey @tkoeppe, I managed to get more verbose output using
↪ LD_DEBUG=all bazel --bazelrc=.bazelrc test --test_arg=--logtostderr --test_timeout=3000 --config=lua5_1 --config=libc++ --test_output=all --cache_test_results=no //... --sandbox_debug > info.log 2> errors.log
Please find attached the info.log
and errors.log
files.
To run locally, indeed the dmlab2d_lua_test
macro adds two command line arguments that allow the test to locate the script and library directories: https://github.com/deepmind/lab2d/blob/94e37d189aee8d309bc1dfae7227676c77636e88/dmlab2d/lib/system/image/BUILD#L49
To do this manually, you can do (after building):
cd bazel-bin/dmlab2d/lib/system/image/image_test.runfiles/org_deepmind_lab2d
TEST_SRCDIR=$(pwd) ./dmlab2d/lib/system/image/image_test image_test dmlab2d/lib/system/image
(bazel test
takes care of setting the TEST_SRCDIR
environment, and the dmlab2d_lua_test
macro computes the two command line arguments. You can see the full invocation printed as part of the logs of the bazel test
run.)
The LD_DEBUG
environment had no effect. It's a Linux-only thing; the environment variable is consumed by Linux's ELF interpreter. There's no such thing on MacOS, and I don't know how to debug dynamic library loading on MacOS I'm afraid.
Yes @tkoeppe, I tried to look for macOS equivalent of LD_DEBUG
but couldn't find any. Although, adding the --test_output=all
to the command did give a verbose output, was that any helpful?
Also, since the code seems to be building on mac now, can we proceed with that and leave the tests out?
Does the image_test
pass now? It seems so from your log? What changed?
Yes, go ahead and make a pull request that updates the precommit script to add a case that uses the M1 architecture.
I'm not sure what to do with the libenv_c_api_example.so rule; we could maybe make the name conditional on some select
able condition. But I'd like to find out why the x86-MacOS version doesn't have a problem with this.
Does the
image_test
pass now? It seems so from your log? What changed?
Yes, it did pass, although I am not sure what changed. Earlier, It passed because of the --test_timeout=3000
flag but now it seems to be passing without it.
Hey @tkoeppe, did you see the latest comment in the PR?
Hey @charlesbeattie, @tkoeppe seems to be busy, could someone else look into it? Additionally, is there any further work that I can contribute to this repository?
Hi - I've been trying to figure out how to reliably set an action runner so that we can predictably get x86 and ARM machines, but without success so far. I think we'll want something robust in this regard.
One thing that we could really use is getting LuaJit to work on MacOS -- if that's something you could work on, that'd be great!
Hi @tkoeppe,
Hi - I've been trying to figure out how to reliably set an action runner so that we can predictably get x86 and ARM machines, but without success so far. I think we'll want something robust in this regard.
The ARM machines are in beta right so does the action runner provide them for lab2d
?
One thing that we could really use is getting LuaJit to work on MacOS -- if that's something you could work on, that'd be great!
Sure, I'll get to working on it, there's a homebrew package for it - https://formulae.brew.sh/formula/luajit so what exactly do I have to work on?
We already have luajit downloaded from source, as an external dependency, and we build it from source on Linux. The task would be to get it to build on MacOS, too. I think I actually have some draft changes for that somewhere, but I haven't managed to get it to work yet. E.g. there's some machine code generation for luajit which needs a case added for Mach-O binaries, I think.
To get back to the .so/.dylib issue: can you say again which build configuration you are ultimately using (e.g. what's in your .bazelrc regarding -fuse-ld
, and did you install LLD locally in some way)?
We already have luajit downloaded from source, as an external dependency, and we build it from source on Linux. The task would be to get it to build on MacOS, too. I think I actually have some draft changes for that somewhere, but I haven't managed to get it to work yet. E.g. there's some machine code generation for luajit which needs a case added for Mach-O binaries, I think.
Is there any advantage of building from source as compared to downloading the package from homebrew?
To get back to the .so/.dylib issue: can you say again which build configuration you are ultimately using (e.g. what's in your .bazelrc regarding
-fuse-ld
, and did you install LLD locally in some way)?
I had installed llvm
through brew and then the error dissapeared, and I didn't make any changes to the .bazelrc
.
To get back to the .so/.dylib issue: can you say again which build configuration you are ultimately using (e.g. what's in your .bazelrc regarding
-fuse-ld
, and did you install LLD locally in some way)?I had installed
llvm
through brew and then the error dissapeared, and I didn't make any changes to the.bazelrc
.
I believe the problem is that LLD doesn't work with the .so
filename, whereas LD64 works fine. The configuration has been somewhat broken so far in that we never actually used LLD on MacOS. I wonder how you ever got to get the error clang: error: invalid linker name in argument '-fuse-ld=lld'
; I don't see how the original configuration would have ever set this flag. I think what was originally in the .bazelrc file was actually entirely ineffective. Hmmm.
We already have luajit downloaded from source, as an external dependency, and we build it from source on Linux. The task would be to get it to build on MacOS, too. I think I actually have some draft changes for that somewhere, but I haven't managed to get it to work yet. E.g. there's some machine code generation for luajit which needs a case added for Mach-O binaries, I think.
Is there any advantage of building from source as compared to downloading the package from homebrew?
and what about this @tkoeppe?
To get back to the .so/.dylib issue: can you say again which build configuration you are ultimately using (e.g. what's in your .bazelrc regarding
-fuse-ld
, and did you install LLD locally in some way)?I had installed
llvm
through brew and then the error dissapeared, and I didn't make any changes to the.bazelrc
.I believe the problem is that LLD doesn't work with the
.so
filename, whereas LD64 works fine. The configuration has been somewhat broken so far in that we never actually used LLD on MacOS. I wonder how you ever got to get the errorclang: error: invalid linker name in argument '-fuse-ld=lld'
; I don't see how the original configuration would have ever set this flag. I think what was originally in the .bazelrc file was actually entirely ineffective. Hmmm.
So what should I replace it with?
My end goal is to make it easier to install melting-pot
on m1 mac by installing them through pypi and I want to package them all for pypi.
We already have luajit downloaded from source, as an external dependency, and we build it from source on Linux. The task would be to get it to build on MacOS, too. I think I actually have some draft changes for that somewhere, but I haven't managed to get it to work yet. E.g. there's some machine code generation for luajit which needs a case added for Mach-O binaries, I think.
Is there any advantage of building from source as compared to downloading the package from homebrew?
and what about this @tkoeppe?
I didn't have a good answer there, at least I don't know if you find it convincing. Bazel is designed around hermetic, self-contained builds, and that's how we made Lab2D internally, so it was just the natural way to do it. And it works well on Linux. So it's mainly that I don't really want to add a new way of doing things if we can just keep doing it that way. Minor benefits are more compact deployment and better opportunities for compiler and link-time optimisations, though that's probably not terribly persuasive.
To get back to the .so/.dylib issue: can you say again which build configuration you are ultimately using (e.g. what's in your .bazelrc regarding
-fuse-ld
, and did you install LLD locally in some way)?I had installed
llvm
through brew and then the error dissapeared, and I didn't make any changes to the.bazelrc
.I believe the problem is that LLD doesn't work with the
.so
filename, whereas LD64 works fine. The configuration has been somewhat broken so far in that we never actually used LLD on MacOS. I wonder how you ever got to get the errorclang: error: invalid linker name in argument '-fuse-ld=lld'
; I don't see how the original configuration would have ever set this flag. I think what was originally in the .bazelrc file was actually entirely ineffective. Hmmm.So what should I replace it with?
Just don't set any -fuse-ld
flag, and the default behaviour should be that the native MacOS linker LD64 is used. That one can handle the DSO being named ".so".
Hm, it could also be that you have to make sure to not have installed LLVM locally, since that might be found in preference to the system one. Or if that is indeed your setup, you could still make bazel use the system's clang binary by ponting the CC
and CXX
environment variables at it (e.g. see https://github.com/deepmind/lab2d/blob/34b4e923ae51dfdcf57782a39775c8debc364920/.github/workflows/precommit.yml#L35-L36).
In any case it should work fine on GitHub, so do revert the "so"-to-"dylib" change in your pull request in any case.
I didn't have a good answer there, at least I don't know if you find it convincing. Bazel is designed around hermetic, self-contained builds, and that's how we made Lab2D internally, so it was just the natural way to do it. And it works well on Linux. So it's mainly that I don't really want to add a new way of doing things if we can just keep doing it that way. Minor benefits are more compact deployment and better opportunities for compiler and link-time optimisations, though that's probably not terribly persuasive.
Hey @tkoeppe, I understand. I will try my best to make it work on macOS too. I looked around and found this luajit.rb in the homebrew repo. Would this be helpful in porting it to a bazel build?
Or if that is indeed your setup, you could still make bazel use the system's clang binary by ponting the CC and CXX environment variables at it
Oh right, thats a good idea. I'll add it to the file.
Please see https://github.com/deepmind/lab2d/pull/26#issuecomment-1557135662: I can't find any evidence that the GitHub runners use M1 hardware.
Hey @tkoeppe, I understand. I will try my best to make it work on macOS too. I looked around and found this luajit.rb in the homebrew repo. Would this be helpful in porting it to a bazel build?
We already have the dependency on LuaJit in Bazel. We just need to make it compile on MacOS. Check out the Linux version, that one already uses it.
I can check in my existing draft changes to get closer to making LuaJit buildable if that helps.
(Looks like we just fixed the ".so"/".dylib" issue by patching LLD: https://reviews.llvm.org/D151147)
Hey @tkoeppe, sorry I've been busy in travelling this week and did not have access to my mac machine to test out the changes.
I'll get to it in a couple of days.
I have ARM builds working in https://github.com/deepmind/lab2d/actions/runs/5093675265/jobs/9156577098 via cross-compilation: the runner is still using an x86 mac, but I told clang to compile for ARM.
For LuaJit, I also sort of got that working, but a major problem is that we've pinned it to the last released version (2.1.0v3-beta) from 2017, and it desparately needs an update to work (well) with new Macs.
(Looks like we just fixed the ".so"/".dylib" issue by patching LLD: https://reviews.llvm.org/D151147)
Oh wow, thank you so much for reporting it upstream to bazel, your report here was so detailed and well written.
I have ARM builds working in https://github.com/deepmind/lab2d/actions/runs/5093675265/jobs/9156577098 via cross-compilation: the runner is still using an x86 mac, but I told clang to compile for ARM.
Yeah, good idea @tkoeppe, cross compilation seems a good option, is there any drawback to it? In the mean time, I have filed a support ticket with GitHub support regarding actions for m1 ARM machines.
For LuaJit, I also sort of got that working, but a major problem is that we've pinned it to the last released version (2.1.0v3-beta) from 2017, and it desparately needs an update to work (well) with new Macs.
I'm back on my machine, should I try to build LuaJit on mine too?
I set up a cross-compling wheel build. Have a look at the example output, and maybe give it a try and see if it actually works? https://github.com/deepmind/lab2d/releases/tag/test_arm64
I'll file a new issue about LuaJit.
I downloaded the dmlab2d-1.0-cp310-cp310-macosx_13_0_arm64.whl file and ran this piece to code:
import dmlab2d
import dmlab2d.runfiles_helper
lab = dmlab2d.Lab2d(dmlab2d.runfiles_helper.find(), {"levelName": "chase_eat"})
env = dmlab2d.Environment(lab, ["WORLD.RGB"])
env.step({})
It ran successfully but there was no output, is there supposed to be some output?
Hey @tkoeppe can you please look into the above code? Also, if cross-compiling is working and building the .whl
files, can we also publish them on PyPI?
I think there's a full example random agent shown in the documentation somewhere, can you try running that?
I think there's a full example random agent shown in the documentation somewhere, can you try running that?
Oh okay, so I ran the dmlab2d/random_agent.py
and here's the output:
https://github.com/deepmind/lab2d/assets/19496434/4296d7ef-80c9-468b-8111-9d2fc250e76f
Is this the expected output or is there a bug?
I think if it behaves roughly the same as on Linux, then it should be fine!
It's good to know that the cross-compilation is working. I'll try to also get LuaJit to work on MacOS, which is something I've wanted to do for some time, and then we can hopefully enable both LuaJit and the ARM64 cross-compilation mode and make a new release.
Hey, I noticed in workflows file that it builds the wheels for various x86_64 platforms.
GitHub has recently announced support for M1 macs in actions as well, so would it be possible to build M1 wheel files as well?
This will help in avoiding the hassle of building it from source.