google-deepmind / envlogger

A tool for recording RL trajectories.
Apache License 2.0
91 stars 13 forks source link

importing acme and envlogger crashes #3

Closed Vincentx15 closed 2 years ago

Vincentx15 commented 2 years ago

Hello,

Using a very minimal setup I have an error on import, I guess the name StatusCode is used by both programs in absl...

conda create -n test_envlogger python=3.7 pip
conda activate test_envlogger
pip install tensorflow gym dm-acme[tensorflow] envlogger
python -c "import acme; import envlogger"
Traceback (most recent call last):
  File "temp_test.py", line 2, in <module>
    import envlogger
  File "/home/vmallet/anaconda3/envs/test_3/lib/python3.7/site-packages/envlogger/__init__.py", line 18, in <module>
    from envlogger import environment_logger
  File "/home/vmallet/anaconda3/envs/test_3/lib/python3.7/site-packages/envlogger/environment_logger.py", line 32, in <module>
    from envlogger.backends import riegeli_backend_writer
  File "/home/vmallet/anaconda3/envs/test_3/lib/python3.7/site-packages/envlogger/backends/riegeli_backend_writer.py", line 24, in <module>
    from envlogger.backends.python import riegeli_dataset_writer
ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

I hope you can help me !

Best, Vincent

kenjitoyama commented 2 years ago

Thanks for the report, @Vincentx15! I can confirm it's currently broken on a clean python:3.7 docker image.

@sabelaraga do you know if Acme has changed anything recently with imports? riegeli_dataset_writer is a pybind11 module which imports absl's Status stuff with pybind11::google::ImportStatusModule();. Is that where things are going wrong?

kenjitoyama commented 2 years ago

I don't have a solution yet, but I gathered some facts.

Acme uses DeepMind's Launchpad (version 0.5.0), which contains Courier (a communication library) which contains pybind11 code (and it uses absl StatusCode). I can confirm that if we invert the import order (i.e. "import envlogger; import acme") we see the error blow up in Launchpad:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/acme/__init__.py", line 35, in <module>
    from acme.environment_loop import EnvironmentLoop
  File "/usr/local/lib/python3.7/dist-packages/acme/environment_loop.py", line 26, in <module>
    from acme.utils import signals
  File "/usr/local/lib/python3.7/dist-packages/acme/utils/signals.py", line 22, in <module>
    import launchpad
  File "/usr/local/lib/python3.7/dist-packages/launchpad/__init__.py", line 36, in <module>
    from launchpad.nodes.courier.node import CourierHandle
  File "/usr/local/lib/python3.7/dist-packages/launchpad/nodes/courier/node.py", line 21, in <module>
    import courier
  File "/usr/local/lib/python3.7/dist-packages/courier/__init__.py", line 26, in <module>
    from courier.python.client import Client  # pytype: disable=import-error
  File "/usr/local/lib/python3.7/dist-packages/courier/python/client.py", line 30, in <module>
    from courier.python import py_client
ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

EnvLogger uses pybind11_abseil version 7cf76381300671ade35127cb25566d588c69e717 (Apr 28, 2021). Launchpad uses pybind11_abseil version dc0479f3973b062783958da333a235bc528bfd22 (Jan 19, 2021).

In both of these versions, the registration of Status objects is done unconditionally. In a more recent pybind11_abseil (Feb 02, 2022, 1bb411eb1b13440d5af61660e70e8c5b5b2998a1), a conditional import was added which I believe should fix the problem.

I tried compiling EnvLogger with the latest version of pybind11_abseil, but when importing in Python the old code path is still followed. This is probably because python extensions are shared libraries and because we haven't changed Launchpad's version, so that gets picked up and EnvLogger's version is ignored.

I also tried to manually load the latest version of pybind11_abseil by hand with python -c "from ctypes import *; cdll.LoadLibrary('/envlogger/bazel-out/k8-opt/bin/external/pybind11_abseil/pybind11_abseil/status.so'); import acme; import envlogger", but it didn't make a difference, I still get the error.

Unfortunately Acme and Launchpad do not use Bazel, so I can't easily test them all into a single build system (which would force the use of a single version).

I'm not sure what's the best path forward. Maybe we can upgrade EnvLogger's pybind11_abseil version to head and check with Launchpad authors to see if they can also upgrade their version to see if this gets resolved. On our end it shouldn't be too bad, but I don't know if it's annoying for them to do so.

I'll keep this bug open.

Cheers,

Daniel

sabelaraga commented 2 years ago

We are trying to upgrade Launchpad, let's see if that solves the issue!

Vincentx15 commented 2 years ago

Thanks a lot folks !

Vincentx15 commented 2 years ago

Hello, any news ?

I have tried to install the freshest installs of acme and envlogger. I was able to compile acme from the git repo, but not envlogger. I had a bug during the fetching of zlib, it is pointing to a 404. I was not able to fix it by modifying the bazel workspace to look for the 1.2.12 version...

Repository zlib instantiated at:
  /home/vmallet/projects/emaze/temp_test/envlogger/WORKSPACE.bazel:28:14: in <toplevel>
 /home/vmallet/.cache/bazel/_bazel_vmallet/f7bd20dde09ae44071ab1b18d39dacd1/external/com_google_protobuf/protobuf_deps.bzl:9:21: in protobuf_deps
Repository rule http_archive defined at:
  /home/vmallet/.cache/bazel/_bazel_vmallet/f7bd20dde09ae44071ab1b18d39dacd1/external/bazel_tools/tools/build_defs/repo/http.bzl:336:31: in <toplevel>
WARNING: Download from https://zlib.net/zlib-1.2.11.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
ERROR: An error occurred during the fetch of repository 'zlib':
   Traceback (most recent call last):
    File "/home/vmallet/.cache/bazel/_bazel_vmallet/f7bd20dde09ae44071ab1b18d39dacd1/external/bazel_tools/tools/build_defs/repo/http.bzl", line 111, column 45, in _http_archive_impl
        download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [https://zlib.net/zlib-1.2.11.tar.gz] to /home/vmallet/.cache/bazel/_bazel_vmallet/f7bd20dde09ae44071ab1b18d39dacd1/external/zlib/temp4474975640135408570/zlib-1.2.11.tar.gz: GET returned 404 Not Found

Any insights ?

kenjitoyama commented 2 years ago

Hi @Vincentx15,

We just noticed such error too. This is a separate bug from pybind11's import above and it's related to zlib changing that path (https://zlib.net/zlib-1.2.11.tar.gz).

Changing the WORKSPACE file isn't working because it is com_google_protobuf (v3.9.0) that's loading that before it gets to our http_archive() in our WORKSPACE. I'll push a change to manually patch it to 1.2.12 (and in the process also update EnvLogger's own version) which should fix this bazel error. The real fix for all this is to update com_google_protobuf but some of our integration tests with TFDS had to use that old version 😞. @sabelaraga and I will re-check the current state to see what we can do.

I have manually tested it locally and everything worked fine (including python -c "import acme; import envlogger") by using dm-launchpad-nightly[tensorflow] which incorporates the same pybind11 fix mentioned above. You may wanna try it that way to see if it works for you. (To create local wheels for envlogger so that you can install it in your system, you can do python envlogger/setup.py bdist_wheel --plat manylinux2014_x86_64 and then python -m pip install dist/envlogger-1.0.7-cp37-cp37m-manylinux2014_x86_64.whl). Once launchpad and acme release new wheels, we can build wheels for EnvLogger and then everything should be working at head.

Cheers,

Daniel

bougui505 commented 2 years ago

Hello, As suggested above, I attempted to fix the double import of envlogger and acme using the following approach:

1 Creation of a local wheel: python envlogger/setup.py bdist_wheel --plat manylinux2014_x86_64 2 Install launchpad: pip install dm-launchpad-nightly[tensorflow] 3 Install envlogger from the wheel: python -m pip install source/envlogger/dist/envlogger-1.0.7-cp37-cp37m-manylinux2014_x86_64.whl

But still the same error remains:

acme2 ❯ python -c "import acme; import envlogger"                                                                                                                                                acme-install 245ms 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/bougui/anaconda3/envs/acme2/lib/python3.7/site-packages/envlogger/__init__.py", line 18, in <module>
    from envlogger import environment_logger
  File "/home/bougui/anaconda3/envs/acme2/lib/python3.7/site-packages/envlogger/environment_logger.py", line 32, in <module>
    from envlogger.backends import riegeli_backend_writer
  File "/home/bougui/anaconda3/envs/acme2/lib/python3.7/site-packages/envlogger/backends/riegeli_backend_writer.py", line 24, in <module>
    from envlogger.backends.python import riegeli_dataset_writer
ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined
sabelaraga commented 2 years ago

I'm trying to repro and will post the result I get.

In the meantime (and depending on what you're using envlogger for), since the problem seems to be in the riegeli backend, one workaround is to try to avoid importing the full envlogger library and use a different envlogger writer (like the TFDS writer, example here and colab. It requires to comment the imports and some lines that instantiate the riegeli writer in here and compile a new wheel with the modifications...

sabelaraga commented 2 years ago

Ah, the problem is that acme still installs a specific version of launchpad (https://github.com/deepmind/acme/blob/master/setup.py#L53). Can you try by uninstalling that one and re-installing the nightly?

sabelaraga commented 2 years ago

BTW, I was able to repro (installing an envlogger wheel created with the last version of the github repo, installing acme, removing launchpad and reinstalling launchpad nightly). We are trying to figure out how to solve it.

kenjitoyama commented 2 years ago

Hello everyone,

I looked into this today and found that the main issue is that pybind11_abseil uses C++'s typeid(absl::Status) to check whether the module has already been imported. Its return value (of type std::type_info) is implementation defined and may even be resolved at compile-time in some cases.

If I bypass that check (e.g. by passing false to pybind11::google::ImportStatusModule()), then I'm able to import envlogger after importing courier (the problematic library in launchpad). That is, envlogger can workaround the problem by not re-registering the absl bindings, but if it's the other way around, then it'll fail. We'd have to do the same thing in all the other libraries. However, if all of them pass false to that call, then theabsl bindings will NOT be registered, and the name will be undefined. This is super problematic because it would require careful ordering of imports to avoid errors.

We could sync all pybind11 versions of these related projects (acme, envlogger, launchpad etc), but I strongly believe this is the wrong way to go. This would of course introduce tight coupling between these projects, but it would also not guarantee that other problems could occur when using shared libraries.

In light of this, I'll work on removing pybind11_abseil from EnvLogger. I'll create a custom type to signal errors or even just raise an exception directly in our pybind11 modules. This gives us more control over the interaction with other libraries and reduces our API surface area. Python users of the EnvLogger class will not feel any change since we use pybind11_abseil internally only.

This change may take a little while to take place since there are users who call RiegeliDatasetWriter and RiegeliDatasetReader directly and consume absl::Status[Or] (in Python) and I need to update them first.

Cheers,

Daniel

sabelaraga commented 2 years ago

Version 1.0.8 with these changes is already available in pypi.

sabelaraga commented 2 years ago

I've run the following:

$ pip install envlogger
$ sudo apt-get install libgmp-dev
$ pip install dm-acme
$ pip uninstall dm-launchpad
$ pip install dm-launchpad-nightly[tensorflow]

Now these two work:

$ python3 -c "import acme; import envlogger"
$ python3 -c "import envlogger; import acme"

I'll follow up with Launchpad/Acme to see if they can release a new stable version (so the uninstalling/installing doesn't have to be done), but otherwise, it seems fixed.

Vincent, thanks for reporting and let us know if this still doesn't work for you.

Thanks!

Vincentx15 commented 2 years ago

Worked like a charm, thanks a lot for the support !

Have a good day :)

Vincentx15 commented 2 years ago

Thanks to this fix, I am now using envlogger with tfds backend and it looks like it works well.

I would just comment that I had struggles in using the DatasetConfig class necessary to initialize a tfds backend. Indeed in the main readme, the example explicitly builds it by hand, which is not very flexible (and the info is already in the env). To get it from dm-env specs, I had to play around a bit with tf DTYPES. I wrote this small helper code to get the DatasetConfig from the dm-env. It might be useful to someone else. I don't feel like making a PR, because I don't know where to put this, if you want I can open another issue.

def dmenvs_to_tfds_specs(env_spec):
    """
    This function is needed to go from the specs of dm-env to the ones in rlds
    :param env_spec: either take a dict (like the one we use for observations) or an Array
    :return: The appropriate tfds specification to give to a tfds backend writer
    """

    def dmspec_to_tfdsspec(spec):
        dtype = tf.dtypes.as_dtype(spec.dtype)
        new_spec = tfds.features.Tensor(shape=spec.shape, dtype=dtype)
        return new_spec

    try:
        updated_dict = {}
        for key, value in env_spec.items():
            updated_dict[key] = dmspec_to_tfdsspec(value)
        return updated_dict
    except AttributeError:
        return dmspec_to_tfdsspec(env_spec)

env = wrappers.GymWrapper(env)
environment_spec = specs.make_environment_spec(env)
dataset_config = tfds.rlds.rlds_base.DatasetConfig(
    name='experts_em',
    observation_info=dmenvs_to_tfds_specs(environment_spec.observations))

Best, Vincent

sabelaraga commented 2 years ago

Hey Vincent, yes, it may be tricky to infer the Features, specially if your env spec includes something "fancy" like an Image (also note that if any of the specs is a scalar or a nested value dmspec_to_tfdsspec may fail). I might be able to add a generic tool to https://github.com/google-research/rlds in the upcoming weeks, keep tuned :)

Vincentx15 commented 2 years ago

My little function will work with spaces derived from gym.spaces.Dict and gym.spaces.Box that I believe to be the most frequent inputs. But for sure it is not well written and tested, it's hard to get into the dm rl framework... Thanks for adding that util !

Thanks for your reply and help !

Best, Vincent

sabelaraga commented 2 years ago

Hey Vincent,

I recently submitted this funcitonality https://github.com/google-research/rlds/blob/main/rlds/tfds/config_generator.py

It's not yet released into the pip package, but the code is there :)

Sabela.

Vincentx15 commented 2 years ago

Neat ! Thanks for the support :)