googleapis / python-bigquery


Importing bigquery disables python garbage collector #1826

Closed mwkracht closed 9 months ago

mwkracht commented 9 months ago

Importing the BigQuery library disables the Python garbage collector. This causes excessive memory usage in the process that imports the library and can lead to an eventual OOM in longer-running processes. It makes using this package in a long-running process impossible.

I would say disabling the gc seems like bad practice in general, but that's maybe more of a personal judgement. At the very least, doing so on import with no warning or documentation (I found no references when googling this package and gc disable) can lead to very misleading and hard-to-debug memory issues.
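
In the meantime, a minimal guard along these lines (my own sketch, not an official workaround) snapshots the collector state around the import and restores it:

import gc

was_enabled = gc.isenabled()
from google.cloud import bigquery  # may disable gc as a side effect
if was_enabled and not gc.isenabled():
    gc.enable()  # restore collection for long-running processes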

Environment details

# uname -a
Linux f372c8ce385b 6.4.16-linuxkit #1 SMP PREEMPT Sat Sep 23 13:36:48 UTC 2023 x86_64 GNU/Linux

# python --version
Python 3.9.2

# pip --version
pip 24.0 from /usr/local/lib/python3.9/dist-packages/pip (python 3.9)

# pip show google-cloud-bigquery
Name: google-cloud-bigquery
Version: 3.17.2
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /usr/local/lib/python3.9/dist-packages
Requires: google-api-core, google-cloud-core, google-resumable-media, packaging, python-dateutil, requests

Not sure if this is useful, but here are the other Google packages/dependencies I'm running alongside bigquery:

# pip freeze | grep google
google-api-core==2.17.1
google-api-python-client==2.119.0
google-auth==2.28.1
google-auth-httplib2==0.2.0
google-cloud-access-context-manager==0.2.0
google-cloud-appengine-logging==1.4.1
google-cloud-asset==3.24.1
google-cloud-audit-log==0.2.5
google-cloud-bigquery==3.17.2
google-cloud-bigquery-storage==2.24.0
google-cloud-bigtable==2.23.0
google-cloud-compute==1.16.1
google-cloud-container==2.40.0
google-cloud-core==2.4.1
google-cloud-logging==3.9.0
google-cloud-monitoring==2.19.1
google-cloud-org-policy==1.10.0
google-cloud-os-config==1.17.1
google-cloud-pubsub==2.19.4
google-cloud-resource-manager==1.12.1
google-cloud-storage==2.14.0
google-cloud-trace==1.13.1
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
grpc-google-iam-v1==0.13.0

Steps to reproduce

  1. Import the bigquery lib
  2. Run any Python code long enough and you're guaranteed to OOM

Code example

# ipython
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import gc

In [2]: gc.isenabled()
Out[2]: True

In [3]: from google.cloud import bigquery

In [4]: gc.isenabled()
Out[4]: False

Stack trace

n/a

Linchin commented 9 months ago

Thank you @mwkracht for reporting the issue! It's very strange, because we don't explicitly disable the garbage collector in our code, and I'm unable to reproduce the issue locally on either Python 3.9 or Python 3.12 (although I'm using micromamba as the package installer, but still).

$ python
Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10)
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> gc.isenabled()
True
>>> from google.cloud import bigquery
>>> gc.isenabled()
True

My guess is that some package we depend on is disabling the garbage collector. To help me better investigate the issue, could you let me know:

  1. How did you install python?
  2. When did this issue start happening? Did it start recently, or has it always been like this?

(Marking as P2 because it got 6 upvotes in an hour.)

mwkracht commented 9 months ago

@Linchin thanks for the quick reply. I think I've found the combination of installs that reproduces it. The dependency install sequence is definitely screwy:

pip install pyarrow==6.0.1
pip uninstall -y numpy
pip install google-cloud-bigquery==3.17.2

After doing this, I think you should be able to see the gc become disabled when you redo the import steps from above. I acknowledge that uninstalling numpy after it's installed as a required dependency of pyarrow is screwy, and possibly makes all of this a moot point.

For completeness' sake, I'll try to explain why it would ever make sense to uninstall numpy when it's a required dependency of pyarrow :)

We have a dependency that requires numpy. We don't use any part of that dependency that actually consumes numpy, but the dependency doesn't make numpy optional, so we're kind of stuck. We also have dependencies that do not explicitly require numpy but check for the import at runtime and use it if it's there (I think you have something like that in your _versions_helpers.py).
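
That runtime check is the standard optional-import pattern, something like this (a generic sketch, not any particular package's actual code):

try:
    import numpy  # optional: only used for fast paths if present
except ImportError:
    numpy = None

if numpy is not None:
    pass  # numpy-accelerated path
else:
    pass  # pure-Python fallback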

There's a race condition bug in numpy related to how it interacts with environment variables that we hit with threaded code (https://github.com/numpy/numpy/issues/21223). So if we have numpy installed, the dependencies that don't explicitly require numpy will use it because it's there and hit this race condition. To avoid that, we have to uninstall numpy :)

It just so happens that pyarrow explicitly requires numpy (we don't explicitly require/import pyarrow; I narrowed it down to that from a pip freeze of all our deps). So somewhere in the dependency tree we're installing pyarrow, pyarrow thinks all is well because numpy was installed, and then we uninstall numpy. This bigquery package will then use pyarrow if it's available, and somewhere in that chain the gc is getting turned off.

We are probably to blame for deleting numpy, but that fallback behavior for a missing dependency is not ideal. The solution is to not have pyarrow installed, which forces bq to skip pyarrow and avoid whatever is causing the gc to be disabled. We can do that either with an explicit uninstall or by avoiding whatever dependencies pull it in in the first place.
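
For reference, I'd guess the use-pyarrow-if-importable gate looks roughly like this (a hedged sketch extending the pattern above to mirror the try_import() behavior shown later in this thread; the minimum version here is illustrative, not the library's actual pin):

import packaging.version

def try_import_pyarrow(min_version="3.0.0"):  # illustrative bound
    try:
        import pyarrow
    except ImportError:  # includes ModuleNotFoundError
        return None
    if packaging.version.parse(pyarrow.__version__) < packaging.version.parse(min_version):
        return None
    return pyarrow

pyarrow = try_import_pyarrow()  # None forces the non-pyarrow code path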

Possibly there's nothing to do here other than documenting this in case anyone else runs into the issue. It would be cool if the fallback behavior weren't to silently turn the gc off, but I don't know if that's something you all have control over or if it's a pyarrow thing.

Lmk if there's anything else you need from me. I'll leave it up to you to decide whether there's something to be improved upon here re: this package.

Linchin commented 9 months ago

@mwkracht I tried the same thing (but with micromamba) and the garbage collector is still enabled XD

Also, thanks for the very detailed explanation. I guess your code is also special in that it runs on multiple threads. Is it possible to avoid the race condition?

Just like you said, this problem seems to be outside our library's scope. Forcing the garbage collector to be on when importing bigquery feels very hacky and may bring other complexities. I will close this issue for now, but please let us know if you have any more questions or suggestions.

mwkracht commented 9 months ago

That's wild that you can't recreate it either. FWIW I did a little more digging, and it looks like the gc issue is caused by pyarrow and not bigquery. Dropping some notes here in case the breadcrumbs help anyone else.

I can recreate it using the stock Python 3.9 Docker image:

❯ docker run --rm -it --entrypoint bash python:3.9

root@3a8dde739ce9:/# pip install pyarrow==6.0.1
...
Successfully installed numpy-1.26.4 pyarrow-6.0.1

root@3a8dde739ce9:/# pip uninstall -y numpy
Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4

root@3a8dde739ce9:/# pip install google-cloud-bigquery==3.17.2
...
Successfully installed cachetools-5.3.2 certifi-2024.2.2 charset-normalizer-3.3.2 google-api-core-2.17.1 google-auth-2.28.1 google-cloud-bigquery-3.17.2 google-cloud-core-2.4.1 google-crc32c-1.5.0 google-resumable-media-2.7.0 googleapis-common-protos-1.62.0 idna-3.6 packaging-23.2 protobuf-4.25.3 pyasn1-0.5.1 pyasn1-modules-0.3.0 python-dateutil-2.8.2 requests-2.31.0 rsa-4.9 six-1.16.0 urllib3-2.2.1

root@3a8dde739ce9:/# python
Python 3.9.18 (main, Feb 13 2024, 09:58:52)
[GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> gc.isenabled()
True
>>> from google.cloud import bigquery
>>> gc.isenabled()
False
>>> exit()
root@3a8dde739ce9:/#

If I try to import pyarrow after deleting numpy, I get a ModuleNotFoundError on the import:

❯ docker run --rm -it --entrypoint bash python:3.9
root@d07c3fe327a9:/# pip install pyarrow==6.0.1
...
Installing collected packages: numpy, pyarrow
Successfully installed numpy-1.26.4 pyarrow-6.0.1

root@d07c3fe327a9:/# pip uninstall -y numpy
Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4

root@d07c3fe327a9:/# python
Python 3.9.18 (main, Feb 13 2024, 09:58:52)
[GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyarrow/__init__.py", line 63, in <module>
    import pyarrow.lib as _lib
  File "pyarrow/lib.pyx", line 24, in init pyarrow.lib
ModuleNotFoundError: No module named 'numpy'
>>> issubclass(ModuleNotFoundError, ImportError)
True

Worth noting that ModuleNotFoundError is a subclass of ImportError, so anywhere bigquery imports pyarrow I would expect it to act as an ImportError once numpy is deleted. Looking at your version helpers, they do correctly show that no usable pyarrow version is found (running on the stock py3.9 Docker image):

>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyarrow/__init__.py", line 63, in <module>
    import pyarrow.lib as _lib
  File "pyarrow/lib.pyx", line 24, in init pyarrow.lib
ModuleNotFoundError: No module named 'numpy'
>>> from google.cloud import bigquery
>>> bigquery._versions_helpers.PYARROW_VERSIONS.try_import() is None
True

All of that looks like what I would expect to see. When I tested the effect on the garbage collector, it's the pyarrow import after numpy is deleted that turns the gc off (running on the stock py3.9 Docker image):

>>> import gc;gc.isenabled()
True
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyarrow/__init__.py", line 63, in <module>
    import pyarrow.lib as _lib
  File "pyarrow/lib.pyx", line 24, in init pyarrow.lib
ModuleNotFoundError: No module named 'numpy'
>>> gc.isenabled()
False

So pyarrow raises the correct error for bq to consider it not found/importable, but it is also turning off the garbage collector, which may or may not be what the pyarrow devs want/expect 😅. At least it's for sure a pyarrow behavior, and not something the bigquery package is incorrectly handling or causing.
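
If anyone else needs to track down which import in a large dependency tree flips the collector, a small helper like this (my own sketch; the module names at the bottom are just examples) checks each candidate in a clean subprocess:

import subprocess
import sys

def import_disables_gc(module: str) -> bool:
    # Run the import in a fresh interpreter so one module's side effect
    # can't contaminate the next check.
    code = (
        "import gc\n"
        "try:\n"
        f"    import {module}\n"
        "except ImportError:\n"
        "    pass\n"
        "print(gc.isenabled())"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "False"

for mod in ("pyarrow", "google.cloud.bigquery"):
    print(mod, "disables gc:", import_disables_gc(mod))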