pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

custom cuda extensions make installing ao hard #288

Closed msaroufim closed 1 month ago

msaroufim commented 3 months ago

I'm collecting a few installation issues I've seen. I don't have a clear picture of how to solve them yet, but I'm aggregating them here in the hope that inspiration will strike.

Problems

Problem 1

The issue below goes away if you install ao and then cd out of the ao directory. IIRC PyTorch has a similar problem. Repro shared by @jerryzh168:

Traceback (most recent call last):
  File "/home/jerryzh/ao/example.py", line 2, in <module>
    from torchao.quantization.quant_primitives import MappingType, ZeroPointDomain
  File "/home/jerryzh/ao/torchao/__init__.py", line 8, in <module>
    from . import _C
ImportError: cannot import name '_C' from partially initialized module 'torchao' (most likely due to a circular import) (/home/jerryzh/ao/torchao/__init__.py)
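
One plausible reading of this traceback is that running from the repo root makes Python pick up the local `torchao/` source directory instead of the installed package, so the compiled `_C` module is not found next to it. A minimal sketch for checking where a package would resolve from; the helper name `resolves_inside` is hypothetical and not part of torchao:

```python
import importlib.util
import os


def resolves_inside(package, directory):
    """Return True if `package` would be imported from somewhere under `directory`."""
    spec = importlib.util.find_spec(package)  # locates the package without importing it
    if spec is None or not spec.has_location or spec.origin is None:
        return False  # not importable, built-in, or a namespace package
    directory = os.path.realpath(directory)
    origin = os.path.realpath(spec.origin)
    try:
        return os.path.commonpath([directory, origin]) == directory
    except ValueError:
        return False  # e.g. paths on different drives on Windows


if __name__ == "__main__":
    # Run from the ao checkout: True suggests the source tree shadows the install.
    print(resolves_inside("torchao", os.getcwd()))
```

If this prints `True` from inside the checkout and `False` from another directory, that would be consistent with the "cd out of the ao directory" workaround.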

Problem 2

Building the fp6 kernels fails (log: https://hastebin.com/share/riridivafa.rust) even though the nvcc and gcc versions seem fine. Repro shared by @CoffeeVampir3.

Problem 3

This error shows up when you pip install ao, or build it, with a mismatch in CUDA versions. Repro shared by @vayuda:

python test/quantization/test_quant_api.py
Traceback (most recent call last):
  File "/u/pj8wfq/ao/test/quantization/test_quant_api.py", line 21, in <module>
    from torchao.dtypes import (
  File "/u/pj8wfq/ao/torchao/__init__.py", line 8, in <module>
    from . import _C
ImportError: /u/pj8wfq/ao/torchao/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
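
When chasing this kind of `undefined symbol` error, it can help to record what the installed PyTorch was built against before rebuilding ao. A small sketch, assuming only that PyTorch may or may not be importable in the environment; `collect_build_info` is a made-up helper name, not a torchao API:

```python
import importlib
import platform


def collect_build_info():
    """Gather version details that are worth comparing against the ao build."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        return info  # PyTorch is missing or itself fails to load
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cxx11_abi"] = torch.compiled_with_cxx11_abi()
    return info


if __name__ == "__main__":
    for key, value in collect_build_info().items():
        print(f"{key}: {value}")
```

Comparing `torch_cuda` with the toolkit reported by `nvcc --version` is one way to spot the mismatch described above.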

Problem 4

PyPI binaries crash on non-CUDA devices:

  File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torchao/__init__.py", line 14, in <module>
    from . import _C
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
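
The missing `libcudart.so.12` could in principle be detected before the extension import is attempted. A sketch using `ctypes`, under the assumption that probing the CUDA runtime by its soname is an acceptable check; `can_load_shared_library` is an illustrative helper, not something torchao provides:

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open `name`, False otherwise.

    Note that a successful probe really loads the library into this process.
    """
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Soname taken from the error message above.
    print(can_load_shared_library("libcudart.so.12"))
```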

Solutions

We need graceful solutions, but in the meantime I'm embarrassed to say I've been recommending a nuclear option: disabling the C extensions.

Specifically in torchao/__init__.py delete

if not _IS_FBCODE:
    from . import _C
    from . import ops

And in setup.py delete

    ext_modules=get_extensions(),

gau-nernst commented 3 months ago

Maybe there should be a flag to skip compiling the extension modules, and we should make sure the package still runs when they aren't built. (Still, it's a stopgap measure; it doesn't tackle the root problem.)
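
For the "still runs without the extension" half of this, a guarded import is one common pattern. The sketch below is an assumption about how it could look, not what torchao actually does; `try_import_extension` is a hypothetical helper name:

```python
import importlib
import warnings


def try_import_extension(module_name):
    """Import an optional compiled module, returning None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # ImportError covers a missing module, undefined symbols in the .so,
        # and shared libraries such as libcudart that cannot be opened.
        warnings.warn(f"Skipping optional extension {module_name!r}: {exc}")
        return None


if __name__ == "__main__":
    _C = try_import_extension("torchao._C")
    print("C++ extensions available:", _C is not None)
```

Code paths that need the custom ops would then have to check for `None` and fall back or raise a clear error, which is the part this sketch leaves open.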

msaroufim commented 3 months ago

Indeed, an env variable that does the nuclear option seems practical, although yeah, it's gonna be clunky to have to tell people to install us with NO_CPP=True pip install torchao.
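
A rough sketch of how such a switch could be wired into setup.py. `NO_CPP` is the variable name floated above and `get_extensions` is the function already passed to `ext_modules`; the accepted truthy spellings and the helper `select_ext_modules` are assumptions made for illustration:

```python
import os

# Assumed set of values that count as "on"; the thread only mentions NO_CPP=True.
_TRUTHY = {"1", "true", "yes", "on"}


def cpp_extensions_disabled(environ=os.environ):
    """Return True if NO_CPP is set to a truthy value."""
    return environ.get("NO_CPP", "").strip().lower() in _TRUTHY


def select_ext_modules(get_extensions, environ=os.environ):
    """Return [] when NO_CPP is set, otherwise whatever get_extensions() returns."""
    if cpp_extensions_disabled(environ):
        return []
    return get_extensions()
```

In setup.py this would be used as `ext_modules=select_ext_modules(get_extensions)`, so a plain `pip install` keeps building the extensions and only the opt-out path skips them.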

malfet commented 3 months ago
jerryzh168 commented 3 months ago

Another issue similar to Problem 3:

Traceback (most recent call last):
  File "/home/jerryzh/ao/test/quantization/test_quant_api.py", line 21, in <module>
    from torchao.dtypes import (
  File "/home/jerryzh/anaconda3/envs/ao_new/lib/python3.9/site-packages/torchao-0.2.0-py3.9-linux-x86_64.egg/torchao/__init__.py", line 8, in <module>
    from . import _C
ImportError: /home/jerryzh/anaconda3/envs/ao_new/lib/python3.9/site-packages/torchao-0.2.0-py3.9-linux-x86_64.egg/torchao/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit11parseSchemaERKSs

msaroufim commented 1 month ago

These issues have mostly been fixed. We can reopen if more stuff comes up.