kevinjohncutler / omnipose

Omnipose: a high-precision solution for morphology-independent cell segmentation
https://omnipose.readthedocs.io

GPU support for Mac M1? #14

Open alix-pham opened 1 year ago

alix-pham commented 1 year ago

Hi :)

I know the README says GPU acceleration can only be used with NVIDIA GPUs (Linux and Windows); I was wondering if you were planning on providing support for Mac M1 GPUs in the future?

I found this article; would it work? (I guess not, because you use CUDA and CUDA cannot work with the M1?)

Thank you very much in advance!

Best, Alix

kevinjohncutler commented 1 year ago

@alix-pham I will definitely get Apple Silicon GPU support working. It should be as simple as changing the device type from cuda in several places, probably not more than a few hours of work once I get the chance. In my testing so far, the much more difficult thing is getting a conda environment installed that has all the dependencies built for arm64... so releasing an environment is also on my to-do list.
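
A minimal sketch of the kind of device selection involved, assuming a recent PyTorch with the MPS backend; the function name is illustrative, not the actual omnipose code:

import torch

def pick_device():
    # Prefer CUDA on NVIDIA machines, then MPS on Apple Silicon, otherwise fall back to the CPU.
    if torch.cuda.is_available():
        return torch.device('cuda')
    if getattr(torch.backends, 'mps', None) is not None and torch.backends.mps.is_available():
        return torch.device('mps')
    return torch.device('cpu')

device = pick_device()
x = torch.randn(4, 2, 224, 224, device=device)  # dummy 2-channel batch to confirm the device works
print(device, x.shape)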

alix-pham commented 1 year ago

Thank you very much @kevinjohncutler! Looking forward to it.

kevinjohncutler commented 1 year ago

@alix-pham I got it working on my M2 MacBook Air. About a 2-4x speed improvement on the built-in test images in the GUI. Note that this is for evaluation only; I have not tested training yet. Most of the changes are in the cellpose backend, but pulling the omnipose changes and running pip install -e . should fetch the up-to-date cellpose changes as well. I still need to post an environment file for Apple Silicon installs, but if you are able to help test, that would be great.
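
For reference, a rough sketch of checking that the MPS-enabled build is picked up during evaluation outside the GUI, assuming the cellpose backend's Python API at the time (the model name, the omni keyword, and the import path may differ between versions):

import numpy as np
from cellpose import core, models

print('GPU available:', core.use_gpu())  # should report True once the MPS build is working

model = models.CellposeModel(gpu=True, model_type='bact_phase_omni')
img = np.random.rand(256, 256)  # stand-in for a real phase-contrast image
masks, flows, styles = model.eval(img, channels=[0, 0], omni=True)
print(masks.shape)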

alix-pham commented 1 year ago

Thank you so much @kevinjohncutler! I would say I am not familiar enough with those things to help you test it, unfortunately... I am not even sure I understand what I should do to try GPU support on my Mac with the information you provided... 😬 I updated omnipose using pip install git+https://github.com/kevinjohncutler/omnipose.git, but when running the segmentation I get this output (the same as usual):

2022-10-20 11:04:18,530 [INFO] TORCH GPU version not installed/working.
>>> GPU activated? 0
2022-10-20 11:04:18,531 [INFO] >>bact_phase_omni<< model set to be used
2022-10-20 11:04:18,532 [INFO] >>>> using CPU

Is there something else I should do? Where should I run pip install -e .? I did not clone the repo on my computer (and as I am not sure what I'm supposed to do with it, I didn't run it yet).

Thanks in advance!

PS: Congrats on your Nature Methods paper 🥳

kevinjohncutler commented 1 year ago

Thanks @alix-pham! I see, I figured since you were requesting mac GPU support you knew what a world of pain you were getting yourself into haha. I may finally have time this weekend to put together a conda environment and installation instructions to make it relatively painless. It's possible that the macOS GUI executable will also 'just work', but I need to compile a new version. The issue you are running into is just to do with the dependencies, but the conda environment will aim to solve that.

alix-pham commented 1 year ago

No, sorry! I only want the processing to be faster: we are working with big movies, and using the GPU should speed things up. I am not using the GUI, though, because I'm complementing the segmentation with a tracking pipeline; I figured it would be easier that way. Thank you very much!

mccruz07 commented 1 year ago

Hi @kevinjohncutler,

Thanks for all your work! I can confirm the GPU support for Mac M1/2 with your modifications, but unfortunately I'm receiving an AttributeError while trying to train a new model.

I tested on an M1 Mac (Python 3.10.4) and on Windows 11 (Python 3.8.4, pytorch 1.11.0, cudatoolkit 11.3.1).

Here is the report:

> python -m omnipose --train --use_gpu --dir ./Documents/omni --mask_filter _masks --n_epochs 100 --pretrained_model None --learning_rate 0.1 --diameter 0 --batch_size 16 --RAdam
!NEW LOGGING SETUP! To see cellpose progress, set --verbose
No --verbose => no progress or info printed
2022-11-16 16:02:24,512 [INFO] ** TORCH GPU version installed and working. **
2022-11-16 16:02:24,512 [INFO] >>>> using GPU
Omnipose enabled. See Omnipose repo for licencing details.
2022-11-16 16:02:24,512 [INFO] Training omni model. Setting nclasses=4, RAdam=True
2022-11-16 16:02:24,514 [INFO] not all flows are present, will run flow generation for all images
2022-11-16 16:02:24,515 [INFO] training from scratch
2022-11-16 16:02:24,515 [INFO] median diameter set to 0 => no rescaling during training
2022-11-16 16:02:24,601 [INFO] No precomuting flows with Omnipose. Computed during training.
2022-11-16 16:02:24,608 [INFO] >>> Using RAdam optimizer
2022-11-16 16:02:24,608 [INFO] >>>> training network with 2 channel input <<<<
2022-11-16 16:02:24,608 [INFO] >>>> LR: 0.10000, batch_size: 16, weight_decay: 0.00001
2022-11-16 16:02:24,608 [INFO] >>>> ntrain = 2
2022-11-16 16:02:24,608 [INFO] >>>> nimg_per_epoch = 2
/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/core.py:1105: UserWarning: The operator 'aten::linalg_vector_norm' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1668586478573/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  denom = torch.multiply(torch.linalg.norm(x,dim=1),torch.linalg.norm(y,dim=1))+eps
2022-11-16 16:02:33,471 [INFO] Epoch 0, Time  8.9s, Loss 4.7680, LR 0.1000
2022-11-16 16:02:34,080 [INFO] saving network parameters to /Users/mcruz/Documents/omni/models/cellpose_residual_on_style_on_concatenation_off_omni_nclasses_4_omni_2022_11_16_16_02_24.602007
Traceback (most recent call last):
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/omnipose/__main__.py", line 3, in <module>
    main(omni_CLI=True)
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/__main__.py", line 476, in main
    cpmodel_path = model.train(images, labels, train_files=image_names,
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/models.py", line 1045, in train
    model_path = self._train_net(train_data, train_labels,
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/core.py", line 1057, in _train_net
    self.net.save_model(file_name)
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1504, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'save_model'  

Best, Mario

kevinjohncutler commented 1 year ago

@mccruz07 Thanks for the report! I have not tried training on apple silicon yet, but it looks like that might be a simple fix. I'll look into it in the next week.
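
For context, the message itself comes from how torch.nn.DataParallel wraps a model: custom methods defined on the underlying network are only reachable through .module. A small generic illustration (not the actual cellpose code, and not necessarily the fix that was applied):

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
    def save_model(self, path):
        # custom method, analogous to the one the cellpose network defines
        torch.save(self.state_dict(), path)

net = nn.DataParallel(TinyNet())
# net.save_model('weights.pt')         # raises: 'DataParallel' object has no attribute 'save_model'
net.module.save_model('weights.pt')    # the wrapped module still exposes it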

kevinjohncutler commented 1 year ago

@mccruz07 Turns out all GPU training was broken due to a recent change I made to fix a bug for CPU training. I fixed it now in cellpose-omni v0.7.3. I will test it on an M2 mac in the next couple days, but let me know if you get a chance to test it earlier.

mccruz07 commented 1 year ago

@kevinjohncutler Thank you! But now I'm receiving the following error:

/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/core.py:1111: UserWarning: The operator 'aten::linalg_vector_norm' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications.

kevinjohncutler commented 1 year ago

Update: with torch 1.13.1, training is working on Apple Silicon =D @mccruz07 I am still getting that warning, but no errors. Really rough benchmark on a small (5-image) dataset: Titan RTX takes 34.1s for the first 100 epochs, M2 GPU takes 95.7s, so about 2.8x slower. For reference, CPU training on my Ubuntu machine with a core i9 9900K is 161.3s (1.7x slower than M2 GPU) and M2 CPU takes 277.6s (2.9x slower than M2 GPU).

I'm thinking about getting a Mac Studio to have much more VRAM than any consumer NVIDIA card can offer... @michaels10, what config do you have?

michaels10 commented 1 year ago

Hi -- I have a Mac M1 Ultra; that said, I just tried the GPU and it doesn't seem to be working for me; I get a rather uneventful error message: 2023-01-12 17:59:42,705 [INFO] TORCH GPU version not installed/working.

I used pip install -e . on the cloned repo, and I'm running the command python -m cellpose --train --pretrained_model bact_omni --use_gpu --chan 0 --dir "/Users/michaelsandler/Documents/experiments/omnipose-test/training-data" --n_epochs 100 --learning_rate 0.1 --verbose. PyTorch is version 1.12, which I believe is greater than 1.4 in their versioning scheme?

Minimal set to reproduce same as in the other bug.

kevinjohncutler commented 1 year ago

Thanks @michaels10, good to know. 64 or 128GB of RAM? In addition to the cellpose_omni bug which I hope is now fixed for you (it should now download v0.8.0 or higher), it's probably that your conda environment is not set up for pytorch on M1 - you are right, torch 1.13.1 is what you want. I will update this repo with an environment file for Macs.
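
A quick way to check whether the active PyTorch build actually supports MPS (generic PyTorch, assuming torch 1.12 or newer):

import torch

print(torch.__version__)                  # 1.13.1 or newer is what you want for training
print(torch.backends.mps.is_built())      # was this wheel compiled with MPS support?
print(torch.backends.mps.is_available())  # can the Apple GPU be reached? (False under Rosetta/x86_64 Python)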

kevinjohncutler commented 1 year ago

Ok, try out omnipose_mac_environment.yml. I installed it with

conda env create --name omnipose --file /Volumes/DataDrive/omnipose_mac_environment.yml
conda activate omnipose
pip install git+https://github.com/kevinjohncutler/omnipose.git
pip install git+https://github.com/kevinjohncutler/cellpose-omni.git

To my amazement, it worked the first time around. However, I have some notes from my first attempts getting this to work months back, and it is possible that some dependencies actually need to be compiled from source and my conda environment is just using the right versions from the base environment... we shall see once more people try this out.

michaels10 commented 1 year ago

Tried it, alas to no avail. PyTorch is version 1.13.1 -- also, my computer is the 128GB model.

The only potentially relevant warning I get is:

/Users/michaelsandler/opt/anaconda3/envs/omnipose/lib/python3.9/runpy.py:127: RuntimeWarning: 'cellpose_omni.__main__' found in sys.modules after import of package 'cellpose_omni', but prior to execution of 'cellpose_omni.__main__'; this may result in unpredictable behaviour

Weirdly, someone in our lab with the exact same setup is having no problems. Maybe something is wrong with my base environment.

kevinjohncutler commented 1 year ago

Interesting. I'm not sure what to make of that error, but I know from practice that one way to be totally sure your environment is disjoint from base is by specifying a different version of python. Omnipose works on every version of python I've tried so far (3.8.5+), so if your base is on 3.9, maybe you should try 3.10.8.

michaels10 commented 1 year ago

Update: After some extended sleuthing, I found that this was tied to Rosetta, which I had left enabled... whoops. Works well, didn't even need to set MPS as fallback!

That said, it does spit out the following warning: UserWarning: The operator 'aten::linalg_vector_norm' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:11.)

This is obviously an upstream issue, and I already get substantial speedup (5x), so I'm pretty happy with this. Is there anything you'd like me to test?

kevinjohncutler commented 1 year ago

@michaels10 very nice! If you are using my environment file, I actually do set the environment variable PYTORCH_ENABLE_MPS_FALLBACK: '1'. I use torch.linalg.norm in 4 places, and maybe there is a workaround for the default 2-norm (just square and square root explicitly) to avoid that CPU fallback without waiting for another pytorch release.
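
A minimal sketch of that square-and-square-root workaround, assuming the norm is taken along dim=1 as in the denom line quoted earlier (not necessarily the exact code that ended up in cellpose-omni):

import torch

def norm2(v, dim=1, keepdim=False):
    # Explicit 2-norm (square, sum, sqrt); avoids aten::linalg_vector_norm,
    # so it should stay on the MPS device instead of falling back to the CPU.
    return torch.sqrt((v * v).sum(dim=dim, keepdim=keepdim))

x = torch.randn(8, 2, 64, 64)
y = torch.randn(8, 2, 64, 64)
eps = 1e-12
denom = norm2(x) * norm2(y) + eps  # same quantity as the warning's denom line
print(denom.shape)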

Thanks for offering! Have you tried any training so far, or just evaluation of existing models?

kevinjohncutler commented 1 year ago

Update on GPU performance: got my hands on a Mac Studio (M1 Ultra, 128GB) and it took 115.3s for the same 100 epoch test I ran earlier. I ran it again to make sure, and got 97.8s, 93.1s, and then 93.8s. Not sure what explains the speed difference (njit compilation, perhaps).

I implemented a workaround for the vector_norm function like I said, and it did speed things up a bit. Times were 79.1s, 69.4s, and 67.7s for three trials. So this is roughly 1/2 as fast as the Titan RTX, but with over 5x the available memory.

Unfortunately, I just found out (while attempting to run a 3D model) that pytorch does not currently support a bunch of basic 3D functions like conv3d. The whole reason I got the Mac Studio was to use Omnipose on memory-intensive 3D volumes. I'll either have to scrap that plan or implement those functions myself.
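
A quick probe for one of the missing 3D ops, assuming MPS fallback is not enabled (with PYTORCH_ENABLE_MPS_FALLBACK=1 the call would instead warn and run on the CPU); this is just a hypothetical check, not part of omnipose:

import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():  # only meaningful on an Apple Silicon build
    x = torch.randn(1, 1, 16, 16, 16, device='mps')  # tiny dummy volume
    w = torch.randn(1, 1, 3, 3, 3, device='mps')     # 3x3x3 kernel
    try:
        F.conv3d(x, w)
        print('conv3d ran on MPS')
    except (NotImplementedError, RuntimeError) as e:
        print('conv3d not supported on MPS:', e)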

For evaluation, I should note that my default GUI image takes about 0.4s on the Titan RTX and 0.6s on the M1, so roughly 2/3 as fast.

su2804 commented 1 year ago

Hi @kevinjohncutler, I'm on Apple Silicon with the same config as @michaels10, but when using the omnipose_mac_environment.yml to create the environment I get the following error:

(base) omnipose % conda env create --name omnipose_sil --file omnipose_mac_environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

su2804 commented 1 year ago

Quick update: I solved the issue above by simply using a prefix (indicating the arm64 architecture) with the conda env create command:

CONDA_SUBDIR=osx-arm64 conda env create --name omnipose_sil --file omnipose_mac_environment.yml

and voila, everything starts to work. For the first time, I see the magical words: 2023-06-15 17:11:08,636 [INFO] TORCH GPU version installed and working. However, I'm having an issue with training using the following command:

python -m omnipose --train --pretrained_model None --use_gpu --chan 0 --dir /Users/saranshumale/Documents/Data/Asymmetry/April28MM/Cell1/BF_copy/ --n_epochs 100 --learning_rate 0.1

Error:

!NEW LOGGING SETUP! To see cellpose progress, set --verbose
No --verbose => no progress or info printed
2023-06-15 17:11:08,636 [INFO] TORCH GPU version installed and working.
2023-06-15 17:11:08,636 [INFO] >>>> using GPU
Traceback (most recent call last):
  File "/opt/anaconda3/envs/omnipose_sil/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/anaconda3/envs/omnipose_sil/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/saranshumale/Documents/omnipose/omnipose/__main__.py", line 3, in <module>
    main(omni_CLI=True)
  File "/opt/anaconda3/envs/omnipose_sil/lib/python3.9/site-packages/cellpose_omni/__main__.py", line 254, in main
    if args.nchan>1:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
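
For reference, the failing comparison is args.nchan > 1 with args.nchan left as None when the flag is not supplied. A None-safe guard along these lines would avoid the crash (an illustrative sketch, not the actual cellpose_omni fix); alternatively, supplying the channel count explicitly on the command line, if the CLI exposes an --nchan option, should sidestep it:

def wants_multichannel(nchan, default=1):
    # Treat a missing nchan as the default channel count instead of comparing None to an int.
    nchan = default if nchan is None else nchan
    return nchan > 1

print(wants_multichannel(None))  # False, instead of raising TypeError
print(wants_multichannel(2))     # True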

kevinjohncutler commented 8 months ago

@su2804 Sorry I never saw there was activity on this thread. Are you still experiencing that training issue?