orca3 / MiniAutoML

Source code for "Engineering Deep Learning Platforms"
http://mng.bz/GGgN

(suggestion) it would be good to support a MacBook M1 (ARM) env for the lab #37

larrycai opened this issue 2 years ago (status: Open)

larrycai commented 2 years ago

Lab env

Most MacBooks are M1-based now, so some container images that are x86-only can be tricky, like pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime

Also

larrycai commented 2 years ago

Env

MacBook Air M1; I use minikube to get a Docker env (inside QEMU), but hit a linux/amd64 issue with PyTorch

Problem

Problem 1: pytorch/pytorch

$ brew install minikube
$ minikube start --driver qemu
$ minikube ssh
# enter into qemu to have docker env
$ ./build-images-locally.sh 
Step 1/10 : FROM pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime
 ---> 3850639cdf7a
Step 2/10 : RUN pip3 install minio protobuf~=3.20.0 grpcio torch-model-archiver
 ---> [Warning] The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
 ---> Running in e639c32ba232
exec /bin/sh: exec format error
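The "exec format error" means an amd64 binary was executed on an arm64 kernel with no x86 emulation available. Whether amd64 images run at all depends on QEMU user-mode binfmt handlers being registered in the VM's kernel; a quick check (a sketch, assuming the common qemu-x86_64 handler name):

```shell
# Check whether this (arm64) kernel can run amd64 binaries through
# binfmt_misc; without such a handler registered, amd64 images fail
# with "exec format error".
if [ -e /proc/sys/fs/binfmt_misc/qemu-x86_64 ]; then
  echo "amd64 emulation registered"
else
  echo "no amd64 emulation registered"
fi
```

Docker Desktop registers these handlers in its VM, which would explain amd64 images working there but not inside a plain minikube QEMU VM.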

Problem 2: pytorch/torchserve

Same as before: pytorch/torchserve doesn't provide an arm64 container image
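One way to confirm this before pulling (a sketch; the command-v guard is just so the snippet degrades gracefully where Docker is absent) is to list the architectures a tag's manifest advertises:

```shell
# List the architectures published for a tag; per this thread,
# pytorch/torchserve tags only advertise amd64, hence the arm64 failures.
if command -v docker >/dev/null 2>&1; then
  docker manifest inspect pytorch/torchserve:0.5.3-cpu \
    | grep '"architecture"' | sort -u
else
  echo "docker not available"
fi
```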

Problem 3: config/torch_server_config.properties

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/home/docker/MiniAutoML/scripts/config/torch_server_config.properties" to rootfs at "/home/model-server/config.properties": mount /home/docker/MiniAutoML/scripts/config/torch_server_config.properties:/home/model-server/config.properties (via /proc/self/fd/6), flags: 0x5000: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.

Patched solution

I got ./lab-001-start-all.sh working with several patches (not complete for all the exercises)

Patch 1: use the arm64-based kumatea/pytorch:1.9.0 container image instead (unofficial)

FROM pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime
=> 
FROM kumatea/pytorch:1.9.0

Patch 2: build local pytorch/torchserve:latest-cpu

Download https://github.com/pytorch/serve and change docker/Dockerfile

torch==$TORCH_VER+cpu
=>
torch==$TORCH_VER

After the build it produces pytorch/torchserve:latest-cpu, which can replace pytorch/torchserve:0.5.3-cpu

Patch 3: touch $(pwd)/config/torch_server_config.properties

Not sure whether this is related to the other patches; without it, the path gets created as a folder
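A likely explanation (my reading of Docker's bind-mount behavior, not confirmed against the lab scripts): when the host path given to -v does not exist, the daemon auto-creates it as a directory, and a directory cannot then be mounted over the container's config.properties file. Creating the file first sidesteps that:

```shell
# Ensure the host-side file exists before the lab script bind-mounts it;
# otherwise dockerd auto-creates the path as a directory and the mount
# fails with "not a directory".
mkdir -p "$(pwd)/config"
touch "$(pwd)/config/torch_server_config.properties"
# the script's mount then works:
#   -v $(pwd)/config/torch_server_config.properties:/home/model-server/config.properties
```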

Suggestion

It may be worth generating dedicated ARM-based container images for this lab instead of these patches (Patch 1's image is not maintained by the original authors)
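A sketch of what publishing such images could look like with Docker buildx (the builder name and the --push target are illustrative, not the project's actual release process):

```shell
# Build one tag that carries both amd64 and arm64 variants, so readers
# on either architecture pull a native image from the same tag.
PLATFORMS="linux/amd64,linux/arm64"
echo "would build for: $PLATFORMS"
# docker buildx create --name orca3-multiarch --use
# docker buildx build --platform "$PLATFORMS" -t orca3/services:latest --push .
```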

Reference

larrycai commented 2 years ago

lab container env

Let's see how to make the lab scripts runnable in a container env

Install the needed packages inside a container; see the Dockerfile:

FROM continuumio/miniconda3:4.12.0
RUN apt-get update && apt-get install -y curl jq && apt-get clean
# conda & python3 & pip are ok
# minio mc
RUN curl -O https://dl.min.io/client/mc/release/linux-arm64/mc --output-dir /usr/local/bin && \
    chmod +x /usr/local/bin/mc
# grpcurl
RUN curl -L -O https://github.com/fullstorydev/grpcurl/releases/download/v1.8.7/grpcurl_1.8.7_linux_arm64.tar.gz && \
    tar -xvzf grpcurl_1.8.7_linux_arm64.tar.gz && chmod +x grpcurl && \
    mv grpcurl /usr/local/bin/grpcurl

Then simply build it as an image: docker build -t orca3-lab .
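The Dockerfile above hard-codes linux-arm64 download URLs. A hedged variant (assuming BuildKit, which sets the automatic TARGETARCH build arg to "amd64" or "arm64") could serve both architectures; note that grpcurl's release filenames use x86_64 rather than amd64, so a small mapping step is included:

```dockerfile
FROM continuumio/miniconda3:4.12.0
# TARGETARCH is provided automatically by BuildKit: "amd64" or "arm64".
ARG TARGETARCH
RUN apt-get update && apt-get install -y curl jq && apt-get clean
# minio mc for the build platform
RUN curl -o /usr/local/bin/mc \
        "https://dl.min.io/client/mc/release/linux-${TARGETARCH}/mc" && \
    chmod +x /usr/local/bin/mc
# grpcurl for the build platform (its releases name the amd64 asset "x86_64")
RUN GRPC_ARCH="$TARGETARCH"; \
    [ "$TARGETARCH" = "amd64" ] && GRPC_ARCH="x86_64"; \
    curl -L -O "https://github.com/fullstorydev/grpcurl/releases/download/v1.8.7/grpcurl_1.8.7_linux_${GRPC_ARCH}.tar.gz" && \
    tar -xzf "grpcurl_1.8.7_linux_${GRPC_ARCH}.tar.gz" grpcurl && \
    mv grpcurl /usr/local/bin/grpcurl
```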

lab-002 works

Now we can run it sharing host network

$ docker run -it -v $(pwd):/lab -w /lab --network=host orca3-lab
(base) root@minikube:/lab# ./lab-002-upload-data.sh 
..
Creating intent dataset
{
  "dataset_id": "1",
  "name": "tweet_emotion",
  "dataset_type": "TEXT_INTENT",
  "last_updated_at": "2022-10-02T06:04:48.411844Z",
  "commits": [
    {
      "dataset_id": "1",
      "commit_id": "1",
      "created_at": "2022-10-02T06:04:49.262442Z",
      "commit_message": "Initial commit",
      "path": "dataset/1/commit/1",
      "statistics": {
        "numExamples": "2963",
        "numLabels": "3"
      }
    }
  ]
}

lab-003 has issues

In the same env:

(base) root@minikube:/lab# ./lab-003-first-training.sh 1
dataset_id is 1
version_hash is "hashAg=="
job_id is 4
job 4 is currently in unknown status, check back in 5 seconds
job 4 is currently in "launch" status, check back in 5 seconds
ERROR:
  Code: NotFound
  Message: Run 4 doesn't exist
ERROR:
  Code: NotFound
  Message: Cannot locate model artifact for runId 4.

Check the log:

$ docker logs training-service
..
06:25:00.416 [grpc-default-executor-8] INFO  org.orca3.miniAutoML.ServiceBase - Method: training.TrainingService/GetTrainingStatus, Response: status: failure
job_id: 5
message: "Exit code 1"
metadata {
  algorithm: "intent-classification"
  dataset_id: "1"
  name: "test1"
  train_data_version_hash: "hashAg=="
  parameters {
    key: "BATCH_SIZE"
    value: "64"
  }
  parameters {
    key: "EPOCHS"
    value: "15"
  }
  parameters {
    key: "FC_SIZE"
    value: "128"
  }
  parameters {
    key: "LR"
    value: "4"
  }
  output_model_name: "twitter-model"
}

And another log (the runId may differ since it was copied at a different time):

$ docker logs prediction-service
07:39:49.225 [grpc-default-executor-14] INFO  org.orca3.miniAutoML.ServiceBase - Method: prediction.PredictionService/Predict, Message: runId: "11"
document: "You can have a certain #arrogance, and I think that\'s fine, but what you should never lose is the #respect for the others."

07:39:49.230 [grpc-default-executor-14] ERROR org.orca3.miniAutoML.prediction.PredictionService - Cannot locate model artifact for runId 11.
io.grpc.StatusRuntimeException: NOT_FOUND: Artifact with runId 11 doesn't exist
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
        at org.orca3.miniAutoML.metadataStore.MetadataStoreServiceGrpc$MetadataStoreServiceBlockingStub.getArtifact(MetadataStoreServiceGrpc.java:456)
        at org.orca3.miniAutoML.prediction.PredictionService.predict(PredictionService.java:62)
        at org.orca3.miniAutoML.prediction.PredictionServiceGrpc$MethodHandlers.invoke(PredictionServiceGrpc.java:204)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)

Checked MinIO:

(base) root@minikube:/lab# source env-vars.sh 
(base) root@minikube:/lab# mc alias -q set myminio http://127.0.0.1:"${MINIO_PORT}" "${MINIO_ROOT_USER}" "${MINIO_ROOT_PASSWORD}" 
(base) root@minikube:/lab# mc find  myminio/mini-automl-dm
myminio/mini-automl-dm/dataset/1/commit/1/examples.csv
myminio/mini-automl-dm/dataset/1/commit/1/labels.csv
myminio/mini-automl-dm/upload/tweet_emotion_part1.csv
myminio/mini-automl-dm/upload/tweet_emotion_part2.csv
myminio/mini-automl-dm/versionedDatasets/1/hashAg==/examples.csv
myminio/mini-automl-dm/versionedDatasets/1/hashAg==/labels.csv
dszeto commented 2 years ago

Thank you for your suggestions! Were you able to run the lab using images from Docker Hub?

larrycai commented 2 years ago

I didn't know you provide Docker images directly; it was not stated in the book when I reviewed it.

From the tags' point of view, no arm64 variant exists, so it mostly will not work on Mac M1/M2

dszeto commented 2 years ago

Did you run into any text (either in the book or in the README) that mentions running scripts/build-images-locally.sh? I wanted to make sure we are pointing our readers to try the stock images before trying to build locally.

I tried the full lab on my Apple M1. All containers worked fine under QEMU. It would be great to see whether you share the same success.

Hopefully all dependent container images will soon have their respective official arm64 versions.

larrycai commented 2 years ago

Even using QEMU, the VM is still arm64-based, so if the base images have no arm64 version, I can't build them correctly.

If I remember correctly, I followed the guidelines in either the book or the README here and recorded all my findings.

If you have recent changes in the scripts, I can check again.

// I use minikube for a Docker env; normally I use podman for containers, but both use QEMU

dszeto commented 1 year ago

Agree that scripts/build-images-locally.sh will not finish properly right now on Apple M1 hardware. We are hoping that arm64 base images become available soon so we don't need to maintain two separate ways to build images.

It looks like both the book and the README start with scripts/lab-001-start-all.sh. That's why I am wondering if there's anywhere in the material that suggested building images locally was necessary.

It would be great if you can try the lab again starting with scripts/lab-001-start-all.sh from a clean environment (clear any existing locally built Docker images from the lab so that it will pull stock images clean from Docker Hub), and see if that works.

Greatly appreciate your effort in trying to build these containers locally on Apple M1! Let me know if you are interested in helping to keep track of the release of arm64 base images.

larrycai commented 1 year ago

No, it doesn't work

$ brew install minikube
$ minikube start --driver qemu # I delete/start with clean env
$ minikube ssh
# enter into qemu to have docker env
(minikube) $ git clone https://github.com/orca3/MiniAutoML.git
(minikube) $ cd MiniAutoML
(minikube) $ scripts/lab-001-start-all.sh
Created docker network orca3
Unable to find image 'minio/minio:latest' locally
latest: Pulling from minio/minio
..
Status: Downloaded newer image for minio/minio:latest
b0bb0b8bf7577b958f69d2ee8eda8445418ff55fa9fe133cfd04f74f606e6f0d
Started minio docker container and listen on port 9000
..
Unable to find image 'orca3/services:latest' locally
latest: Pulling from orca3/services
...
Digest: sha256:4a70f0992171b55278ea58254df4c67a8674e27f04fae8b6a6fc6b5b45936659
Status: Downloaded newer image for orca3/services:latest
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
f190824fc5cd89e9790a8001636467305d98c668bae13573bfe6930de16fa359
Started data-management docker container and listen on port 6000
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
fb89a4d8cbd52cd3857037ae5b2838930a68086d23ebe6c2d9f120799b657502
Started metadata-store docker container and listen on port 6002
rm: cannot remove 'model_cache': No such file or directory
Unable to find image 'orca3/intent-classification-predictor:latest' locally
latest: Pulling from orca3/intent-classification-predictor
...
Digest: sha256:af2df5fd32e9488888c9e6d9a16f8a3e0f510436b94b41989f1d186e493929ad
Status: Downloaded newer image for orca3/intent-classification-predictor:latest
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
4414363ba68cf3109a20a7c671151594303e537663da3c78c54ae8be2d59ebb2
Started intent-classification-predictor docker container and listen on port 6101
Unable to find image 'pytorch/torchserve:0.5.2-cpu' locally
0.5.2-cpu: Pulling from pytorch/torchserve
284055322776: Pull complete 
bf7640766b3b: Pull complete 
d05665a60e73: Pull complete 
85824628b9b8: Pull complete 
d93240f6b9fe: Pull complete 
4f4fb700ef54: Pull complete 
Digest: sha256:52ce3f86274bc92aec7a73702358323724097a75cea6d60ac39cd5f445bf727e
Status: Downloaded newer image for pytorch/torchserve:0.5.2-cpu
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
3acb0394a243b672bd83e5fef63c6ce6dc8e736e3bd958e62248cc5df8ca03de
Started intent-classification-torch-predictor docker container and listen on port 6102 & 6103
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
ce2aeced028e784985e7a4df89298d858e611e9c4f8187e3a6455ce1f6667a4d
Started prediction-service docker container and listen on port 6001
latest: Pulling from orca3/intent-classification
...
Digest: sha256:386920bcf1bb81b82c37fe223547a46346fa890893c060f99c2091b971e9d1a3
Status: Downloaded newer image for orca3/intent-classification:latest
docker.io/orca3/intent-classification:latest
pull intent-classification training image
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
c38b84a7784ea0efdfb8e7cec05d520637db4b27c3a628809ca748e9e3e2fae1
Started training-service docker container and listen on port 6003
...

So

You need to find a way to build arm64 Docker images

My thinking

When you have new updates, I can always help verify; I like your book and these lab materials

dszeto commented 1 year ago

Thanks for running through step 1. Can you try the subsequent steps as well? I also run on Apple M1 and was able to finish all steps in the lab using just Docker Desktop with linux/amd64 base images. It would be great if you can try running the rest of the scripts, just to make sure the lab works on Apple M1 with emulation.

We can tackle building a set of images for linux/arm64 as a separate effort.

Thanks for liking the book and the lab! :) This is the best thing authors would like to hear. Feel free to provide any other feedback or suggestions.

larrycai commented 1 year ago

Sorry, I don't have a Docker Desktop env. I guess it should work in an emulation env (amd64). I will wait and test arm64 when it is available.

dszeto commented 1 year ago

What version of Docker are you using? It probably doesn't require Docker Desktop. Here's the one I'm using.

❯ docker version
Client:
 Cloud integration: v1.0.29
 Version:           20.10.20
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        9fdeb9c
 Built:             Tue Oct 18 18:20:35 2022
 OS/Arch:           darwin/arm64
 Context:           desktop-linux
 Experimental:      true

Server: Docker Desktop 4.13.0 (89412)
 Engine:
  Version:          20.10.20
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       03df974
  Built:            Tue Oct 18 18:18:16 2022
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0