sakamomo554101 commented 2 years ago

Scope of work

Derived from 16.

Build the training pipeline using Vertex AI Pipelines
Organize the documentation (training procedure)

sakamomo554101 commented 2 years ago

It's a bit of a problem that the model_pipeline tests can't be run quickly...

sakamomo554101 commented 2 years ago

Is something off inside Trainer when it is handed a DataFrame loaded from CSV?

sakamomo554101 commented 2 years ago

The code below looks like the culprit... I'll check how it differs from the test code.

★dataset.py

train_dataset = pd.read_csv(args.train_data_path, index_col=0)
val_dataset = pd.read_csv(args.val_data_path, index_col=0)
test_dataset = pd.read_csv(args.test_data_path, index_col=0)

sakamomo554101 commented 2 years ago

I ran the tests in trainer and they just plain fail, lol

 > [test_runner 3/3] RUN python3 -m unittest:                                                                                                                                                                                 
#12 8.100 /component/tests/test_main.py:19: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.                 
#12 8.100   return yaml.load(f)                                                                                                                                                                                               
#12 8.100 /component/src/trainer.py:73: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.                     
#12 8.100   parameters = yaml.load(args.parameters)                                                                                                                                                                           
#12 8.100 EE
#12 8.102 ======================================================================
#12 8.102 ERROR: test_main (tests.test_main.TestTrainer)
#12 8.102 ----------------------------------------------------------------------
#12 8.102 Traceback (most recent call last):
#12 8.102   File "/component/tests/test_main.py", line 37, in test_main
#12 8.102     model = main(args=self.args)
#12 8.102   File "/component/src/trainer.py", line 73, in main
#12 8.102     parameters = yaml.load(args.parameters)
#12 8.102   File "/usr/local/lib/python3.8/site-packages/yaml/__init__.py", line 112, in load
#12 8.102     loader = Loader(stream)
#12 8.102   File "/usr/local/lib/python3.8/site-packages/yaml/loader.py", line 24, in __init__
#12 8.102     Reader.__init__(self, stream)
#12 8.102   File "/usr/local/lib/python3.8/site-packages/yaml/reader.py", line 85, in __init__
#12 8.102     self.determine_encoding()
#12 8.102   File "/usr/local/lib/python3.8/site-packages/yaml/reader.py", line 124, in determine_encoding
#12 8.102     self.update_raw()
#12 8.102   File "/usr/local/lib/python3.8/site-packages/yaml/reader.py", line 178, in update_raw
#12 8.102     data = self.stream.read(size)
#12 8.102 AttributeError: 'dict' object has no attribute 'read'
#12 8.102 
#12 8.102 ======================================================================
#12 8.102 ERROR: test_main (tests.test_main.TestTrainer)
#12 8.102 ----------------------------------------------------------------------
#12 8.102 Traceback (most recent call last):
#12 8.102   File "/component/tests/test_main.py", line 34, in tearDown
#12 8.102     shutil.rmtree(self.out_dest.trained_model_dir)
#12 8.102   File "/usr/local/lib/python3.8/shutil.py", line 709, in rmtree
#12 8.102     onerror(os.lstat, path, sys.exc_info())
#12 8.102   File "/usr/local/lib/python3.8/shutil.py", line 707, in rmtree
#12 8.102     orig_st = os.lstat(path)
#12 8.102 FileNotFoundError: [Errno 2] No such file or directory: '/component/tests/output'
#12 8.102 
#12 8.102 ----------------------------------------------------------------------
#12 8.102 Ran 1 test in 0.008s
#12 8.102 
#12 8.102 FAILED (errors=2)
------
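The log above shows two separate failures: `yaml.load` receives a value that is already a `dict` (hence `'dict' object has no attribute 'read'`), and `tearDown` calls `shutil.rmtree` on an output directory that was never created. A minimal sketch of one way to make both robust, assuming the parameters may arrive either as a dict or as YAML text. The helper names `load_parameters` and `remove_output_dir` are hypothetical and are not taken from trainer.py.

```python
import shutil
from typing import Any, Mapping, Union


def load_parameters(raw: Union[str, bytes, Mapping[str, Any]]) -> dict:
    """Return training parameters as a dict, given either a dict or YAML text."""
    if isinstance(raw, Mapping):
        # Already parsed: this is the case that raised AttributeError in the log.
        return dict(raw)
    # PyYAML is imported lazily so the dict path has no extra dependency.
    import yaml
    # safe_load avoids the YAMLLoadWarning about calling load() without a Loader.
    return yaml.safe_load(raw) or {}


def remove_output_dir(path: str) -> None:
    """tearDown helper: do not fail when the directory does not exist."""
    shutil.rmtree(path, ignore_errors=True)


print(load_parameters({"max_epochs": 1}))
remove_output_dir("/tmp/youyaku-missing-output-dir")  # hypothetical path
```

With `ignore_errors=True`, a failing test no longer produces a second, misleading `FileNotFoundError` from cleanup.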

sakamomo554101 commented 2 years ago

Ah, the tests hit the same error too.

sakamomo554101 commented 2 years ago

Fixed it; running the pipeline again now.

sakamomo554101 commented 2 years ago

The error below is a complete mystery...

fuse: writeMessage: no such file or directory [16 0 0 0 218 255 255 255 188 1 0 0 0 0 0 0]

sakamomo554101 commented 2 years ago

Judging from the message below, is it running out of memory?

The replica workerpool0-0 ran out-of-memory and exited with a non-zero status of 137(SIGKILL). To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=749925056555&resource=ml_job%2Fjob_id%2F1796995624948727808&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.label

sakamomo554101 commented 2 years ago

It looks like the job is running on machine type e2-standard-4.

sakamomo554101 commented 2 years ago

https://gcpinstances.doit-intl.com/ The page above lists the GCP machine types. 16 GiB may be a bit short.

sakamomo554101 commented 2 years ago

https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline Oh, it looks like a memory limit and similar settings can be configured?

sakamomo554101 commented 2 years ago

https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#kfp.dsl.Sidecar.set_memory_request set_memory_request is probably the better fit. (A limit is only an upper bound, after all.)

sakamomo554101 commented 2 years ago

Oh, setting 32G with set_memory_limit changed the instance type to e2-highmem-4!

sakamomo554101 commented 2 years ago

To use a GPU, setting add_node_selector_constraint looks like the way to go. It seems to take a key-value pair, but what should the values be?

sakamomo554101 commented 2 years ago

https://cloud.google.com/vertex-ai/docs/training/configure-compute#specifying_gpus The page above lists the GPU types.

It apparently gets set like this:

add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')

sakamomo554101 commented 2 years ago

Huh, apparently you can also request a high-memory instance, like this.

https://techblog.zozo.com/entry/aip-pipelines-impl

add_node_selector_constraint('cloud.google.com/gke-nodepool', 'high-memory-pool')

sakamomo554101 commented 2 years ago

Let's try NVIDIA_TESLA_K80.

sakamomo554101 commented 2 years ago

The machine type became n1-highmem-8 (after specifying NVIDIA_TESLA_K80).

sakamomo554101 commented 2 years ago

Hmm, the GPU doesn't seem to be used.

GPU available: False, used: False

sakamomo554101 commented 2 years ago

I'll add set_gpu_limit as well.

sakamomo554101 commented 2 years ago

Hmm, still GPU available: False (even with set_gpu_limit(1)).
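The three calls tried so far (set_memory_limit, add_node_selector_constraint, set_gpu_limit) could be bundled into one helper. A rough sketch, assuming the pipeline task object exposes those methods as described in the KFP docs linked above. `apply_resources` and `RecordingTask` are hypothetical names; `RecordingTask` only stands in for a real task so the snippet runs without the KFP SDK. Whether the accelerator value should be written as `NVIDIA_TESLA_K80` or `nvidia-tesla-k80` is left as a parameter, since both spellings appear in this thread.

```python
from typing import List, Optional, Tuple

# Node selector key quoted earlier in this thread.
GPU_SELECTOR_KEY = "cloud.google.com/gke-accelerator"


def apply_resources(task, memory: str = "32G",
                    gpu_type: Optional[str] = None, gpu_count: int = 1):
    """Apply memory and (optionally) GPU settings to a pipeline task."""
    task.set_memory_limit(memory)
    if gpu_type is not None:
        task.add_node_selector_constraint(GPU_SELECTOR_KEY, gpu_type)
        task.set_gpu_limit(gpu_count)
    return task


class RecordingTask:
    """Hypothetical stand-in for a KFP task; it only records method calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, tuple]] = []

    def set_memory_limit(self, memory):
        self.calls.append(("set_memory_limit", (memory,)))
        return self

    def add_node_selector_constraint(self, key, value):
        self.calls.append(("add_node_selector_constraint", (key, value)))
        return self

    def set_gpu_limit(self, count):
        self.calls.append(("set_gpu_limit", (count,)))
        return self


task = apply_resources(RecordingTask(), memory="32G", gpu_type="NVIDIA_TESLA_K80")
for name, args in task.calls:
    print(name, args)
```

This only requests the hardware. It does not by itself explain the GPU available: False result, which may also depend on the container image.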

sakamomo554101 commented 2 years ago

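I can't tell how training is progressing... Running locally shows the progress output, but it can't be seen in the Vertex AI logs. I'd like to record it in something like TensorBoard.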

sakamomo554101 commented 2 years ago

Open issues

The GPU can't be used
Training progress isn't visible

sakamomo554101 commented 2 years ago

https://cloud.google.com/vertex-ai/docs/training/code-requirements#tensorboard For visualizing training, the page above looks like the place to start.
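As far as the linked page goes, Vertex AI custom training exposes the TensorBoard log location through the `AIP_TENSORBOARD_LOG_DIR` environment variable when a TensorBoard instance is attached to the job. A small sketch of resolving the log directory with a local fallback. The function name `tensorboard_log_dir` and the fallback path are assumptions, and wiring the result into the trainer (for example via PyTorch Lightning's `TensorBoardLogger(save_dir=...)`) is untested here.

```python
import os


def tensorboard_log_dir(default: str = "./tb_logs") -> str:
    """Use the directory provided by Vertex AI if present, else a local default."""
    return os.environ.get("AIP_TENSORBOARD_LOG_DIR") or default


log_dir = tensorboard_log_dir()
print("TensorBoard logs will be written to:", log_dir)
# The directory could then be handed to the training logger, e.g.
#   pl.loggers.TensorBoardLogger(save_dir=log_dir)
# (shown as a comment only, so this snippet has no third-party dependency).
```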

sakamomo554101 commented 2 years ago

https://cloud.google.com/blog/ja/topics/developers-practitioners/pytorch-google-cloud-how-train-pytorch-models-ai-platform Hmm, is the PyTorch Docker image the problem?

sakamomo554101 commented 2 years ago

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch Maybe I should pull NVIDIA's PyTorch image instead.

sakamomo554101 commented 2 years ago

https://www.kubeflow.org/docs/distributions/gke/pipelines/enable-gpu-and-tpu/#configure-containerop-to-consume-gpus According to the page above, the GPU should be usable this way, though...

sakamomo554101 commented 2 years ago

Going by the error below, the NVIDIA driver doesn't seem to be installed.

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
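One quick way to narrow this down is to check, at container start, whether the driver tooling is visible at all before PyTorch is involved. A minimal sketch using only the standard library, assuming that a working driver setup makes `nvidia-smi` available on `PATH` inside the container (this holds for typical NVIDIA container runtimes, but is an assumption here). The function name `nvidia_driver_visible` is hypothetical.

```python
import shutil
import subprocess


def nvidia_driver_visible(timeout_sec: float = 10.0) -> bool:
    """Return True if `nvidia-smi` is on PATH and lists GPUs successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run([exe, "-L"], capture_output=True, timeout=timeout_sec)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA driver visible:", nvidia_driver_visible())
```

If this prints False on a GPU machine type, the image or runtime is the more likely suspect than the training code.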

sakamomo554101 commented 2 years ago

https://cloud.google.com/blog/ja/topics/developers-practitioners/scalable-ml-workflows-using-pytorch-kubeflow-pipelines-and-vertex-pipelines

sakamomo554101 commented 2 years ago

https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/pytorch_cifar10_vertex_pipelines.ipynb
https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/Dockerfile-gpu#L15

Looking at the above, the Dockerfile's base image isn't pytorch/pytorch.

sakamomo554101 commented 2 years ago

It was failing at the model save step.

2021-11-04T05:45:27.256672631Z Traceback (most recent call last):
2021-11-04T05:45:27.256699791Z   File "/component/src/trainer.py", line 168, in <module>
2021-11-04T05:45:27.256706251Z     model.save(
2021-11-04T05:45:27.256710241Z TypeError: save() got an unexpected keyword argument 'model'

sakamomo554101 commented 2 years ago

After switching to a container where the GPU should work, the job died with the error below... Why...

"The replica workerpool0-0 exited with a non-zero status of 139(SIGSEGV). Termination reason: Error

SIGSEGV, huh...
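The statuses seen so far follow the usual shell convention in which a process killed by a signal reports 128 plus the signal number, so 137 corresponds to signal 9 and 139 to signal 11. A small sketch that decodes such statuses with the standard `signal` module. It assumes a POSIX system, and `describe_exit_status` is a hypothetical helper name.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a container exit status using the 128 + signal-number convention."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"application error (exit code {status})"


for status in (137, 139):
    print(status, "->", describe_exit_status(status))
```

A status at or below 128 means the process exited on its own, so the cause has to be found in the application logs instead.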

sakamomo554101 commented 2 years ago

https://cloud.google.com/ai-platform/training/docs/understanding-training-service#common-errors

sakamomo554101 commented 2 years ago

https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch I'll try the images above.

This will probably tie things to GCP, but it'll do for now...

sakamomo554101 commented 2 years ago

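The error changed to the one below, but I still can't tell the details... One thing, though: with two GPUs configured, the training process is being run in duplicate. I'll restrict it to one GPU for now and try again.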

The replica workerpool0-0 exited with a non-zero status of 1. Termination reason: Error

sakamomo554101 commented 2 years ago

Hmm, the Vertex AI logs don't feel real-time... Logs show up all at once in batches, which makes it hard to follow what the job is doing.

sakamomo554101 commented 2 years ago

Judging from the logs, the GPU is now being used, so the Docker image may be fine.

sakamomo554101 commented 2 years ago

Opened a PR in 43.

sakamomo554101 commented 2 years ago

Even with set_gpu_limit(1), I get the error below.

The replica workerpool0-0 exited with a non-zero status of 1

sakamomo554101 commented 2 years ago

I'll try raising the memory to 64G... Though it shouldn't be using even 32 GB...

sakamomo554101 commented 2 years ago

Same error... What am I supposed to do...

sakamomo554101 commented 2 years ago

Maybe debugging line by line in Vertex AI Workbench (Notebook) is the way to go...

sakamomo554101 commented 2 years ago

Could it be that storage is running out...?

sakamomo554101 commented 2 years ago

https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#kfp.dsl.Sidecar.set_ephemeral_storage_limit Maybe the storage size limit can be set with the above?

sakamomo554101 commented 2 years ago

After switching to the pytorch/pytorch image, I get the error below.

2021-11-06T14:32:33.709158406Z Traceback (most recent call last):
2021-11-06T14:32:33.709192739Z   File "/component/src/trainer.py", line 183, in <module>
2021-11-06T14:32:33.709202956Z     model = main(artifacts.component_arguments)
2021-11-06T14:32:33.709211825Z   File "/component/src/trainer.py", line 79, in main
2021-11-06T14:32:33.709218218Z     test_dataset=test_dataset,
2021-11-06T14:32:33.709227122Z   File "/component/src/trainer.py", line 113, in train
2021-11-06T14:32:33.709232830Z     trainer = pl.Trainer(**train_params)
2021-11-06T14:32:33.709238288Z   File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 38, in insert_env_defaults
2021-11-06T14:32:33.709244386Z     return fn(self, **kwargs)
2021-11-06T14:32:33.709249998Z   File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 448, in __init__
2021-11-06T14:32:33.709256238Z     plugins,
2021-11-06T14:32:33.709261921Z   File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 181, in __init__
2021-11-06T14:32:33.709268Z     self.accelerator = self.select_accelerator()
2021-11-06T14:32:33.709273433Z   File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 803, in select_accelerator
2021-11-06T14:32:33.709279652Z     accelerator = acc_cls(training_type_plugin=self.training_type_plugin, precision_plugin=self.precision_plugin)
2021-11-06T14:32:33.709285574Z   File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 399, in precision_plugin
2021-11-06T14:32:33.709291123Z     self._precision_plugin = self.select_precision_plugin()
2021-11-06T14:32:33.709297268Z   File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 614, in select_precision_plugin
2021-11-06T14:32:33.709317182Z     f"You have asked for `amp_level={self.amp_level!r}` but it's only supported with `amp_backend='apex'`."
2021-11-06T14:32:33.709328874Z pytorch_lightning.utilities.exceptions.MisconfigurationException: You have asked for `amp_level='O1'` but it's only supported with `amp_backend='apex'`.
2021-11-06T14:32:48.105185817Z The replica workerpool0-0 exited with a non-zero status of 1. Termination reason: Error.
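The exception says `amp_level` is only accepted together with `amp_backend='apex'`. One possible workaround is to strip `amp_level` from the keyword arguments before `pl.Trainer(**train_params)` whenever apex is not the selected backend. A minimal sketch, assuming `train_params` is a plain dict as the traceback suggests. The function name is hypothetical, and whether dropping the key is acceptable for this model has not been checked.

```python
def drop_unsupported_amp_level(train_params: dict) -> dict:
    """Return a copy of train_params without amp_level unless amp_backend is 'apex'."""
    cleaned = dict(train_params)
    if cleaned.get("amp_backend") != "apex":
        # amp_level (e.g. 'O1') is an apex-only option according to the error above.
        cleaned.pop("amp_level", None)
    return cleaned


# Hypothetical parameter sets, for illustration only.
print(drop_unsupported_amp_level({"gpus": 1, "amp_level": "O1"}))
print(drop_unsupported_amp_level({"gpus": 1, "amp_level": "O1", "amp_backend": "apex"}))
```

The input dict is left unmodified, so the original configuration can still be logged as-is.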

sakamomo554101 commented 2 years ago

The error below is showing up. Looks like an OOM.

RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.43 GiB total capacity; 6.46 GiB already allocated; 54.81 MiB free; 6.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
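The numbers in this message can be compared directly. A rough sketch that parses the sizes (converted to MiB) so the request can be set against the free memory, and the reserved amount against the allocated amount. The regular expressions are written against this specific message format and are an assumption for any other PyTorch version.

```python
import re

MESSAGE = (
    "CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.43 GiB total capacity; "
    "6.46 GiB already allocated; 54.81 MiB free; 6.57 GiB reserved in total by PyTorch)"
)

_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved",
}


def parse_oom_message(message: str) -> dict:
    """Extract the sizes from a PyTorch CUDA OOM message, all converted to MiB."""
    sizes = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            sizes[key] = float(match.group(1)) * _UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_message(MESSAGE)
print("request exceeds free memory:", sizes["requested"] > sizes["free"])
print("reserved but not allocated (MiB):", round(sizes["reserved"] - sizes["allocated"], 2))
```

Here the gap between reserved and allocated memory is only about 0.11 GiB, which suggests that most of the card is held by live tensors and not by the allocator cache. A GPU with more memory or a smaller batch may therefore matter more than allocator tuning.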

sakamomo554101 commented 2 years ago

http://fruitsoflife.sblo.jp/article/189004809.html
https://pytorch.org/docs/stable/notes/cuda.html

sakamomo554101 commented 2 years ago

I'll try setting PYTORCH_NO_CUDA_MEMORY_CACHING=1. The OOM looks like it's caused by caching of the memory (RAM) on the GPU.
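Both allocator settings mentioned so far (`PYTORCH_NO_CUDA_MEMORY_CACHING` and the `max_split_size_mb` option of `PYTORCH_CUDA_ALLOC_CONF` from the error message) are environment variables, so they need to be in place before CUDA is initialized. Setting them before `import torch` is the safe choice. A minimal sketch, where `configure_cuda_allocator` is a hypothetical helper that would be called at the very top of the entry point.

```python
import os
from typing import Optional


def configure_cuda_allocator(disable_caching: bool = True,
                             max_split_size_mb: Optional[int] = None) -> None:
    """Set PyTorch CUDA allocator environment variables for this process."""
    if disable_caching:
        # Debugging aid: bypasses the caching allocator, usually at a speed cost.
        os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    if max_split_size_mb is not None:
        # Limits block splitting to reduce fragmentation, as the error message hints.
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = f"max_split_size_mb:{max_split_size_mb}"


configure_cuda_allocator(disable_caching=True)
# import torch  # would follow here, after the environment is configured
print(os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"])
```

Neither setting reduces the memory that live tensors need, so this helps only if caching or fragmentation is actually the cause.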

sakamomo554101 commented 2 years ago

https://cloud.google.com/compute/docs/gpus#other_available_nvidia_gpu_models According to the page above, the P4 has 8 GiB. That may be slightly short.

sakamomo554101 commented 2 years ago

Oh, switching to a P100 got the training through! It still fails at the model save step, though...

It's a lot faster, too.

sakamomo554101 / YouyakuAI

Build the training pipeline on Vertex AI Pipelines #24
