tunib-ai / parallelformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
https://tunib-ai.github.io/parallelformers
Apache License 2.0

Issue running parallelformers test script in a VM #23

Open Mehrad0711 opened 2 years ago

Mehrad0711 commented 2 years ago

How to reproduce

First of all, thanks for this great project!

I'm facing an issue running the test code provided here on Kubernetes.

This is what I'm running inside a Kubeflow pod:

python3 tests/seq2seq_lm.py --test-name=test --name=Helsinki-NLP/opus-mt-en-zh --gpu-from=0 --gpu-to=3 --use-pf

I'm using a g4dn.12xlarge AWS machine with four T4 GPUs.

The pod hangs when executing this line until I manually terminate it.

I suspected this change might have been the culprit, so I ran the same code with v1.2.4 of parallelformers. This time, the pod exits during execution of the same line without printing any errors, which is odd.

Notably, if I run the same command without --use-pf it runs fine.

I saw you've reported some problems when running inside Docker. However, memory should not be an issue here, since I'm using the Helsinki-NLP/opus-mt-en-zh model, which is relatively small.

I was wondering whether the parallelformers code has ever been tested on Kubernetes. I'd also appreciate it if you could look into this issue. Thanks!

Environment

hyunwoongko commented 2 years ago

Can you try running that inside an `if __name__ == '__main__':` block?
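For illustration, here is a minimal sketch of what that suggestion would look like, assuming the `parallelize` API shown in the parallelformers README and the Helsinki-NLP/opus-mt-en-zh model from the report (model name, GPU count, and generation arguments are taken from this thread or are illustrative, not from a verified repro). Because parallelformers launches one worker process per GPU, the entry point needs a `__main__` guard on spawn-based setups so that child processes do not re-execute the parallelization code.

```python
# Minimal sketch, not a verified fix: wrap the parallelization entry point
# in a __main__ guard so spawned worker processes don't re-run it.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from parallelformers import parallelize


def main():
    name = "Helsinki-NLP/opus-mt-en-zh"  # model from the issue report
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)

    # Shard the model across the 4 T4 GPUs mentioned in the report.
    parallelize(model, num_gpus=4, fp16=True, verbose="detail")

    # Illustrative generation call; inputs can stay on CPU, as
    # parallelformers handles device placement internally.
    inputs = tokenizer("Hello, world!", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```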