tunib-ai / parallelformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
https://tunib-ai.github.io/parallelformers
Apache License 2.0

Issue running parallelformers test script in a VM #23

Open Mehrad0711 opened 2 years ago

Mehrad0711 commented 2 years ago

How to reproduce

First of all, thanks for this great project!

I'm facing an issue running the test code provided here on Kubernetes.

This is what I'm running inside a Kubeflow pod:

python3 tests/seq2seq_lm.py --test-name=test --name=Helsinki-NLP/opus-mt-en-zh --gpu-from=0 --gpu-to=3 --use-pf

I'm using a g4dn.12xlarge AWS machine with four T4 GPUs.

The pod hangs when executing this line until I manually terminate it.

I suspected this change might have been the culprit, so I ran the same code with v1.2.4 of parallelformers. This time, the pod exits during execution of the same line without printing any errors, which is odd.

Notably, if I run the same command without --use-pf it runs fine.

I saw you've reported some problems when running inside Docker. However, memory should not be an issue here, since I'm using the Helsinki-NLP/opus-mt-en-zh model, which is relatively small.

I was wondering whether the parallelformers code has ever been tested on Kubernetes. I'd also appreciate it if you could look into this issue. Thanks!

Environment

hyunwoongko commented 2 years ago

Can you try running that inside an `if __name__ == '__main__':` block?
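For illustration, here is a minimal sketch of what that suggestion would look like, assuming the `parallelize` API shown in the parallelformers README and the Helsinki-NLP/opus-mt-en-zh model from the report (model name, GPU count, and generation arguments are taken from this thread or are illustrative, not from a verified repro). Because parallelformers launches one worker process per GPU, the entry point needs a `__main__` guard on spawn-based setups so that child processes do not re-execute the parallelization code.

```python
# Minimal sketch, not a verified fix: wrap the parallelization entry point
# in a __main__ guard so spawned worker processes don't re-run it.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from parallelformers import parallelize


def main():
    name = "Helsinki-NLP/opus-mt-en-zh"  # model from the issue report
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)

    # Shard the model across the 4 T4 GPUs mentioned in the report.
    parallelize(model, num_gpus=4, fp16=True, verbose="detail")

    # Illustrative generation call; inputs can stay on CPU, as
    # parallelformers handles device placement internally.
    inputs = tokenizer("Hello, world!", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```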