Status: Closed (jaspock closed 2 years ago)
Hi,
I am not entirely sure why this happens, but let me take a stab. It is most likely related to the --ipaddr flag and to line 884 of pretrain_nmt.py, which reads "os.environ['MASTER_PORT'] = '26023'".
The default value of --ipaddr, localhost, may be a problem inside Docker, or port 26023 may already be in use. Basically, it seems like the process is waiting for something, so experimenting with these two settings may help.
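One way to rule out the "port already in use" theory is to test the port from inside the container before launching training. The following is only a rough sketch using the Python standard library; the helper names port_is_free and find_free_port are made up for illustration and are not part of YANMTT.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Hypothetical helper: True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" (or no permission to bind).
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Hypothetical helper: ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


if __name__ == "__main__":
    # 26023 is the value hard-coded for MASTER_PORT in pretrain_nmt.py.
    print("26023 free:", port_is_free(26023))
    print("a free alternative:", find_free_port())
```

If 26023 turns out to be taken, the hard-coded MASTER_PORT value could be changed to a free port. Note that a port reported as free can still be claimed by another process before training starts, so this is a diagnostic rather than a guarantee.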
Other than that, I can suggest that you try running it outside a Docker environment.
Hope this helps.
This issue seemed to be related to some incompatibilities between my CUDA version and the versions of TensorFlow and/or PyTorch in requirements.txt. I have it working now using Python 3.6.8, PyTorch 1.10.1 and TensorFlow 2.4.3.
Just in case this is useful to someone else, this is the relevant part of my current Dockerfile:
FROM nvcr.io/nvidia/pytorch:20.12-py3
RUN apt-get update && apt-get install -y wget tmux && rm -rf /var/lib/apt/lists/*
WORKDIR /setup
WORKDIR /app
RUN conda update conda
RUN conda create -n yanmtt python=3.6.8
SHELL ["conda", "run", "-n", "yanmtt", "/bin/bash", "-c"]
RUN git clone https://github.com/prajdabre/yanmtt
WORKDIR yanmtt
RUN pip install -r requirements.txt
WORKDIR transformers
RUN python setup.py install
RUN pip install tensorflow==2.4.3
SHELL ["/bin/bash", "-c"]
ENV PYTHONPATH=$PYTHONPATH:/app/yanmtt/transformers
RUN conda install -n yanmtt pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
WORKDIR /setup
RUN git clone --branch v0.1.95 https://github.com/google/sentencepiece.git
RUN mkdir sentencepiece/build
WORKDIR sentencepiece/build
RUN cmake .. && make -j 4
RUN make install && ldconfig -v
RUN echo 'eval "$(conda shell.bash hook)"' >>~/.bashrc && echo 'conda activate yanmtt' >>~/.bashrc
WORKDIR /app
Oh fantastic. Could you make a contrib folder in the examples folder, write down these points there, and then send a pull request? It would really help people.
I run bash examples/create_tokenizer.sh and then bash examples/create_tokenizer.sh, but the latter shows and then hangs without showing anything else. If I press ^C to cancel, the following traceback is shown:
I am running YANMTT in a Docker container on a machine with an A100 40GB GPU. The only dependency for which I am using a newer version is torch, as the version in requirements.txt is too old for my GPU.
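For anyone wanting to confirm the "too old for my GPU" situation, a rough sketch of the check is below. It assumes the compute capability comes from torch.cuda.get_device_capability() and the list of compiled architectures from torch.cuda.get_arch_list(), both of which exist in recent PyTorch releases. The helper arch_supported is hypothetical, and it deliberately ignores PTX entries such as compute_80, which can sometimes be JIT-compiled for newer GPUs.

```python
def arch_supported(capability, arch_list):
    """Hypothetical helper: does a PyTorch build ship binary kernels for this GPU?

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as "sm_70", in the format returned by
                torch.cuda.get_arch_list().

    Simplifying assumption: an "sm_XY" binary is usable on a GPU with the
    same major version X and a minor version >= Y. PTX ("compute_XY")
    entries and suffixed names are skipped.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as "compute_80"
        digits = arch[len("sm_"):]
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip anything that is not plain "sm_<major><minor>"
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


# Inside the container, with torch installed and a GPU visible, one might run:
#   import torch
#   cap = torch.cuda.get_device_capability(0)
#   print(cap, arch_supported(cap, torch.cuda.get_arch_list()))
```

An A100 reports compute capability (8, 0), so a build whose architecture list stops at sm_70 or sm_75 would fail this check, which is consistent with needing a newer torch than the one pinned in requirements.txt.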