Closed jgcb00 closed 3 years ago
Hi @jgcb00, thanks a lot for the PR, this looks great!
Could you tell me which PyTorch version you are using? Right now I'm running into the following problem using pytorch-1.7.1:
Traceback (most recent call last):
File "/nas/home/ziyidou/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/nas/home/ziyidou/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "run_align.py", line 161, in alignement_recovering
idxs, sent_src, sent_tgt, word_aligns_list = p.recv()
File "/nas/home/ziyidou/anaconda3/lib/python3.8/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 108, in rebuild_cuda_tensor
torch.cuda._lazy_init()
File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Hi, so my current installation is based on torch 1.8.1+cu111, running on Ubuntu 20.04 with a basic installation using pip.
This error might be caused by torch's re-implementation of multiprocessing, which caused me some issues too. I'm going to look into it.
Edit: Python 3.8.5
So I tested on pytorch 1.7.1 with CUDA 10.1 and Python 3.6.9, and the extraction ran smoothly without any issue, either on GPU or CPU. Installation in a virtualenv with pip. I will test other installations to check which ones work and which don't, and see if this can be resolved.
Which version of CUDA were you running?
Edit: Python 3.8.5 + torch 1.7 with CUDA 11 works fine
I'm using CUDA 10.2 and python 3.8.3. I tried pytorch 1.8.1 but the problem still exists.
It seems that following the suggestions in https://stackoverflow.com/questions/48822463/how-to-use-pytorch-multiprocessing can solve the issue:
from torch.multiprocessing import Pool, Process, SimpleQueue, Pipe, set_start_method
set_start_method('spawn', force=True)
I'm trying to see if we can remove the time.sleep call.
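The fix above boils down to one rule: CUDA state cannot survive a fork(), so any worker that touches CUDA tensors must be started with the 'spawn' method, which launches a fresh interpreter instead of copying the parent's memory. A minimal sketch using the stdlib multiprocessing module (torch.multiprocessing wraps the same API); the helper name is ours, not from the PR:

```python
# Sketch, not the PR's code: shows the start-method setting the fix relies on.
import multiprocessing as mp

def configure_start_method():
    # force=True overrides a start method that may already have been set,
    # which matters when a library has implicitly chosen 'fork' first
    mp.set_start_method('spawn', force=True)
    return mp.get_start_method()

if __name__ == '__main__':
    print(configure_start_method())  # spawn
```

Note that with 'spawn' the child re-imports the main module, so worker targets must be defined at module top level and their arguments must be picklable.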
Hi, here is a speed improvement of up to 50%, with RAM usage fixed at a constant amount (approx. 10GB).
This merge request changes a lot of things in run_align.py.
The starting point was the observation that preprocessing ran single-threaded, without using the GPU, and needed RAM proportional to the size of the input file.
The main idea in this version is to run preprocessing and alignment at the same time. This improves memory management, because results are written immediately instead of accumulating in memory, and improves speed thanks to the parallelization.
1) Previously the memory usage was not fixed, which was causing me issues for files with more than 3 million sentences (32GB of RAM needed); now roughly 10GB suffices for any file size. This is of course a drawback for small files, but you can lower the RAM needed with the new parameter --nb_preprocess; I personally recommend at least 2-3 for this option.
2) Preprocessing took approximately 50% of the total time, and that time is now removed, improving speed by the same factor.

In more detail, we have several processes:
- feed_data: reads the data and puts all sentences in a queue for process_encoding
- process_encoding: preprocesses the data (several of these run in parallel)
- data_batch: reorders the sentences to get the same output order and feeds batches to word_align
- word_align: runs the alignment, then passes all sentences to alignement_recovering
- alignement_recovering: writes the results to the output file
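The staged, queue-connected design above can be sketched as follows. This is a simplified, thread-based illustration of the handoff pattern (the PR itself uses separate processes via torch.multiprocessing, and its stages do real tokenization and alignment); the stage names mirror the PR, the work inside each stage is a stand-in:

```python
# Sketch of the pipeline: stages connected by queues, results written as
# they arrive so memory stays bounded. Threads stand in for processes here.
import threading
import queue

SENTINEL = None  # end-of-stream marker passed down the pipeline

def feed_data(lines, out_q):
    # read the input and enqueue (index, sentence) pairs
    for idx, line in enumerate(lines):
        out_q.put((idx, line))
    out_q.put(SENTINEL)

def process_encoding(in_q, out_q):
    # stand-in for the real preprocessing/tokenization step
    while (item := in_q.get()) is not SENTINEL:
        idx, line = item
        out_q.put((idx, line.split()))
    out_q.put(SENTINEL)

def alignement_recovering(in_q, results):
    # in the PR this writes each result to the output file immediately
    while (item := in_q.get()) is not SENTINEL:
        results.append(item)

def run_pipeline(lines):
    q1, q2 = queue.Queue(), queue.Queue()
    results = []
    stages = [
        threading.Thread(target=feed_data, args=(lines, q1)),
        threading.Thread(target=process_encoding, args=(q1, q2)),
        threading.Thread(target=alignement_recovering, args=(q2, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    # restore input order (the PR's data_batch stage handles this reordering)
    return [toks for _, toks in sorted(results)]
```

Because each stage only holds what is currently in its queue, memory stays bounded regardless of input size, which is the property the PR description claims.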
word_align and alignement_recovering are split because it improves the alignment speed by 10%.
Here is a benchmark on my computer (RTX3090, i9-7920X, 32GB 2666MHz) for 1 million sentences:
Improvements to be made :
Finally, I added
--word_output
, in order to get a tab-separated file with the words and the alignment. It was useful for my application (building a probabilistic dictionary). If you have any comment or idea for improvement, please let me know; it's my first ever pull request on an open-source repo.
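A hypothetical sketch of what a tab-separated word-pair line could look like: for each aligned index pair (i, j), the source word and target word joined by a tab. The exact column layout of --word_output in the PR may differ; the function name is ours, for illustration only:

```python
# Illustration only: builds "src_word<TAB>tgt_word" lines from alignment
# index pairs. The PR's actual --word_output format may include more columns.
def word_pairs(sent_src, sent_tgt, word_aligns):
    # word_aligns: iterable of (src_index, tgt_index) pairs
    return ["{}\t{}".format(sent_src[i], sent_tgt[j]) for i, j in word_aligns]
```

Aggregating such pairs over a corpus and counting them is one straightforward way to estimate the translation probabilities for a probabilistic dictionary.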