neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Speed and memory improvement #19

Closed jgcb00 closed 3 years ago

jgcb00 commented 3 years ago

Hi, here is a speed improvement of up to 50%, and a RAM requirement that is now a fixed amount (approx. 10GB) instead of growing with the input.

This merge request changes a lot of things in run_align.py.

The starting point was the observation that preprocessing ran single-threaded without using the GPU, and that the RAM needed grew proportionally to the size of the input file.

The main idea in this version is to run preprocessing and alignment at the same time. This improves memory management, because results are written out immediately instead of accumulating in memory, and improves speed thanks to the parallelization.

1) The memory footprint is now fixed. Previously it grew with the input, which caused me issues for files with more than 3 million sentences (32GB of RAM needed); now only about 10GB is needed for any file size. This is of course a drawback for small files, but you can reduce the RAM needed with the new parameter --nb_preprocess; I personally recommend at least 2-3 for this option.
2) Preprocessing took approximately 50% of the total time; that time is now removed (it overlaps with alignment), improving the speed by that factor.

In more detail, there are now several processes (a simplified sketch follows):
- feed_data: reads the data and puts all sentences in a queue for process_encoding
- process_encoding: preprocesses the data (several instances run in parallel)
- data_batch: reorders the sentences so the output order matches the input, and feeds batches to word_align
- word_align: runs the alignment, then passes the results to alignement_recovering
- alignement_recovering: writes the results to the output file
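To illustrate the idea, here is a simplified sketch (the queue sizes, file name, and tokenization are placeholders of mine, not the exact run_align.py code); the bounded queues are what keep the RAM footprint constant:

import torch.multiprocessing as mp

def feed_data(path, q_raw, n_workers):
    # Read sentences and enqueue them; put() blocks when the bounded queue
    # is full, so RAM stays constant no matter how large the input file is.
    with open(path) as f:
        for idx, line in enumerate(f):
            q_raw.put((idx, line.rstrip("\n")))
    for _ in range(n_workers):
        q_raw.put(None)  # one end-of-input sentinel per preprocessing worker

def process_encoding(q_raw, q_enc):
    # Preprocess/tokenize sentences; several instances run in parallel.
    while True:
        item = q_raw.get()
        if item is None:
            break
        idx, line = item
        q_enc.put((idx, line.split()))  # placeholder for the real tokenization
    q_enc.put(None)

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # needed once workers touch CUDA
    n_workers = 3                      # cf. the --nb_preprocess option
    q_raw = mp.Queue(maxsize=10_000)   # bounded queues cap the memory footprint
    q_enc = mp.Queue(maxsize=10_000)
    procs = [mp.Process(target=process_encoding, args=(q_raw, q_enc))
             for _ in range(n_workers)]
    procs.append(mp.Process(target=feed_data,
                            args=("input.txt", q_raw, n_workers)))
    for p in procs:
        p.start()
    # data_batch would reorder items by idx here and feed batches to
    # word_align; this stub just drains the queue until all workers are done.
    done = 0
    while done < n_workers:
        if q_enc.get() is None:
            done += 1
    for p in procs:
        p.join()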

word_align and alignement_recovering are split into separate processes because it improves the alignment speed by 10%.

Here is a benchmark on my computer (RTX 3090, i9-7920X, 32GB 2666MHz) for 1 million sentences:

Improvements to be made:

Finally, I added --word_output in order to get a tab-separated file with the words and their alignments. It was useful for my application (building a probabilistic dictionary).
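For instance, assuming one src-word<TAB>tgt-word pair per line (the file name and exact format here are just for illustration), such a file can be turned into translation probabilities like this:

from collections import Counter, defaultdict

# Illustrative only: turn tab-separated word pairs into P(tgt | src).
pair_counts = defaultdict(Counter)
with open("alignments.words") as f:  # placeholder name for the --word_output file
    for line in f:
        src, tgt = line.rstrip("\n").split("\t")
        pair_counts[src][tgt] += 1

prob = {src: {tgt: count / sum(counts.values())
              for tgt, count in counts.items()}
        for src, counts in pair_counts.items()}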

If you have any comments or ideas for improvement, please let me know; it's my first ever pull request on an open-source repo.

zdou0830 commented 3 years ago

Hi @jgcb00, thanks a lot for the PR, this looks great!

Could you tell me which pytorch version you are using? Right now I'm running into the following problem using pytorch 1.7.1:

Traceback (most recent call last):
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "run_align.py", line 161, in alignement_recovering
    idxs, sent_src, sent_tgt, word_aligns_list = p.recv()
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 108, in rebuild_cuda_tensor
    torch.cuda._lazy_init()
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

jgcb00 commented 3 years ago

Hi, so my current installation is based on torch 1.8.1+cu111, running on Ubuntu 20.04 with a basic installation using pip.

This error might be caused by torch's re-implementation of multiprocessing, which caused me some issues too. I'm going to look into it.

Edit: Python 3.8.5

jgcb00 commented 3 years ago

So I tested on pytorch 1.7.1 with CUDA 10.1 and python 3.6.9, and the extraction went smoothly without any issue, either on GPU or CPU. Installation was in a virtualenv with pip. I will try to test other installations to check which ones work and which don't, and see if this can be resolved.

Which version of CUDA were you running?

Edit: python 3.8.5 + torch 1.7 with CUDA 11 works fine.

zdou0830 commented 3 years ago

I'm using CUDA 10.2 and python 3.8.3. I tried pytorch 1.8.1 but the problem still exists.

It seems that following the suggestions in https://stackoverflow.com/questions/48822463/how-to-use-pytorch-multiprocessing can solve the issue:

from torch.multiprocessing import Pool, Process, SimpleQueue, Pipe, set_start_method
set_start_method('spawn', force=True)
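For reference, a minimal sketch of the pattern (a toy example of mine, not the PR code): the start method has to be set once in the entry point, before any worker process is created:

import torch
import torch.multiprocessing as mp

def worker(conn):
    # Under 'spawn' the child starts with a clean CUDA state; a child forked
    # after the parent has initialized CUDA cannot re-initialize it, which is
    # exactly the RuntimeError in the traceback above.
    t = conn.recv()
    print(t.device, float(t.sum()))

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # before any Process is created
    parent, child = mp.Pipe()
    p = mp.Process(target=worker, args=(child,))
    p.start()
    t = torch.ones(3, device="cuda")  # requires a GPU; keep a reference alive
    parent.send(t)                    # CUDA tensors cross processes via IPC
    p.join()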

I'm trying to see if we can remove time.sleep.
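One standard way to avoid a sleep/poll loop (a general pattern, not necessarily what run_align.py should end up doing) is to block on the queue and use a sentinel to signal the end:

import torch.multiprocessing as mp

def consumer(q):
    # q.get() blocks until an item is available, so no time.sleep loop is
    # needed; a None sentinel from the producer signals end of stream.
    while True:
        item = q.get()
        if item is None:
            break
        print("processing", item)

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    for i in range(5):
        q.put(i)
    q.put(None)  # tell the consumer to stop
    p.join()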