stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

How can I run more processes? #38

Closed bazzmx closed 5 years ago

bazzmx commented 5 years ago

I'm running a simple pipeline of tokenization and POS tagging with a 600 MB text file in Catalan as input. stanfordnlp automatically runs 24 processes and is processing about 1 MB every 10 minutes or so.

I tried changing pos_batch_size (from 10000 to 100000, then from 200000 down to 20000, etc.) and tokenize_batch_size (32, 64, 128, then back), but I seem to be hitting a bottleneck, because increasing the batch size actually makes the process slower.
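
For reference, this is roughly how I'm setting those values (a sketch; the file name and the exact values vary between runs):

import stanfordnlp

# Processor-specific settings are passed as {processor}_{setting} keyword arguments.
nlp = stanfordnlp.Pipeline(
    lang='ca',
    processors='tokenize,pos',
    tokenize_batch_size=64,    # one of the values I tried
    pos_batch_size=10000,      # raising this only made things slower
)

with open('wiki_ca.txt', encoding='utf8') as f:  # 'wiki_ca.txt' is a placeholder name
    text = f.read()
doc = nlp(text)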

How can I change the number of processes to run?
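
(I'm assuming the 24 workers come from PyTorch's CPU thread pool rather than from stanfordnlp itself; if so, something like this should at least let me change the thread count, though I haven't verified that it helps:)

import torch

# Set the number of CPU threads PyTorch uses for intra-op parallelism.
torch.set_num_threads(24)
print(torch.get_num_threads())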

My system configuration is as follows:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2500.007
CPU max MHz:           2900.0000
CPU min MHz:           1200.0000
BogoMIPS:              4400.80
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

I tried using the GPU, but it was slower, and adjusting the batch size did not improve the processing time.

NVIDIA-SMI 375.39    Driver Version: 375.39

GPU  Name         Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC
Fan  Temp  Perf   Pwr:Usage/Cap | Memory-Usage         | GPU-Util  Compute M.

0    Tesla M40    Off           | 0000:04:00.0  Off    | 0
N/A  40C   P0     63W / 250W    | 434MiB / 11443MiB    | 0%        Default

Processes:                                               GPU Memory
GPU  PID     Type  Process name                          Usage

0    172198  C     python                                107MiB
0    172526  C     python                                107MiB
0    186922  C     python                                107MiB
0    187828  C     python                                107MiB

I'm using Python 3.6.8 in an Anaconda environment.

yuhaozhang commented 5 years ago

Hi @bazzmx, I think this might be related to https://github.com/stanfordnlp/stanfordnlp/issues/18. In short, we recently made a fix to the POS tagger that makes it run roughly 10x faster. However, this fix is not on PyPI yet. Can you try re-installing the latest master branch from source (https://github.com/stanfordnlp/stanfordnlp#setup) and see if it gives you enough of a speedup?

bazzmx commented 5 years ago

I installed that version a moment ago. When I feed the text line by line, with entries separated by a single line break as in the first screenshot, I get an assertion error.

[screenshot, 2019-02-23 15:00:10]

If I understand correctly, the suggested method is to merge everything into one file, with entries separated by two line breaks (\n\n) as in the second screenshot, and feed it directly, as in:

import codecs

# NLP is the stanfordnlp.Pipeline loaded earlier
with codecs.open(filename, encoding="utf8") as f:
    text_catala = f.read()
    doc = NLP(text_catala)
[screenshot, 2019-02-23 14:59:48]
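
For completeness, a quick sketch of how I'm building that merged text in the first place (the file name lines_ca.txt is just illustrative):

# Join the per-line entries with blank lines so the tokenizer sees them as separate paragraphs.
with open("lines_ca.txt", encoding="utf8") as f:
    lines = [line.strip() for line in f if line.strip()]

text_catala = "\n\n".join(lines)
doc = NLP(text_catala)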

stanfordnlp then starts to process the text, but it looks like only one CPU core is working.

[screenshot, 2019-02-23 15:12:22]

So I don't know which way is better: should I just wait and see how long it takes, or keep feeding the text line by line?

Anyway, I don't know what I'm missing here. I hope this is useful, thanks in advance.

mehmetilker commented 5 years ago

@bazzmx when you tried the GPU, did you look at the output log?

When I tried to activate the GPU with the following config, both on my local machine and in a Google Colab Jupyter notebook, the GPU wasn't used.

config = {
    'use_gpu': True,
    'processors': 'tokenize,pos,lemma,mwt,depparse', #depparse
}
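
(For reference, I'm building the pipeline from that config roughly like this; the lang value is just an example:)

nlp = stanfordnlp.Pipeline(lang='en', **config)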

output log:

Use device: cpu
---
Loading: tokenize
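
It might also be worth checking whether PyTorch can see a CUDA device at all before loading the pipeline; a quick sanity check, nothing stanfordnlp-specific:

import torch

# If this prints False, the pipeline will fall back to "Use device: cpu" regardless of use_gpu.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))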
bazzmx commented 5 years ago

The documentation says it selects the GPU automatically if one is available. I'm explicitly selecting the CPU for the pipeline I posted the screenshots from, but I just tested selecting the GPU and it does not use it.

>>> nlp = stanfordnlp.Pipeline(lang='ca', use_gpu=True)                                                                  
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/users/atorres/stanfordnlp_resources/ca_ancora_models/ca_ancora_tokenizer.pt', 'lang': 'ca', 'shorthand': 'ca_ancora', 'mode': 'predict'}
---

Now, in my case I can't tell whether it isn't using the GPU because of a bug or because there are too many processes already using it (other users on the same server are running several things), but a few days ago, when I was running my first pipelines, it was working fine in Catalan, Spanish and French.

mehmetilker commented 5 years ago

I should add some findings. There is a setting in Google Colab to activate the GPU. With it enabled, it works:

Use device: gpu
---
Loading: tokenize

The other reason I could not use the GPU on my local machine: although I can see GPU usage from some processes, my graphics card is an Intel one and it does not have CUDA support. And if I'm not wrong, PyTorch only supports CUDA.

bazzmx commented 5 years ago

And if I am not wrong PyTorch only supports CUDA.

That is correct: PyTorch only supports CUDA devices. When a CUDA device is available, torch will use it; otherwise it runs considerably slower on the CPU.

By the way, my pipeline ran considerably faster after a while. I just let the process run for about a day and everything was tokenized and tagged, so there's that. Maybe there were some issues regarding memory handling?

yuhaozhang commented 5 years ago

@mehmetilker @bazzmx It is definitely the case that only CUDA is supported via PyTorch. So if your GPU does not support CUDA, I wouldn't expect it to work.

@bazzmx Did you solve your problem of the tokenizer/tagger running too slowly after installing the latest version from source?

bazzmx commented 5 years ago

I'd say yes, because it took less than 24 hours to tokenize and POS-tag a filtered Wikipedia dump in Catalan and then train a word2vec model. I'm guessing the earlier slowness was a combination of the old implementation and the other processes running on the server; my earlier attempts always broke or got stuck on tokenization for over a day.

Now, regarding the use of the GPU, I'm not really sure. Right now someone is using the GPU for a TensorFlow process, and I'm guessing that's the reason stanfordnlp doesn't use it for the pipeline.

yuhaozhang commented 5 years ago

Good to know! I am closing this issue for now, but feel free to comment if you have further issues.

Henry-E commented 5 years ago

I'm running into the opposite issue: out-of-memory errors on an 11 GB 1080 Ti.

Traceback (most recent call last):
  File "/home/henrye/projects/wp_neural_pipeline/modules/parse_tripadvisor_with_stanfordnlp.py", line 31, in <module>
    main()
  File "/home/henrye/projects/wp_neural_pipeline/modules/parse_tripadvisor_with_stanfordnlp.py", line 24, in main
    parsed_review = nlp(review)
  File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 125, in __call__
    self.process(doc)
  File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 119, in process
    self.processors[processor_name].process(doc)
  File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/pipeline/depparse_processor.py", line 22, in process
    preds += self.trainer.predict(b)
  File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/models/depparse/trainer.py", line 72, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, lemma, head, deprel, word_orig_idx, sentlens, wordlens)
  File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/models/depparse/model.py", line 190, in forward
    preds.append(deprel_scores.max(3)[1].detach().cpu().numpy())
RuntimeError: CUDA out of memory. Tried to allocate 1.33 GiB (GPU 0; 10.91 GiB total capacity; 9.17 GiB already allocated; 1.01 GiB free; 140.70 MiB cached)
bazzmx commented 5 years ago

Well, the last line states the following: RuntimeError: CUDA out of memory. Tried to allocate 1.33 GiB (GPU 0; 10.91 GiB total capacity; 9.17 GiB already allocated; 1.01 GiB free; 140.70 MiB cached)

You are trying to allocate more memory than the GPU can provide: you have 1.01 GiB free and the pipeline tries to allocate 1.33 GiB, which is what triggers the error.

Try again with smaller batches and see if the error still pops up. This might slow the process down a bit, but it should help avoid the issue.

Henry-E commented 5 years ago

Sure, I get that it's an OOM error. How do I run with smaller batches?

yuhaozhang commented 5 years ago

Hi @Henry-E, sorry about the memory issue. The neural models in StanfordNLP do require a large amount of CUDA memory, mainly because of the large word embeddings they use. We will work on reducing the memory requirement in future releases.

Are you running the dependency parser? If so, you can set its batch size by adding a depparse_batch_size argument (the default is 5000) to your config before running the parser. See the pipeline documentation for how to set the config, and the parser documentation for more details on the parser batch size. You can equivalently set the batch size of other processors such as the POS tagger; see the sketch below.
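
For example, a minimal sketch (the English pipeline and the value 1000 are just illustrative):

import stanfordnlp

nlp = stanfordnlp.Pipeline(
    lang='en',
    processors='tokenize,pos,lemma,depparse',
    depparse_batch_size=1000,  # default is 5000; lower it if you hit CUDA OOM
    pos_batch_size=1000,       # other processors follow the same {processor}_batch_size pattern
)
doc = nlp("Some text to parse.")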

Henry-E commented 5 years ago

Ok awesome thanks for the pointers