Hi @bazzmx, I think this might be related to https://github.com/stanfordnlp/stanfordnlp/issues/18. In short, we recently made a fix to the POS tagger that makes it run roughly 10 times faster. However, this fix is not on PyPI
yet. Can you try re-installing the latest master branch from source (https://github.com/stanfordnlp/stanfordnlp#setup) and see if it gives you enough of a speedup?
I installed that version just a moment ago. When processing text that uses a single line break (as in the first image) and feeding it line by line, I get an assertion error.
If I understand correctly, the suggested method is to merge everything into one file, with double line breaks (\n\n) like in the second image, and feed it directly, as in
import codecs

with codecs.open(filename, encoding="utf8") as f:
    text_catala = f.read()
doc = NLP(text_catala)  # NLP is the stanfordnlp pipeline created earlier
Then stanfordnlp starts to process the text, but it seems like only one CPU is working.
So I don't know which way is better: should I just wait and see how long it takes, or keep feeding it line by line?
Anyway, I don't know what I'm missing here. I hope this is useful, thanks in advance.
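For what it's worth, here is a minimal sketch of the two feeding strategies discussed above, assuming a Catalan pipeline like the one in this thread (the file name and variable names are just illustrative):

import codecs
import stanfordnlp

nlp = stanfordnlp.Pipeline(lang='ca')

# Option 1: feed the pipeline line by line (this is what triggered the assertion error for me)
# for line in lines:
#     doc = nlp(line)

# Option 2: merge everything into one document, with paragraphs separated by blank lines (\n\n)
with codecs.open('wiki_ca.txt', encoding='utf8') as f:  # 'wiki_ca.txt' is a hypothetical file name
    lines = [line.strip() for line in f if line.strip()]
text_catala = '\n\n'.join(lines)
doc = nlp(text_catala)  # one big call instead of many small ones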
@bazzmx when you tried the GPU, did you look at the output log?
When I tried to activate the GPU with the following config, both on my local machine and in a Google Colab Jupyter notebook, the GPU wasn't used.
config = {
    'use_gpu': True,
    'processors': 'tokenize,pos,lemma,mwt,depparse',  # depparse
}
output log:
Use device: cpu
---
Loading: tokenize
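In case it helps, this is how a config like that ends up being passed to the pipeline constructor (a minimal sketch; the lang value is illustrative and not part of the original config):

import stanfordnlp

config = {
    'lang': 'ca',  # illustrative; use whichever language models you downloaded
    'use_gpu': True,
    'processors': 'tokenize,pos,lemma,mwt,depparse',
}
nlp = stanfordnlp.Pipeline(**config)  # prints the "Use device: ..." line shown above while loading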
The documentation says that it selects the GPU automatically if available. I'm explicitly selecting the CPU for the pipeline I posted the screenshots from, but I just tested selecting the GPU and it still does not use the GPU.
>>> nlp = stanfordnlp.Pipeline(lang='ca', use_gpu=True)
Use device: cpu
---
Loading: tokenize
With settings:
{'model_path': '/users/atorres/stanfordnlp_resources/ca_ancora_models/ca_ancora_tokenizer.pt', 'lang': 'ca', 'shorthand': 'ca_ancora', 'mode': 'predict'}
---
Now, in my case right now I can't tell if it's not using the GPU because of a bug or because there are too many processes using it (more users on the same server running several things), but a few days ago, when I was running my first pipelines, it was working fine in Catalan, Spanish and French.
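One quick check that can help tell these cases apart (not from the original comments, just a standard PyTorch check) is whether PyTorch can see a CUDA device at all:

import torch

print(torch.cuda.is_available())          # False -> the pipeline can only fall back to the CPU
if torch.cuda.is_available():
    print(torch.cuda.device_count())      # how many CUDA devices are visible
    print(torch.cuda.get_device_name(0))  # name of the first visible device

If this prints False, the "Use device: cpu" line in the pipeline output is expected.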
I should add some findings. There is a setting in Google Colab to activate the GPU. It works:
Use device: gpu
---
Loading: tokenize
The other problem, the reason I could not use the GPU on my local machine, is that although I can see GPU usage from some processes, my graphics card is Intel and it does not have CUDA support. And if I am not wrong, PyTorch only supports CUDA.
And if I am not wrong PyTorch only supports CUDA.
That is correct, PyTorch only supports CUDA devices. When a CUDA device is available, torch will use it; otherwise it will run considerably slower on the CPU.
By the way, my pipeline ran considerably faster after a while. I just let the process run for about a day and everything was tokenized and tagged, so there's that. Maybe there were some issues regarding memory handling?
@meghabyte @bazzmx It is definitely the case that only CUDA will be supported via PyTorch. So if your GPU does not support CUDA, then I wouldn't expect it to work.
@bazzmx Did you solve your problem of tokenizer/tagger running too slowly after installing the latest version from source?
I would say yes, because it took less than 24 hours to tokenize and POS-tag the filtered Catalan Wikipedia and then train a word2vec model. I'm guessing the earlier slowness was a combination of the old implementation and the other processes that were running on the server; my earlier attempts always broke or got stuck on tokenization for over a day.
Now, regarding the use of the GPU, I'm not really sure. Right now someone is using the GPU for a TensorFlow process, and I'm guessing that's the reason stanfordnlp doesn't use it for the pipeline.
Good to know! I am closing this issue for now, but feel free to comment if you have further issues.
I'm running into the opposite issue: out-of-memory errors on an 11 GB 1080 Ti.
Traceback (most recent call last):
File "/home/henrye/projects/wp_neural_pipeline/modules/parse_tripadvisor_with_stanfordnlp.py", line 31, in <module>
main()
File "/home/henrye/projects/wp_neural_pipeline/modules/parse_tripadvisor_with_stanfordnlp.py", line 24, in main
parsed_review = nlp(review)
File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 125, in __call__
self.process(doc)
File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 119, in process
self.processors[processor_name].process(doc)
File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/pipeline/depparse_processor.py", line 22, in process
preds += self.trainer.predict(b)
File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/models/depparse/trainer.py", line 72, in predict
_, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, lemma, head, deprel, word_orig_idx, sentlens, wordlens)
File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/henrye/anaconda3/envs/pytorch/lib/python3.6/site-packages/stanfordnlp/models/depparse/model.py", line 190, in forward
preds.append(deprel_scores.max(3)[1].detach().cpu().numpy())
RuntimeError: CUDA out of memory. Tried to allocate 1.33 GiB (GPU 0; 10.91 GiB total capacity; 9.17 GiB already allocated; 1.01 GiB free; 140.70 MiB cached)
Well, the last line states the following: RuntimeError: CUDA out of memory. Tried to allocate 1.33 GiB (GPU 0; 10.91 GiB total capacity; 9.17 GiB already allocated; 1.01 GiB free; 140.70 MiB cached)
You are trying to allocate more data than the GPU can handle: it seems that you have 1.01 GiB free and the pipeline tries to allocate 1.33 GiB, which is what causes the error.
Try again with smaller batches and see if the error still pops up. This might slow the process down a bit, but it should help avoid the issue.
Sure, I get that it's an OOM error. How do I run with smaller batches?
Hi @Henry-E, sorry that there is a memory issue. The neural model in StanfordNLP indeed requires a large amount of CUDA memory, mainly because of the large size of the word embeddings used. We will work on reducing the memory requirement in future releases.
Are you running the dependency parser? If so, you can set the batch size by adding a depparse_batch_size
argument to your config before running the parser (the default is 5000). See here for how to set the config, and see here for more details on the parser batch size. Equivalently, you can set the batch size of other processors, such as the POS tagger.
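For example, a smaller parser batch size can be set directly in the config (a minimal sketch; the value 1000 and the language are illustrative, the default being 5000 as mentioned above):

import stanfordnlp

config = {
    'lang': 'en',                 # illustrative
    'depparse_batch_size': 1000,  # default is 5000; smaller values need less GPU memory
    # 'pos_batch_size': 1000,     # the same pattern works for other processors
}
nlp = stanfordnlp.Pipeline(**config)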
Ok awesome thanks for the pointers
I'm running a simple pipeline of tokenization and POS tagging, using a 600 MB text file in Catalan as input. stanfordnlp automatically runs 24 processes, and it's processing about 1 MB every 10 minutes or so.
I tried changing pos_batch_size (from 10000 to 100000, then from 200000 down to 20000, etc.) and tokenize_batch_size (32, 64, 128, then back), but it seems that I'm hitting a bottleneck, because increasing the batch size makes the process slower.
How can I change the number of processes it runs?
My system configuration is as follows:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2500.007
CPU max MHz: 2900.0000
CPU min MHz: 1200.0000
BogoMIPS: 4400.80
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
I tried using the GPU, but it is slower, and adjusting the batch size did not improve the processing time.
NVIDIA-SMI 375.39 Driver Version: 375.39
GPU 0: Tesla M40 | Persistence-M: Off | Bus-Id: 0000:04:00.0 | Disp.A: Off | Volatile Uncorr. ECC: 0
Fan: N/A | Temp: 40C | Perf: P0 | Pwr Usage/Cap: 63W / 250W | Memory-Usage: 434MiB / 11443MiB | GPU-Util: 0% | Compute M.: Default
Processes:
GPU 0 | PID 172198 | Type C | python | 107MiB
GPU 0 | PID 172526 | Type C | python | 107MiB
GPU 0 | PID 186922 | Type C | python | 107MiB
GPU 0 | PID 187828 | Type C | python | 107MiB
I'm using Python 3.6.8 in an Anaconda environment.