Doesn't sound like iteration speed is the bottleneck then.
Not sure why it would max out at 2 cores... one I could understand (something went wrong with parallelization), but two is a strange number :)
Can you gist your log (DEBUG level) & paste the link here?
Thanks for your response, piskvorky. Could you please show me how to get the log at DEBUG level? Sorry, I am new to this, so if you could help, I will try it and post it right away.
On the other hand, I tried to check the BLAS library with the numpy.show_config() command, and this is the result:
lapack_info:
    libraries = ['lapack', 'lapack']
    library_dirs = ['/usr/lib64']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'lapack', 'blas', 'blas']
    library_dirs = ['/usr/lib64']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
    NOT AVAILABLE
blas_info:
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib64']
atlas_3_10_blas_threads_info:
    NOT AVAILABLE
atlas_threads_info:
    NOT AVAILABLE
atlas_3_10_threads_info:
    NOT AVAILABLE
atlas_blas_info:
    NOT AVAILABLE
atlas_3_10_blas_info:
    NOT AVAILABLE
atlas_blas_threads_info:
    NOT AVAILABLE
openblas_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
blas_opt_info:
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib64']
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
    NOT AVAILABLE
atlas_3_10_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE
The atlas_threads_info entry says NOT AVAILABLE -- does this affect training the gensim model? From what I read, installing a BLAS library should make the process faster, but I thought BLAS was already there, given the lapack_info and blas_info entries?
BLAS would make some operations faster, but it is not related to the lack of parallelism in LdaMulticore.
Please find the debug log here: https://gist.github.com/davidleo4313/7e8ad26fcf6503a050033a3a40d87353
As for only 2 cores running, below is the output of top (per-CPU view):
top - 16:57:28 up 14:06, 3 users, load average: 1.41, 1.35, 1.30
Tasks: 154 total, 3 running, 146 sleeping, 5 stopped, 0 zombie
Cpu0  : 46.6%us,  0.7%sy,  0.0%ni, 52.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 81.5%us, 11.5%sy,  0.0%ni,  7.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  5.4%us,  4.7%sy,  0.0%ni, 89.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 25.2%us,  9.0%sy,  0.0%ni, 65.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  15403956k total,  9159936k used,  6244020k free,    65380k buffers
Swap:        0k total,        0k used,        0k free,   659172k cached
The total CPU usage never went past 200%, and most of the time it shows just 2 CPUs at 100% usage. Please note that I currently set workers to 3; however, it is the same if I set workers to 7.
Hi @piskvorky,
I have attached the debug log and the top output above. Could you please help clarify the problem? Currently it takes 16 hours to train the model, which is a bit too long. Thanks in advance for your help.
Thanks. The timestamps are missing from the log -- can you add %(asctime)s to your log format? And also %(processName)s while you're at it. Cheers.
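For reference, a minimal logging setup along these lines (just a sketch, adjust to your own script) produces DEBUG-level output with timestamps and process names:

import logging

# Sketch: DEBUG-level logging with timestamps and worker process names,
# so each LdaMulticore log line shows when and in which process it happened.
logging.basicConfig(
    format="%(asctime)s : %(processName)s : %(levelname)s : %(message)s",
    level=logging.DEBUG,
)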
But looking at the log, the number of features stands out. Do you really need almost 1 million features? That's really excessive for most applications, and will make your models larger (as well as slower to train).
Thanks, piskvorky. I have updated the log with what you requested. Please find the link below:
https://gist.github.com/davidleo4313/7e8ad26fcf6503a050033a3a40d87353
About the features, is that the main reason? Currently I only remove tokens that appear fewer than 2 times in each document -- is that why the feature set is so large?
Yes, the feature set size seems excessive. Unless you have a special reason not to, I'd suggest removing the most frequent and infrequent tokens with filter_extremes, e.g. dictionary.filter_extremes(no_below=10, no_above=0.4, keep_n=5000).
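For reference, a rough sketch of where that pruning fits when building the corpus (tokenized_docs is a placeholder for your own preprocessed, tokenized documents):

from gensim.corpora import Dictionary

# Sketch: build the vocabulary, drop very rare and very common tokens,
# then convert the documents to bag-of-words vectors over the smaller vocabulary.
dictionary = Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=10, no_above=0.4, keep_n=5000)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]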
This may be connected to the perceived lack of parallelism too.
From your log, it looks like a communication problem. A worker takes ~30 seconds to process one chunk of 2,000 documents, but only ~5s of that is spent on actual LDA! The rest goes to waiting on input, serializing/deserializing chunks, and sending results back to the master (multiprocessing queues, i.e. pickle).
Try a smaller vocabulary, so that the chunks and the model are smaller. That should reduce communication overhead, and hopefully the workers won't be starved for input any more.
Let me know how that worked!
Hi @piskvorky,
I have applied filter_extremes to my corpus and dictionary as you suggested (down to 37k features). However, the result seems no different from before. Here's the link to the new log (please note that I have updated it to 7 workers):
https://gist.github.com/davidleo4313/fe85fc84de0f25ef289784e28384582f
The top command (per-CPU view) showed the same results: at most 200% total CPU usage, and only 2-3 running processes:
top - 15:47:20 up 8:24, 2 users, load average: 1.95, 1.99, 1.98
Tasks: 145 total, 2 running, 142 sleeping, 1 stopped, 0 zombie
Cpu0  : 81.4%us,  3.7%sy,  0.0%ni, 15.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 21.3%us,  0.0%sy,  0.0%ni, 78.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 15.4%us,  0.0%sy,  0.0%ni, 83.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.7%st
Mem:  15403956k total,  4908188k used, 10495768k free,    62668k buffers
Swap:        0k total,        0k used,        0k free,  1909568k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 4102 root      20   0 1655m 828m  17m S 110.5  5.5 376:04.91 python
 4270 root      20   0 1291m 706m 6452 R 100.1  4.7  53:14.02 python
 4272 root      20   0 1264m 679m 6396 S  21.6  4.5  53:13.75 python
 3307 mongod    20   0  366m  42m  12m S   0.3  0.3   1:05.68 mongod
 4920 ec2-user  20   0 15292 2252 1912 R   0.3  0.0   0:00.02 top
    1 root      20   0 19644 2608 2288 S   0.0  0.0   0:00.71 init
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S   0.0  0.0   0:00.01 ksoftirqd/0
    4 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kworker/0:0
    5 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0     0    0    0 S   0.0  0.0   0:01.47 rcu_sched
    8 root      20   0     0    0    0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/0
   10 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/1
   11 root      20   0     0    0    0 S   0.0  0.0   0:00.00 ksoftirqd/1
   12 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kworker/1:0
   13 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 kworker/1:0H
   14 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/2
   15 root      20   0     0    0    0 S   0.0  0.0   0:00.00 ksoftirqd/2
   16 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kworker/2:0
   17 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 kworker/2:0H
   18 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/3
   19 root      20   0     0    0    0 S   0.0  0.0   0:00.00 ksoftirqd/3
   20 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kworker/3:0
   21 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 kworker/3:0H
   22 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/4
   23 root      20   0     0    0    0 S   0.0  0.0   0:00.00 ksoftirqd/4
   25 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 kworker/4:0H
   26 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/5
   27 root      20   0     0    0    0 S   0.0  0.0   0:00.00 ksoftirqd/5
There must be a bottleneck somewhere, but what am I missing? I just can't figure it out; your help is really appreciated. Thanks in advance, piskvorky.
Hi @davidleo4313, I see the same thing in this log too.
Processing a chunk of 2,000 documents is pretty fast (~5s), but getting that chunk to the worker in the first place is super slow (~30s, the difference between "getting a new job" and "processing chunk" same-worker log lines).
That's very unusual and probably the root cause of the other issues, like the lack of parallelism. So I'd start debugging there.
I really don't know what could be causing this. Can you try extracting one chunk of 2k docs manually, and timing how long it takes to pickle/unpickle it?
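Something along these lines should do for that test (a rough sketch; corpus stands in for your streamed bag-of-words corpus, and the chunking here is only an approximation of what gensim does internally):

import pickle
import time
from itertools import islice

# Sketch: take one chunk of 2,000 BoW documents and time a pickle round-trip,
# to check whether serialization alone explains the ~30s gap per chunk.
chunk = list(islice(iter(corpus), 2000))

t0 = time.time()
data = pickle.dumps(chunk, protocol=pickle.HIGHEST_PROTOCOL)
print("pickle: %.2fs, %.1f MB" % (time.time() - t0, len(data) / 1e6))

t0 = time.time()
pickle.loads(data)
print("unpickle: %.2fs" % (time.time() - t0))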
I'm thinking maybe it's some python installation peculiarity, a problem with pickle or Queue. @cscorley @gojomo @ziky90 any other ideas?
Ping @davidleo4313, did manual pickling also take 30s here?
Hi,
My current situation is this: I have a corpus of around 600,000 documents, which I have already zipped. My environment is an Amazon Linux EC2 c3.2xlarge, which has 8 vCPUs (4 physical cores, I presume). When I tried LdaMulticore with 3 or 7 workers, I only saw at most 2 cores working at 100%. Training the model takes around 16 hours, which I think is a lot.
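For reference, my training call looks roughly like this (a sketch; num_topics here is a placeholder, not my exact setting):

from gensim.models import LdaMulticore

# Sketch: train LDA with several worker processes over the streamed BoW corpus.
lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=100,   # placeholder value
    workers=3,        # also tried 7
    chunksize=2000,
)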
I have done some research, and in issue #288 I believe you mentioned slow iteration speed through the corpus. But when I checked, it took only 15 minutes to read through my corpus, so I don't know whether slow iteration is the problem here. I tried sharding the corpus, but the machine's memory (15 GB) and hard drive (20 GB SSD) could not handle it; I got either a memory error or an out-of-disk-space error.
Is there any workaround to reduce the running time, or am I missing something? Thank you so much in advance; help is really appreciated.