piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

KeyedVectors most_similar() uses too much CPU #3335

Open MrKZZ opened 2 years ago

MrKZZ commented 2 years ago

Problem description

When using the KeyedVectors most_similar() function, my process is always killed because it uses too much CPU. I have 20 CPUs, but the process runs at 2200% CPU usage, and I can't find any setting to limit this. For example: can I use only 5 CPUs to run this code?

Steps/code/corpus to reproduce

When I run this code :

from gensim.models import KeyedVectors
model = KeyedVectors.load("word2vec")  # "word2vec" is the path to the saved vectors
similar_words = model.most_similar(positive=[query], topn=50)  # query is a single word

Note that my server kills any process that uses more than 100% CPU or memory.

Versions

gensim==4.1.2

After several minutes the process is killed. The monitor output is as follows:

killed by XXX

(screenshot) The part in the red box is my problem. How can I solve this? The monitor runs on another machine; the process on the problem machine was killed. (screenshot)

mpenkov commented 2 years ago

Why do you insist on deleting a relevant part of the template?

I'm including it below. Please provide that information.


Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)

MrKZZ commented 2 years ago

Sorry about that. I didn't get your point yesterday. Here is one of my configurations:

Linux-3.10.0-1160.11.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core
Python 3.6.13 | packaged by conda-forge | (default, Sep 23 2021, 07:56:31) [GCC 9.4.0]
Bits 64
NumPy 1.19.4
SciPy 1.5.4
gensim 4.1.2
FAST_VERSION 1

I found that this problem reproduces on many versions. I will post the other configurations I tested later.

piskvorky commented 2 years ago

Does running your processes with OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 fix the problem?
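A minimal sketch of applying that suggestion, assuming an OpenBLAS-backed NumPy build (MKL builds use MKL_NUM_THREADS instead). The variables must be set before NumPy is first imported, either in the shell or at the top of the script:

import os

# Cap BLAS/OpenMP thread pools *before* numpy/scipy/gensim are imported;
# setting them afterwards has no effect on already-initialized thread pools.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec")
similar_words = model.most_similar(positive=["king"], topn=50)  # "king" stands in for the real query

Equivalently, from the shell: OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 python your_script.py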

gojomo commented 2 years ago

I suspect those settings, which cap the BLAS library's internal multithreading, will control the number of threads used, at some cost in running time.

But if you're on an idiosyncratic system that kills your process when it (efficiently!) chooses to use a lot of threads, it might be better to remedy that system policy. (And, make sure the process-killing is truly happening for the reason you suspect – lots of CPU use – and not something else – like lots of memory use, as might happen if lots of service-threads are all redundantly loading copies of the same vectors.)