robguinness closed this issue 5 years ago.
Hi @robguinness
Can you increase the logging level of lsi_worker to DEBUG (in gensim.models.lsi_worker, the line with logging.basicConfig) and post the output here?
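For reference, the change being asked for looks roughly like this (a sketch, not a verbatim copy of lsi_worker.py; the exact line may differ by version):

import logging

# In gensim/models/lsi_worker.py, the __main__ block sets up logging along these
# lines; lowering the level from INFO to DEBUG makes the worker emit the debug
# messages shown later in this thread.
logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.DEBUG,  # was logging.INFO
)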
Will do ASAP. It might be a day or two before I can get back to this.
Just a small update: I have restarted the modeling process with logging level now set to DEBUG. I'll update when/if it gets stuck again.
I hit the locked up situation again this morning. Here is a log from the LSI worker that is locked up:
2019-01-16 11:39:11,574 - INFO - worker #6 received job #146
2019-01-16 11:39:11,574 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-16 11:39:11,574 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-16 11:39:11,575 - INFO - updating model with new documents
2019-01-16 11:39:11,575 - INFO - using 100 extra samples and 2 power iterations
2019-01-16 11:39:11,575 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-16 11:39:12,637 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-16 11:39:18,284 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-16 11:40:16,122 - DEBUG - running 2 power iterations
2019-01-16 11:40:23,740 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-16 11:41:26,022 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-16 11:42:28,981 - INFO - 2nd phase: running dense svd on (400, 1000) matrix
2019-01-16 11:42:33,172 - INFO - computing the final decomposition
2019-01-16 11:42:33,174 - INFO - keeping 300 factors (discarding 13.873% of energy spectrum)
2019-01-16 11:42:39,870 - INFO - merging projections: (500000, 300) + (500000, 300)
2019-01-16 11:42:39,872 - DEBUG - constructing orthogonal component
2019-01-16 11:42:39,873 - DEBUG - converting (500000, 300) array self.u to FORTRAN order
2019-01-16 11:42:50,973 - DEBUG - converting (500000, 300) array self.u to C order
2019-01-16 11:43:00,259 - DEBUG - computing QR of (500000, 300) dense matrix
Has been in this state for nearly 2 hours now.
Hi @robguinness, that seems to be some issue with the BLAS library you are using. Gensim depends on it to do QR, via scipy (see here).
What's your BLAS?
import numpy, scipy
print(numpy.show_config())
print(scipy.show_config())
Ok, yes, it could be there is some problem with our BLAS setup. Here is the output:
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
lapack_mkl_info:
NOT AVAILABLE
None
lapack_opt_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
language = f77
blas_opt_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
language = f77
openblas_lapack_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
language = f77
openblas_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
language = f77
blas_mkl_info:
NOT AVAILABLE
None
Hm, I wonder why numpy and scipy show different compilation (language f77 vs c) for OpenBlas. But maybe that's normal, I don't know enough about how numpy/scipy do this, what's expected.
When you CTRL+C interrupt the stuck program -- what is the traceback? Which Gensim line exactly did it get stuck on?
Are there any memory issues on the machine? (IIRC, this is one of the most memory-intensive spots in LSI)
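Not from this thread, but one low-risk way to get that traceback without killing the stuck worker: register faulthandler on a signal at worker startup, then send that signal to the hung process to dump every thread's stack to stderr. A sketch:

# Hypothetical helper, not part of gensim: add this near the start of the worker
# process (e.g. in a local copy of lsi_worker.py). A later `kill -USR1 <pid>`
# makes faulthandler dump all thread stacks to stderr, even while the process is
# busy inside a long BLAS call, showing which Python line it is stuck on.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)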
I agree, it doesn't look stuck to me, it looks like a slow QR (because the input matrix is pretty big, I guess).
The next logging line should appear soon: logger.debug("computing SVD of %s dense matrix", k.shape), but if that still isn't in your logs, then the QR is the bottleneck.
Nah, QR should take seconds, not hours. This looks stuck. But I'm wondering where exactly, and why. It could be some OpenBlas bug, or the machine went OOM and killed some critical processes…
IMHO, I don't think it was OOM. The machine still had quite a lot of memory available when this happened. I have now compiled numpy and scipy against ATLAS and restarted the process. I have understood that OpenBlas has some issues, so I hope ATLAS will be more stable.
When you CTRL+C interrupt the stuck program -- what is the traceback? Which Gensim line exactly did it get stuck on?
I don't have the exact traceback available, but from the main script, it was on lsi_model.add_documents(corpus[i:current_index]).
Also, regarding OOM possibility, this line was inside a try/except MemoryError. Not sure that would trigger though if the lsi_worker process itself caused the OOM.
Can you send the exact worker traceback next time it gets stuck?
Not sure what you mean by try/except MemoryError -- is this something inside Gensim, or your code?
Can you send the exact worker traceback next time it gets stuck?
Sure thing.
Not sure what you mean by try/except MemoryError -- is this something inside Gensim, or your code?
In my code. It looks like this:
try:
    lsi_model.add_documents(corpus[i:current_index])
except MemoryError as error:
    print_memory_usage()
    print(error)
    exc_type, exc_value, exc_traceback = sys.exc_info()
    traceback.print_tb(exc_traceback)
An update: I am now re-running the process using libatlas as the underlying BLAS. The workers aren't getting stuck, which is good, but there seems to be another issue: the memory usage of the workers is steadily increasing, which to me seems a bit odd. At least, I've never seen this happen before. The workers should release their memory after each job, correct?
Yes, they do release memory after each job. The Gensim LSI code is pure Python, so that cannot leak.
But it could be some artefact of Python memory management, or some other dependency leaking.
How much of an increase are you seeing (absolute + time derivative)? Does it stop after a while, or keeps growing steadily?
The process has been running for about 6.5 hours, and now the workers are consuming about 24-25 GB each. The growth seems pretty steady, and my guess is it will keep rising until throwing an OOM error. I will add some instrumentation to print more precise memory stats to the log, since now I am just going based on monitoring htop.
I would assume the memory leak is in libatlas, since this did not happen before switching from OpenBlas to libatlas. Any recommendations on which BLAS to use (Linux, please)?
I use OpenBlas, or if not available, ATLAS.
I'm wondering if it's something with your particular setup or HW, as opposed to the library itself. It's really strange. @menshikh-iv were there any changes to LSI on our side lately? Any cythonization, refactoring, …?
@piskvorky in 2018 - only cleanup (improve documentation, better six usage, etc), no optimizations or refactoring.
Ok, I'm going to try updating libatlas, and if it still leaks, then I will try switching to OpenBlas. I will also try to build OpenBlas from source on this machine. I'll keep you updated.
BTW, should I open up a different Github issue, since the memory leak seems to be a completely different issue than the original one? (I could be wrong about that, but my gut tells me they are unrelated.)
No, let's keep it in one place. I'm still not sure what the actual issue is, the BLAS leak is just a hypothesis (very strange, these libs are well-tested and used widely).
@robguinness how dense is your corpus? Can you post some stats? (number of docs, terms, non-zeros)
Are there any atypical documents / document blocks inside the corpus? Such as "most documents are sparse, but then mid-way there are 10,000 documents which are fully dense…" (I'm thinking that could be another reason why at some point, the memory suddenly shoots up)
Sure, here are some stats:
n_docs: 7138401
dim: 500000
nnz: 2188951661 (computed by loading each shard as a CSR and summing up all the nnz values in the corpus)
Average nnz per doc: 306.6
So it's pretty sparse, and I wouldn't expect any steep spikes in denseness.
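For reference, per-document nnz statistics like these can be collected from any corpus that streams documents in standard gensim BoW format; this is a minimal sketch, not the shard-based script actually used above:

import numpy as np

def corpus_nnz_stats(corpus):
    # assumes `corpus` yields documents as lists of (term_id, weight) tuples
    lengths = np.array([len(doc) for doc in corpus])  # nnz per document
    return {
        'n_docs': len(lengths),
        'nnz': int(lengths.sum()),
        'avg_nnz_per_doc': float(lengths.mean()),
        'max_nnz': int(lengths.max()),
        'min_nnz': int(lengths.min()),
        'stdev_nnz': float(lengths.std()),
        'docs_over_5000_nnz': int((lengths > 5000).sum()),
    }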
I've upgraded the system to Python 3.7.2, but workers seem to be still using increasingly more memory as the process runs. So I suspect a leak somewhere, but I'm not sure where either.
Thanks. What's the maximum nnz in one doc? max(len(doc) for doc in corpus)
Also, what is the type of corpus? You talk about some shards and CSR, so I guess it's not a "normal" MatrixMarket .mm / .mm.gz?
Here is the max, as well as a few other statistics concerning nnz that may (or may not) be useful:
max: 17687
min: 0
st.dev: 311.85
num. docs with nnz > 5000: 811
I'm using ShardedCorpus.
OK, thanks. That looks fine.
I'm not very familiar with ShardedCorpus -- can you try with normal MmCorpus too? To rule out an issue with the input storage.
MmCorpus.serialize("/tmp/corpus.mm", corpus)
(you can also compress the resulting .mm file to .mm.gz or .mm.bz2 to save space; gensim can open those compressed formats natively, without unpacking)
Do you mean convert the sharded corpus to an MmCorpus, or regenerate the corpus from scratch as an MmCorpus?
I naively tried the first option, i.e.:
corpus = ShardedCorpus.load(corpus_path)
corpus = MmCorpus.serialize("/data/tmp/corpus.mm", corpus)
But this fails with the following output:
[2019-01-22 19:16:10,023] [30478] [gensim.corpora.mmcorpus] [INFO] {save_corpus:122} storing corpus in Matrix Market format to /data/tmp/corpus.mm
[2019-01-22 19:16:10,023] [30478] [smart_open.smart_open_lib] [DEBUG] {smart_open:176} {'uri': '/data/tmp/corpus.mm', 'mode': 'wb+', 'kw': {}}
[2019-01-22 19:16:10,023] [30478] [gensim.matutils] [INFO] {write_headers:1244} saving sparse matrix to /data/tmp/corpus.mm
[2019-01-22 19:16:10,023] [30478] [smart_open.smart_open_lib] [DEBUG] {smart_open:176} {'uri': '/data/r_and_d/rob/data/tmp/corpus.0', 'mode': 'rb', 'kw': {}}
[2019-01-22 19:16:10,080] [30478] [gensim.matutils] [INFO] {write_corpus:1356} PROGRESS: saving document #0
Traceback (most recent call last):
File "scripts/main_model_create.py", line 321, in <module>
main()
File "scripts/main_model_create.py", line 300, in main
corpus, family_ids = get_corpus(phraser, bow, tfidf)
File "scripts/main_model_create.py", line 200, in get_corpus
corpus = MmCorpus.serialize("/data/tmp/corpus.mm", corpus)
File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/corpora/indexedcorpus.py", line 123, in serialize
offsets = serializer.save_corpus(fname, corpus, id2word, **kwargs)
File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/corpora/mmcorpus.py", line 125, in save_corpus
fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata
File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/matutils.py", line 1363, in write_corpus
max_id, veclen = mw.write_vector(docno, bow)
File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/matutils.py", line 1292, in write_vector
vector = sorted((i, w) for i, w in vector if abs(w) > 1e-12) # ignore near-zero entries
File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/matutils.py", line 1292, in <genexpr>
vector = sorted((i, w) for i, w in vector if abs(w) > 1e-12) # ignore near-zero entries
ValueError: too many values to unpack (expected 2)
[2019-01-22 19:16:10,640] [30478] [gensim.matutils] [DEBUG] {close:1398} closing /data/tmp/corpus.mm
Makefile:12: recipe for target 'model' failed
make: *** [model] Error 1
If it is not possible to convert a ShardedCorpus to MmCorpus, then I can regenerate it, but it will take a while.
I meant convert it (not re-generate).
Reading the ShardedCorpus docs, apparently iterating over it works differently than other Gensim corpora. To get standard behaviour, you'll need to set .gensim = True first:
# Convert ShardedCorpus into MatrixMarket format.
sharded_corpus = ShardedCorpus.load(corpus_path)
sharded_corpus.gensim = True
MmCorpus.serialize("/data/tmp/corpus.mm", sharded_corpus)
# Optionally, manually compress /data/tmp/corpus.mm to /data/tmp/corpus.mm.gz
# Load the converted MatrixMarket corpus, keep working with that from now on.
mm_corpus = MmCorpus("/data/tmp/corpus.mm.gz")
I suspect this might also be related to your memory woes: the ShardedCorpus iterator returns vectors of a different type by default (numpy arrays), whereas all models expect the vectors in standard Gensim format. Let me know if setting the .gensim attribute helped!
@robguinness Why are you using ShardedCorpus in the first place?
I actually tried that already, and it still gave the same error. I will try again to be doubly sure. Regarding the choice of ShardedCorpus, this was made before I joined the company, but I think it was probably because of the large size of the corpus.
I just checked our code, and the gensim attribute is already set to true when the corpus was generated:
corpus = ShardedCorpus(
    config['DATA']['CORPUS'],
    corpus_generator(),
    dim=bow_dimension,
    gensim=True,
    shardsize=corpus_shard_size,
    sparse_serialization=True
)
@robguinness check what ShardedCorpus generates, it should be docs in BoW format. But I think it isn't, because of _getitem_sparse2gensim.
Check this chain of ShardedCorpus method calls: __iter__ -> __getitem__ (from a number) -> _getitem_format(s_result) -> _getitem_sparse2gensim(s_result) -> row_sparse2gensim
I think _getitem_sparse2gensim(s_result) generates a tuple of BoWs, but it should generate only a BoW.
@robguinness what does next(iter(sharded_corpus)) give you?
what does next(iter(sharded_corpus)) give you?
In [86]: next(iter(sharded_corpus))
Out[86]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b53b9fed0>
Check this calls of ShardedCorpus methods: iter -> getitem from number -> _getitem_format(s_result) -> _getitem_sparse2gensim(s_result) -> row_sparse2gensim
In [87]: sharded_corpus.__iter__()
Out[87]: <generator object ShardedCorpus.__iter__ at 0x7f5b521bbd68>
In [88]: next(sharded_corpus.__iter__())
Out[88]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b53691228>
In [89]: sharded_corpus.__getitem__(0)
Out[89]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b2b9c8930>
In [90]: next(sharded_corpus.__getitem__(0))
Out[90]:
[(1851, array(0.02226481)),
...
(494311, array(0.01063436)),
(498611, array(0.0338675))]
(I'm not really sure what the argument s_result should be in the remaining calls.)
I'm currently trying a small hack to load the entire corpus into memory as a list:
corpus_all = []
for c_gen in corpus:
    for c in c_gen:
        corpus_all.append(c)
Not sure if it will fit though...
Out[86]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b53b9fed0>
Oh, I'm an idiot. Of course, the line output = (row_sparse2gensim(i, result) for i in range(result.shape[0])) assigns an iterator to output, not a tuple.
(I'm not really sure what the argument s_result should be in the remaining calls.)
I've copy-pasted the exact function calls from the code to make it easier to orient yourself in it. Thanks for a great answer with a slice of the shell session, it's very helpful.
I'm currently trying a small hack to load the entire corpus into memory as a list:
MmWriter can receive (according to the code) an iterator of docs, if you don't care about or don't have metadata in the corpus. So you can minimize memory consumption and not load everything into memory.
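A sketch of that streaming idea, assuming the nested-generator behaviour shown in the shell output above (each item of the ShardedCorpus iterator is itself a generator over BoW documents): flatten lazily and hand the iterator straight to the Matrix Market writer, so nothing is held in memory.

import itertools

from gensim.corpora import MmCorpus
from gensim.corpora.sharded_corpus import ShardedCorpus

sharded_corpus = ShardedCorpus.load(corpus_path)  # corpus_path as earlier in the thread
flat_docs = itertools.chain.from_iterable(sharded_corpus)  # yields one BoW doc at a time
MmCorpus.serialize("/data/tmp/corpus.mm", flat_docs)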
@robguinness that looks like a bug in ShardedCorpus, the result ought to be a sparse vector in Gensim format: a list of 2-tuples (feature id, feature weight), e.g. [(1, 0.3), (12, 0.14), …].
Can you give me list(next(iter(sharded_corpus))) as well?
Can you give me list(next(iter(sharded_corpus))) as well?
In [93]: list(next(iter(sharded_corpus)))
Out[93]:
[[(1851, array(0.02226481)),
(14120, array(0.03600967)),
...
(498611, array(0.0338675))]]
I managed to create an MmCorpus and use that as the input to the LsiModel, but the same leaky behavior occurs. After each job, the amount of memory used by a worker increases by roughly the same amount. I think the problem is really that the workers are not releasing the memory, as they should.
@robguinness That doesn't look right, and will not work. Can you send your output for next(iter(mm_corpus))?
@piskvorky it's the bug in ShardedCorpus._getitem_sparse2gensim that I've already written about.
Can you send your output for next(iter(mm_corpus))?
In [10]: next(iter(mm_corpus))
Out[10]:
[(1851, 0.022264811360803896),
(14120, 0.0360096687954834),
...
(498611, 0.033867502003783745)]
Last night I added some memory usage logging via the tracemalloc module. Every time the worker calls reset(), I take a snapshot and call snapshot.statistics('lineno'), then print out the top ten results (a sketch of this instrumentation follows the log below). Here is how the results look for one worker after running for a few hours:
2019-01-24 03:07:40,067 - INFO - resetting worker #0
2019-01-24 03:07:42,245 - DEBUG - [ Top 10 ]
2019-01-24 03:07:42,245 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=22.4 GiB, count=40, average=572 MiB
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:309: size=8011 MiB, count=14, average=572 MiB
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680846, average=63 B
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=109 KiB, count=210, average=532 B
2019-01-24 03:07:42,246 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:255: size=48.4 KiB, count=40, average=1240 B
2019-01-24 03:07:42,246 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=37.7 KiB, count=536, average=72 B
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/scipy/linalg/decomp_svd.py:129: size=34.5 KiB, count=49, average=720 B
2019-01-24 03:07:42,246 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=31.9 KiB, count=583, average=56 B
2019-01-24 03:07:42,246 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=22.1 KiB, count=238, average=95 B
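The log above comes from instrumentation along these lines; this is a sketch of what was added (the helper name is mine), not the exact patch:

import logging
import tracemalloc

logger = logging.getLogger(__name__)
tracemalloc.start()  # called once, at worker startup

def log_top_allocations(limit=10):
    # called from the worker's reset()
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    logger.debug("[ Top %d ]", limit)
    for stat in top_stats[:limit]:
        logger.debug("%s", stat)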
Here are lines 252-256 of lsimodel.py:
if self.u is None:
    # we are empty => result of merge is the other projection, whatever it is
    self.u = other.u.copy()
    self.s = other.s.copy()
    return
I am wondering if perhaps the other object is not getting garbage collected for some strange reason. I will try to explicitly del it before the return and see if this helps. Any other ideas are most welcomed!
P.S. I have read elsewhere that, while classical memory leaks are not possible within Python, they can occur in Python applications when using C extensions OR memory can grow unexpectedly (but technically not a memory leak) due to things like circular references. See, for example: http://tech.labs.oliverwyman.com/blog/2008/11/14/tracing-python-memory-leaks/ https://blogitplus.com/post/finding-and-fixing-memory-leaks-in-python-pid6351
@robguinness and this is with MmCorpus, right?
Because while the iteration output for MmCorpus is correct, the output for ShardedCorpus is broken. I don't understand how ShardedCorpus even worked for you, that should have been throwing exceptions right away. Weird.
If you're seeing the leak with MmCorpus too, the only way forward is to debug in detail. Does the issue happen even in serial mode (no distributed workers)?
Or try an older version such as 0.13.4 (LSI didn't significantly change in years), to rule out hidden errors due to some recent refactoring.
The above output was using ShardedCorpus with all ~7 million docs, but I saw the same behavior using MmCorpus with a smaller number of docs (since the full set won't fit into memory).
That's really strange about ShardedCorpus, because we have used it in our code for the last few years without seeing this particular issue.
Does the issue happen even in serial mode (no distributed workers)?
In serial mode, there are no workers, right, so I wouldn't see quite the same signature. But I can do some tests to see if the memory of the main process progressively grows.
I can try downgrading gensim, too, but I'm really starting to suspect that the problem is somewhere below gensim...e.g. numpy, scipy, BLAS. You really mean 0.13.4 of gensim? ...that is quite old indeed! ;-)
MmCorpus doesn't load the full document set into memory, so that's no reason to use ShardedCorpus.
I'd propose you avoid ShardedCorpus, not sure what's going on there… maybe some recent refactoring… I'm not familiar with that module. But I'd expect our unit tests to catch such a fundamental problem, strange. Either way, can you try training with MmCorpus?
@robguinness How much gain do you see from using distributed LSI? How many workers / how many hours, compared to a simple serial LsiModel(mm_corpus)?
And yes, 0.13.4 is old… but LSI/SVD is even older :-) Let's start by ditching ShardedCorpus first though.
Here is some output from a run using MmCorpus (still with gensim 3.6.0, I'll try downgrading next). The process has been running for about an hour, and you can see the number of objects and overall size increases with each job (see lsimodel.py:254):
2019-01-24 14:55:36,804 - INFO - running /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsi_worker.py
2019-01-24 14:55:36,883 - INFO - gensim.lsi_worker.3dc919 registered with nameserver (URI 'PYRO:gensim.lsi_worker.3dc919@127.0.0.1:36877')
2019-01-24 14:56:03,635 - INFO - initializing worker #7
2019-01-24 14:56:03,651 - INFO - using serial LSI version on this node
2019-01-24 14:56:21,790 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 14:56:21,791 - DEBUG - acquired lock 'lock_update' for reset
2019-01-24 14:56:21,791 - INFO - resetting worker #7
2019-01-24 14:56:24,161 - DEBUG - [ Top 10 ]
2019-01-24 14:56:24,162 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680849, average=63 B
2019-01-24 14:56:24,162 - DEBUG - <frozen importlib._bootstrap_external>:525: size=64.3 KiB, count=737, average=89 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=17.7 KiB, count=181, average=100 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:348: size=3696 B, count=7, average=528 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:552: size=3264 B, count=6, average=544 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:885: size=2720 B, count=5, average=544 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:500: size=2664 B, count=57, average=47 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:365: size=2640 B, count=5, average=528 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:917: size=2376 B, count=5, average=475 B
2019-01-24 14:56:24,163 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/naming.py:30: size=2318 B, count=12, average=193 B
2019-01-24 14:56:24,305 - DEBUG - releasing lock 'lock_update' for reset
2019-01-24 14:58:01,482 - INFO - worker #7 received job #0
2019-01-24 14:58:01,483 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-24 14:58:01,483 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-24 14:58:01,483 - INFO - updating model with new documents
2019-01-24 14:58:01,483 - INFO - using 100 extra samples and 2 power iterations
2019-01-24 14:58:01,483 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-24 14:58:02,027 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-24 14:58:04,532 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 14:59:00,483 - DEBUG - running 2 power iterations
2019-01-24 14:59:04,533 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:00:04,186 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:01:02,732 - INFO - 2nd phase: running dense svd on (400, 2000) matrix
2019-01-24 15:01:03,850 - INFO - computing the final decomposition
2019-01-24 15:01:03,850 - INFO - keeping 300 factors (discarding 10.982% of energy spectrum)
2019-01-24 15:01:15,395 - INFO - processed sparse job of 2000 documents
2019-01-24 15:01:15,398 - DEBUG - releasing lock 'lock_update' for processjob
2019-01-24 15:11:58,263 - DEBUG - acquiring lock 'lock_update' for getstate
2019-01-24 15:11:58,314 - DEBUG - acquired lock 'lock_update' for getstate
2019-01-24 15:11:58,314 - INFO - worker #7 returning its state after 1 jobs
2019-01-24 15:11:58,314 - DEBUG - releasing lock 'lock_update' for getstate
2019-01-24 15:11:59,704 - INFO - worker #7 stopping asking for jobs
2019-01-24 15:13:31,763 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 15:13:31,764 - INFO - resetting worker #7
2019-01-24 15:13:34,022 - DEBUG - [ Top 10 ]
2019-01-24 15:13:34,022 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=1144 MiB, count=2, average=572 MiB
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680847, average=63 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=70.6 KiB, count=130, average=556 B
2019-01-24 15:13:34,023 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 15:13:34,023 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=32.3 KiB, count=460, average=72 B
2019-01-24 15:13:34,023 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=25.3 KiB, count=463, average=56 B
2019-01-24 15:13:34,023 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=19.2 KiB, count=196, average=100 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/message.py:96: size=12.3 KiB, count=135, average=93 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsi_worker.py:129: size=11.4 KiB, count=30, average=387 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:425: size=6696 B, count=45, average=149 B
2019-01-24 15:13:34,178 - DEBUG - releasing lock 'lock_update' for reset
2019-01-24 15:14:15,130 - INFO - worker #7 received job #1
2019-01-24 15:14:15,130 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-24 15:14:15,130 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-24 15:14:15,130 - INFO - updating model with new documents
2019-01-24 15:14:15,131 - INFO - using 100 extra samples and 2 power iterations
2019-01-24 15:14:15,131 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-24 15:14:15,616 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-24 15:14:18,037 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:15:12,591 - DEBUG - running 2 power iterations
2019-01-24 15:15:16,596 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:16:18,061 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:17:15,299 - INFO - 2nd phase: running dense svd on (400, 2000) matrix
2019-01-24 15:17:16,504 - INFO - computing the final decomposition
2019-01-24 15:17:16,504 - INFO - keeping 300 factors (discarding 11.928% of energy spectrum)
2019-01-24 15:17:29,885 - INFO - processed sparse job of 2000 documents
2019-01-24 15:17:29,889 - DEBUG - releasing lock 'lock_update' for processjob
2019-01-24 15:28:41,883 - DEBUG - acquiring lock 'lock_update' for getstate
2019-01-24 15:28:41,884 - DEBUG - acquired lock 'lock_update' for getstate
2019-01-24 15:28:41,884 - INFO - worker #7 returning its state after 2 jobs
2019-01-24 15:28:41,884 - DEBUG - releasing lock 'lock_update' for getstate
2019-01-24 15:28:43,161 - INFO - worker #7 stopping asking for jobs
2019-01-24 15:30:14,545 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 15:30:14,545 - DEBUG - acquired lock 'lock_update' for reset
2019-01-24 15:30:14,546 - INFO - resetting worker #7
2019-01-24 15:30:16,773 - DEBUG - [ Top 10 ]
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=2289 MiB, count=4, average=572 MiB
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680847, average=63 B
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=78.9 KiB, count=175, average=462 B
2019-01-24 15:30:16,774 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 15:30:16,774 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=41.0 KiB, count=583, average=72 B
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:452: size=37.6 KiB, count=339, average=114 B
2019-01-24 15:30:16,774 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=34.4 KiB, count=629, average=56 B
2019-01-24 15:30:16,775 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/socketutil.py:151: size=22.9 KiB, count=39, average=600 B
2019-01-24 15:30:16,775 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=18.7 KiB, count=187, average=102 B
2019-01-24 15:30:16,775 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/message.py:96: size=18.7 KiB, count=214, average=89 B
2019-01-24 15:30:16,914 - DEBUG - releasing lock 'lock_update' for reset
2019-01-24 15:31:26,012 - INFO - worker #7 received job #2
2019-01-24 15:31:26,013 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-24 15:31:26,013 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-24 15:31:26,013 - INFO - updating model with new documents
2019-01-24 15:31:26,013 - INFO - using 100 extra samples and 2 power iterations
2019-01-24 15:31:26,013 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-24 15:31:26,506 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-24 15:31:29,215 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:32:28,062 - DEBUG - running 2 power iterations
2019-01-24 15:32:32,380 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:33:32,141 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:34:28,524 - INFO - 2nd phase: running dense svd on (400, 2000) matrix
2019-01-24 15:34:29,620 - INFO - computing the final decomposition
2019-01-24 15:34:29,621 - INFO - keeping 300 factors (discarding 11.348% of energy spectrum)
2019-01-24 15:34:42,370 - INFO - processed sparse job of 2000 documents
2019-01-24 15:34:42,374 - DEBUG - releasing lock 'lock_update' for processjob
2019-01-24 15:45:23,253 - DEBUG - acquiring lock 'lock_update' for getstate
2019-01-24 15:45:23,253 - DEBUG - acquired lock 'lock_update' for getstate
2019-01-24 15:45:23,253 - INFO - worker #7 returning its state after 3 jobs
2019-01-24 15:45:23,254 - DEBUG - releasing lock 'lock_update' for getstate
2019-01-24 15:45:24,522 - INFO - worker #7 stopping asking for jobs
2019-01-24 15:46:56,122 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 15:46:56,122 - DEBUG - acquired lock 'lock_update' for reset
2019-01-24 15:46:56,122 - INFO - resetting worker #7
2019-01-24 15:46:58,404 - DEBUG - [ Top 10 ]
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=3433 MiB, count=6, average=572 MiB
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680847, average=63 B
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=85.2 KiB, count=155, average=563 B
2019-01-24 15:46:58,405 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 15:46:58,405 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=40.8 KiB, count=580, average=72 B
2019-01-24 15:46:58,405 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=34.4 KiB, count=629, average=56 B
2019-01-24 15:46:58,405 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=18.7 KiB, count=187, average=102 B
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/message.py:96: size=12.4 KiB, count=140, average=91 B
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:425: size=7968 B, count=52, average=153 B
2019-01-24 15:46:58,406 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:255: size=7440 B, count=6, average=1240 B
2019-01-24 15:46:58,537 - DEBUG - releasing lock 'lock_update' for reset
If I let it run for a couple more hours, I'm quite sure it will keep increasing. I will kill it now though and try downgrading to 0.13.4 to see if this issue goes away.
How much gain do you see from using distributed LSI? How many workers / how many hours, compared to a simple serial LsiModel(mm_corpus)?
Quite a bit of gain, but I don't have exact numbers handy. In this run I am using 8 workers, but depending on the other parameters (BOW dimension, number of topics, etc.), I have used as few as 4 and as many as 16.
FYI, downgrading to gensim 0.13.4 also causes downgrades to numpy-1.16.0 and scipy-1.2.0. Just a note that if the problem goes away, it might be hard to trace down the cause, since we won't know if it was a bug in the more recent releases of these packages.
I got an error when trying to run under gensim 0.13.4:
[2019-01-24 16:52:57,336] [13890] [root] [INFO] {main:312} Start - generating model
[2019-01-24 16:52:57,336] [13890] [gensim.utils] [INFO] {load:247} loading Phraser object from /data/r_and_d/teqmine-ai/data/phraser
[2019-01-24 16:52:57,336] [13890] [smart_open.smart_open_lib] [DEBUG] {smart_open:176} {'uri': '/data/r_and_d/teqmine-ai/data/phraser', 'mode': 'rb', 'kw': {}}
Traceback (most recent call last):
File "scripts/main_model_create.py", line 337, in <module>
main()
File "scripts/main_model_create.py", line 314, in main
phraser = get_phraser()
File "scripts/main_model_create.py", line 118, in get_phraser
phraser = Phraser.load(phraser_path)
File "/data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/utils.py", line 251, in load
obj = unpickle(fname)
File "/data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/utils.py", line 916, in unpickle
return _pickle.load(f, encoding='latin1')
AttributeError: Can't get attribute 'original_scorer' on <module 'gensim.models.phrases' from '/data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/phrases.py'>
It seems that the phraser that was built with gensim 3.6.0 is not compatible with 0.13.4. I would have to regenerate it, which for the whole 7 million documents will take a while. So please give me a few days.
@robguinness I think the problem is not in gensim itself, but in libatlas. Of course, copying a big array causes large allocations, but tracemalloc shows allocations of Python only, so if the memory leak is in another library we will not see it.
Of course, copying a big array causes large allocations, but tracemalloc shows allocations of Python only, so if the memory leak is in another library we will not see it.
@horpto I'm not sure I completely follow you. The memory usage of Python is growing in the worker, as can be seen with tracemalloc in the output above, which is unexpected behavior since the worker should be flushing all the data it holds after each job. It may not be a "memory leak" in the classical sense, but it still causes the Python memory usage to grow with each worker job.
What do these tracemalloc numbers mean? If it's just a cumulative sum of memory allocated at a particular line, that may be perfectly OK as long as the memory gets released later. Does an increasing number really imply a leak?
What does top / RAM monitor show for RSS RAM when training with MmCorpus in serial mode?
@robguinness Before downgrading, I'd urge you to try training with MmCorpus (it's streamed, fast and doesn't load all docs into RAM). No ShardedCorpus. Especially if you can reproduce the issue in serial LSI mode too, that simplifies testing and replication tremendously.
Phraser seems unrelated to LSI, where is that coming from? Let's try to get to a minimum example, removing moving pieces unrelated to the problem. Just the input (MmCorpus) + training (LsiModel) please.
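Something like this would be the minimum example being asked for; a sketch, with the corpus path taken from the conversion step earlier, and num_topics/chunksize matching the worker logs:

from gensim.corpora import MmCorpus
from gensim.models import LsiModel

# Minimal serial repro: MmCorpus in, LsiModel training, no dispatcher, no
# workers, no Phraser. Watch the process RSS (e.g. in top/htop) while it runs.
mm_corpus = MmCorpus("/data/tmp/corpus.mm")
lsi = LsiModel(
    corpus=mm_corpus,
    num_topics=300,     # "keeping 300 factors" in the logs above
    chunksize=2000,     # "processed sparse job of 2000 documents"
    distributed=False,  # serial mode
)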
@robguinness ok, you're right. I've tested tracemalloc on this example:
import tracemalloc as tm

def test():
    return [i*i for i in range(1_000_000)]

tm.start()
snap1 = tm.take_snapshot()

a = test()
snap2 = tm.take_snapshot()
top_stats = snap2.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)

print("DIFF")
top_stats = snap2.compare_to(snap1, 'lineno')
for stat in top_stats[:10]:
    print(stat)

a = test()
snap3 = tm.take_snapshot()
top_stats = snap3.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)

print("DIFF")
top_stats = snap3.compare_to(snap2, 'lineno')
for stat in top_stats[:10]:
    print(stat)

print("DIFF2")
top_stats = snap3.compare_to(snap1, 'lineno')
for stat in top_stats[:10]:
    print(stat)
If you turn off distributed mode, does the memory issue stay? (I think yes, it stays.)
Then one suspicious part of the code for me is qr_destroy. I'd mock it, for example, to just return an array of random data without any real calculations.
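A lighter-weight variant of that experiment (my sketch, not gensim code, and wrapping rather than mocking): patch gensim.matutils.qr_destroy inside the worker process so every call is timed and its peak-RSS growth logged, while the real return value is passed through. That should show whether the per-job growth happens inside the QR calls.

import functools
import logging
import resource
import time

from gensim import matutils

logger = logging.getLogger(__name__)
_real_qr_destroy = matutils.qr_destroy

@functools.wraps(_real_qr_destroy)
def qr_destroy_logged(*args, **kwargs):
    # log duration and peak-RSS growth of each QR call, then pass the result through
    rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    start = time.time()
    result = _real_qr_destroy(*args, **kwargs)
    rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.debug("qr_destroy: %.1fs, peak RSS grew by %d KiB",
                 time.time() - start, rss_after - rss_before)
    return result

matutils.qr_destroy = qr_destroy_logged  # apply before training starts, in each worker process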
I'm out sick today, so I'll probably get back to troubleshooting on Monday. One small comment for @piskvorky...you are right, Phraser is completely unrelated to this. We were unnecessarily loading the serialized Phraser object in our code. I can just get rid of that line completely.
Thanks for the help and suggestions from everyone.
Hi all, I have tentatively tracked the memory leak down to scipy/numpy. I'm using an older release of both libraries now, and the problem seems to have gone away. I will try to reproduce the problem and narrow it down more conclusively, but at the moment we need to get this model built, so the machine is unavailable.
No problem. Make sure to replace ShardedCorpus with MmCorpus in any case.
Description
When building an LsiModel in distributed mode, one of the workers gets "stuck" while orthonormalizing the action matrix. This stalls the whole process of building the model, as the dispatcher hangs on "reached the end of input; now waiting for all remaining jobs to finish".
Steps/Code/Corpus to Reproduce
The LSI dispatcher and workers are initialized in a separate bash script. I have tried with the number of LSI workers set to 16 and 8.
Gensim version: 3.6.0
Pyro4 version: 4.63
Expected Results
Process should run to completion
Actual Results
Main script output:
Output of LSI worker that is stuck:
CPU for that LSI worker has been ~100% for >24 hours.
Versions
Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]
NumPy 1.15.2
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1