piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

LSI worker getting "stuck" #2317

Closed: robguinness closed this issue 5 years ago

robguinness commented 5 years ago

Description

When building an LsiModel in distributed mode, one of the workers gets "stuck" while orthonormalizing the action matrix. This stalls the whole process of building the model, as the dispatcher hangs on "reached the end of input; now waiting for all remaining jobs to finish".

Steps/Code/Corpus to Reproduce

lsi_model = LsiModel(
    id2word=bow,
    num_topics=300,
    chunksize=5000,
    distributed=True
)
lsi_model.add_documents(corpus)

The LSI dispatcher and workers are initialized in a separate bash script. I have tried with the number of LSI workers set to 16 and to 8.

Gensim version: 3.6.0
Pyro4 version: 4.63

Expected Results

Process should run to completion

Actual Results

Main script output:

[2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:462} updating model with new documents
[2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:485} initializing 8 workers
[2019-01-06 04:05:12,131] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:05:12,135] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:05:12,497] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #0
[2019-01-06 04:05:12,541] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #5000
[2019-01-06 04:06:46,191] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:06:46,200] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:06:46,618] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #1
[2019-01-06 04:06:46,682] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #10000
[2019-01-06 04:08:11,839] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:08:11,843] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:08:12,561] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #2
[2019-01-06 04:08:12,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #15000
[2019-01-06 04:09:48,217] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:09:48,230] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:09:48,700] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #3
[2019-01-06 04:09:48,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #20000
[2019-01-06 04:09:48,938] [23465] [gensim.models.lsimodel] [INFO] {add_documents:518} reached the end of input; now waiting for all remaining jobs to finish

Output of LSI worker that is stuck:

2019-01-06 04:04:09,867 - INFO - resetting worker #1
2019-01-06 04:06:46,705 - INFO - worker #1 received job #208
2019-01-06 04:06:46,705 - INFO - updating model with new documents
2019-01-06 04:06:46,705 - INFO - using 100 extra samples and 2 power iterations
2019-01-06 04:06:46,705 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-06 04:06:48,402 - INFO - orthonormalizing (500000, 400) action matrix

CPU usage for that LSI worker has been at ~100% for more than 24 hours.

Versions

Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]
NumPy 1.15.2
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1

horpto commented 5 years ago

Hi @robguinness

Can you increase the logging level of lsi_worker to DEBUG (in gensim.models.lsi_worker, the line with logging.basicConfig) and post the output here?
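
For instance, something like this in gensim/models/lsi_worker.py (the format string below mirrors the worker log output in this thread; the exact basicConfig arguments may differ between gensim versions):

import logging

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.DEBUG,  # was logging.INFO
)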

robguinness commented 5 years ago

Will do ASAP. It might be a day or two before I can get back to this.

robguinness commented 5 years ago

Just a small update: I have restarted the modeling process with logging level now set to DEBUG. I'll update when/if it gets stuck again.

robguinness commented 5 years ago

I hit the locked up situation again this morning. Here is a log from the LSI worker that is locked up:

2019-01-16 11:39:11,574 - INFO - worker #6 received job #146
2019-01-16 11:39:11,574 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-16 11:39:11,574 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-16 11:39:11,575 - INFO - updating model with new documents
2019-01-16 11:39:11,575 - INFO - using 100 extra samples and 2 power iterations
2019-01-16 11:39:11,575 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-16 11:39:12,637 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-16 11:39:18,284 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-16 11:40:16,122 - DEBUG - running 2 power iterations
2019-01-16 11:40:23,740 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-16 11:41:26,022 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-16 11:42:28,981 - INFO - 2nd phase: running dense svd on (400, 1000) matrix
2019-01-16 11:42:33,172 - INFO - computing the final decomposition
2019-01-16 11:42:33,174 - INFO - keeping 300 factors (discarding 13.873% of energy spectrum)
2019-01-16 11:42:39,870 - INFO - merging projections: (500000, 300) + (500000, 300)
2019-01-16 11:42:39,872 - DEBUG - constructing orthogonal component
2019-01-16 11:42:39,873 - DEBUG - converting (500000, 300) array self.u to FORTRAN order
2019-01-16 11:42:50,973 - DEBUG - converting (500000, 300) array self.u to C order
2019-01-16 11:43:00,259 - DEBUG - computing QR of (500000, 300) dense matrix

Has been in this state for nearly 2 hours now.

piskvorky commented 5 years ago

Hi @robguinness, that seems to be an issue with the BLAS library you are using. Gensim depends on BLAS to do the QR, via scipy (see here).

What's your BLAS?

import numpy, scipy
print(numpy.show_config())
print(scipy.show_config())

robguinness commented 5 years ago

Ok, yes, it could be that there is some problem with our BLAS setup. Here is the output:


lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
None
lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
openblas_lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
blas_mkl_info:  
  NOT AVAILABLE 
None
piskvorky commented 5 years ago

Hm, I wonder why numpy and scipy show different compilation settings (language f77 vs c) for OpenBLAS. But maybe that's normal; I don't know enough about how numpy/scipy handle this to say what's expected.

When you CTRL+C interrupt the stuck program -- what is the traceback? Which Gensim line exactly did it get stuck on?

Are there any memory issues on the machine? (IIRC, this is one of the most memory-intensive spots in LSI)

menshikh-iv commented 5 years ago

I agree, this doesn't look like a hang, it looks like a slow QR (the input matrix is pretty big, I guess). The next logging line should appear soon: logger.debug("computing SVD of %s dense matrix", k.shape); if that still isn't in your logs, the QR is the bottleneck.

piskvorky commented 5 years ago

Nah, QR should take seconds, not hours. This looks stuck. But I'm wondering where exactly, and why. It could be some OpenBlas bug, or the machine went OOM and killed some critical processes…

robguinness commented 5 years ago

IMHO, I don't think it was OOM. The machine still had quite a lot of memory available when this happened. I have now compiled numpy and scipy against ATLAS and restarted the process. I understand that OpenBLAS has some known issues, so I hope ATLAS will be more stable.

When you CTRL+C interrupt the stuck program -- what is the traceback? Which Gensim line exactly did it get stuck on?

I don't have the exact traceback available, but from the main script, it was on lsi_model.add_documents(corpus[i:current_index]).

Also, regarding the OOM possibility, this line was inside a try/except MemoryError. I'm not sure that would trigger, though, if the lsi_worker process itself caused the OOM.

piskvorky commented 5 years ago

Can you send the exact worker traceback next time it gets stuck?

Not sure what you mean by try/except MemoryError -- is this something inside Gensim, or your code?

robguinness commented 5 years ago

Can you send the exact worker traceback next time it gets stuck?

Sure thing.

Not sure what you mean by try/except MemoryError -- is this something inside Gensim, or your code?

In my code. It looks like this:

try:
    lsi_model.add_documents(corpus[i:current_index])
except MemoryError as error:
    print_memory_usage()
    print(error)
    exc_type, exc_value, exc_traceback = sys.exc_info()
    traceback.print_tb(exc_traceback)

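For reference, print_memory_usage is a helper in our own script; a minimal stand-in using only the standard library might look like this (on Linux, ru_maxrss is reported in kilobytes):

import resource

def print_memory_usage():
    # Peak resident set size of the current process, in MB.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("peak RSS: %.1f MB" % (peak_kb / 1024.0))
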
robguinness commented 5 years ago

An update: I am now re-running the process, using libatlas as the underlying BLAS. The workers aren't getting stuck, which is good, but there seems to be another issue: the memory usage of the workers is steadily increasing, which seems a bit odd to me. At least, I've never seen this happen before. The workers should release their memory after each job, correct?

piskvorky commented 5 years ago

Yes, they do release memory after each job. The Gensim LSI code is pure Python, so that cannot leak.

But it could be some artefact of Python memory management, or some other dependency leaking.

How much of an increase are you seeing (absolute + time derivative)? Does it stop after a while, or does it keep growing steadily?

robguinness commented 5 years ago

The process has been running for about 6.5 hours, and the workers are now consuming about 24-25 GB each. The growth seems pretty steady, and my guess is it will keep rising until it throws an OOM error. I will add some instrumentation to print more precise memory stats to the log, since right now I am just going by monitoring htop.

I would assume the memory leak is in libatlas, since this did not happen before switching from OpenBlas to libatlas. Any recommendations on which BLAS to use (Linux, please)?

piskvorky commented 5 years ago

I use OpenBlas, or if not available, ATLAS.

I'm wondering if it's something with your particular setup or HW, as opposed to the library itself. It's really strange. @menshikh-iv were there any changes to LSI on our side lately? Any cythonization, refactoring, …?

menshikh-iv commented 5 years ago

@piskvorky in 2018 there was only cleanup (improved documentation, better six usage, etc.), no optimizations or refactoring.

robguinness commented 5 years ago

Ok, I'm going to try updating libatlas, and if it still leaks, then I will try switching to OpenBlas. I will also try to build OpenBlas from source on this machine. I'll keep you updated.

BTW, should I open a different GitHub issue, since the memory leak seems to be a completely different issue than the original one? (I could be wrong about that, but my gut tells me they are unrelated.)

piskvorky commented 5 years ago

No, let's keep it in one place. I'm still not sure what the actual issue is, the BLAS leak is just a hypothesis (very strange, these libs are well-tested and used widely).

@robguinness how dense is your corpus? Can you post some stats? (number of docs, terms, non-zeros) Are there any atypical documents / document blocks inside the corpus? Such as "most documents are sparse, but then mid-way there are 10,000 documents which are fully dense…" (I'm thinking that could be another reason why at some point, the memory suddenly shoots up)

robguinness commented 5 years ago

Sure, here are some stats:

n_docs: 7138401
dim: 500000
nnz: 2188951661 (computed by loading each shard as a CSR matrix and summing up all the nnz values in the corpus)
Average nnz per doc: 306.6

So it's pretty sparse, and I wouldn't expect any steep spikes in denseness.

I've upgraded the system to Python 3.7.2, but the workers still seem to use more and more memory as the process runs. So I suspect a leak somewhere, but I'm not sure where either.

piskvorky commented 5 years ago

Thanks. What's the maximum nnz in one doc? max(len(doc) for doc in corpus)

Also, what is the type of your corpus? You talk about shards and CSR, so I guess it's not a "normal" MatrixMarket .mm / .mm.gz?

robguinness commented 5 years ago

Here is the max, as well as a few other statistics concerning nnz that may (or may not) be useful:

max: 17687
min: 0
st.dev: 311.85
num. docs with nnz > 5000: 811

I'm using ShardedCorpus.

piskvorky commented 5 years ago

OK, thanks. That looks fine.

I'm not very familiar with ShardedCorpus -- can you try with normal MmCorpus too? To rule out an issue with the input storage.

MmCorpus.serialize("/tmp/corpus.mm", corpus)

(You can also compress the resulting .mm file to .mm.gz or .mm.bz2 to save space; gensim can open those compressed formats natively, without unpacking.)

robguinness commented 5 years ago

Do you mean convert the sharded corpus to an MmCorpus, or regenerate the corpus from scratch as an MmCorpus?

I naively tried the first option, i.e.:

corpus = ShardedCorpus.load(corpus_path)
corpus = MmCorpus.serialize("/data/tmp/corpus.mm", corpus)

But this fails with the following output:

[2019-01-22 19:16:10,023] [30478] [gensim.corpora.mmcorpus] [INFO] {save_corpus:122} storing corpus in Matrix Market format to /data/tmp/corpus.mm
[2019-01-22 19:16:10,023] [30478] [smart_open.smart_open_lib] [DEBUG] {smart_open:176} {'uri': '/data/tmp/corpus.mm', 'mode': 'wb+', 'kw': {}}
[2019-01-22 19:16:10,023] [30478] [gensim.matutils] [INFO] {write_headers:1244} saving sparse matrix to /data/tmp/corpus.mm
[2019-01-22 19:16:10,023] [30478] [smart_open.smart_open_lib] [DEBUG] {smart_open:176} {'uri': '/data/r_and_d/rob/data/tmp/corpus.0', 'mode': 'rb', 'kw': {}}
[2019-01-22 19:16:10,080] [30478] [gensim.matutils] [INFO] {write_corpus:1356} PROGRESS: saving document #0
Traceback (most recent call last):
  File "scripts/main_model_create.py", line 321, in <module>
    main()
  File "scripts/main_model_create.py", line 300, in main
    corpus, family_ids = get_corpus(phraser, bow, tfidf)
  File "scripts/main_model_create.py", line 200, in get_corpus
    corpus = MmCorpus.serialize("/data/tmp/corpus.mm", corpus)
  File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/corpora/indexedcorpus.py", line 123, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, **kwargs)
  File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/corpora/mmcorpus.py", line 125, in save_corpus
    fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata
  File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/matutils.py", line 1363, in write_corpus
    max_id, veclen = mw.write_vector(docno, bow)
  File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/matutils.py", line 1292, in write_vector
    vector = sorted((i, w) for i, w in vector if abs(w) > 1e-12)  # ignore near-zero entries
  File "/data/r_and_d/rob/.venv/lib/python3.7/site-packages/gensim/matutils.py", line 1292, in <genexpr>
    vector = sorted((i, w) for i, w in vector if abs(w) > 1e-12)  # ignore near-zero entries
ValueError: too many values to unpack (expected 2)
[2019-01-22 19:16:10,640] [30478] [gensim.matutils] [DEBUG] {close:1398} closing /data/tmp/corpus.mm
Makefile:12: recipe for target 'model' failed
make: *** [model] Error 1

If it is not possible to convert a ShardedCorpus to an MmCorpus, then I can regenerate it, but it will take a while.

piskvorky commented 5 years ago

I meant convert it (not re-generate).

Reading the ShardedCorpus docs, apparently iterating over it works differently than other Gensim corpora. To get standard behaviour, you'll need to set .gensim = True first:

# Convert ShardedCorpus into MatrixMarket format.
sharded_corpus = ShardedCorpus.load(corpus_path)
sharded_corpus.gensim = True
MmCorpus.serialize("/data/tmp/corpus.mm", sharded_corpus)

# Optionally, manually compress /data/tmp/corpus.mm to /data/tmp/corpus.mm.gz

# Load the converted MatrixMarket corpus, keep working with that from now on.
mm_corpus = MmCorpus("/data/tmp/corpus.mm.gz")

I suspect this might also be related to your memory woes: the ShardedCorpus iterator returns vectors of a different type by default (numpy arrays), whereas all models expect the vectors in standard Gensim format. Let me know if setting the .gensim attribute helped!

@robguinness Why are you using ShardedCorpus in the first place?

robguinness commented 5 years ago

I actually tried that already, and it still gave the same error. I will try again to be doubly sure. Regarding the choice of ShardedCorpus, this was made before I joined the company, but I think it was probably motivated by the large size of the corpus.

robguinness commented 5 years ago

I just checked our code, and the gensim attribute was already set to True when the corpus was generated:

corpus = ShardedCorpus(
    config['DATA']['CORPUS'],
    corpus_generator(),
    dim=bow_dimension,
    gensim=True,
    shardsize=corpus_shard_size,
    sparse_serialization=True
)

horpto commented 5 years ago

@robguinness check what ShardedCorpus generates. It should be docs in BoW format, but I think it isn't, because of _getitem_sparse2gensim. Check this chain of ShardedCorpus method calls: __iter__ -> __getitem__ (with an integer index) -> _getitem_format(s_result) -> _getitem_sparse2gensim(s_result) -> row_sparse2gensim

I think _getitem_sparse2gensim(s_result) generates a tuple of BoWs but should yield only a single BoW.

piskvorky commented 5 years ago

@robguinness what does next(iter(sharded_corpus)) give you?

robguinness commented 5 years ago

what does next(iter(sharded_corpus)) give you?

In [86]: next(iter(sharded_corpus))
Out[86]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b53b9fed0>

Check this calls of ShardedCorpus methods: iter -> getitem from number -> _getitem_format(s_result) -> _getitem_sparse2gensim(s_result) -> row_sparse2gensim

In [87]: sharded_corpus.__iter__()
Out[87]: <generator object ShardedCorpus.__iter__ at 0x7f5b521bbd68>

In [88]: next(sharded_corpus.__iter__())
Out[88]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b53691228>

In [89]: sharded_corpus.__getitem__(0)
Out[89]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b2b9c8930>

In [90]: next(sharded_corpus.__getitem__(0))
Out[90]: 
[(1851, array(0.02226481)),
...
 (494311, array(0.01063436)),
 (498611, array(0.0338675))]

(I'm not really sure what the argument s_result should be in the remaining calls.)

I'm currently trying a small hack to load the entire corpus into memory as a list:

corpus_all = []
for c_gen in corpus:
    for c in c_gen:
        corpus_all.append(c)

Not sure if it will fit though...

horpto commented 5 years ago

Out[86]: <generator object ShardedCorpus._getitem_sparse2gensim.<locals>.<genexpr> at 0x7f5b53b9fed0>

Oh, I'm an idiot. Of course, the line output = (row_sparse2gensim(i, result) for i in range(result.shape[0])) assigns a generator to output, not a tuple.

(I'm not really sure what the argument s_result should be in the remaining calls.)

I've copy-pasted the exact function calls from the code to make it easier to orient yourself in the code. Thanks for a great answer with a slice of the shell session. It's very helpful.

I'm currently trying a small hack to load the entire corpus into memory as a list:

MmWriter can receive (according to the code) an iterator of docs, if you don't care about or don't have metadata in the corpus. So you can minimize memory consumption and avoid loading everything into memory.
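
A rough sketch of that idea, reusing the nested loop from the comment above as a generator (the path is illustrative, and this does not fix the underlying ShardedCorpus output format):

from gensim.corpora import MmCorpus
from gensim.corpora.sharded_corpus import ShardedCorpus

def stream_docs(sharded):
    # ShardedCorpus currently yields generators of documents; flatten them
    # one document at a time instead of materializing the whole corpus.
    for doc_gen in sharded:
        for doc in doc_gen:
            yield doc

sharded_corpus = ShardedCorpus.load("/data/tmp/corpus")  # illustrative path
sharded_corpus.gensim = True
MmCorpus.serialize("/data/tmp/corpus.mm", stream_docs(sharded_corpus))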

piskvorky commented 5 years ago

@robguinness that looks like a bug in ShardedCorpus; the result ought to be a sparse vector in Gensim format, i.e. a list of 2-tuples (feature id, feature weight), e.g. [(1, 0.3), (12, 0.14), …].

Can you give me list(next(iter(sharded_corpus))) as well?

robguinness commented 5 years ago

Can you give me list(next(iter(sharded_corpus))) as well?

In [93]: list(next(iter(sharded_corpus)))
Out[93]: 
[[(1851, array(0.02226481)),
  (14120, array(0.03600967)),
...
  (498611, array(0.0338675))]]

I managed to create an MmCorpus and use that as the input to the LsiModel, but the same leaky behavior occurs. After each job, the amount of memory used by a worker increases by roughly the same amount. I think the problem is really that the workers are not releasing the memory, as they should.

piskvorky commented 5 years ago

@robguinness That doesn't look right, and will not work. Can you send your output for next(iter(mm_corpus))?

horpto commented 5 years ago

@piskvorky it's a bug in ShardedCorpus._getitem_sparse2gensim; I've already written that above.

robguinness commented 5 years ago

Can you send your output for next(iter(mm_corpus))?

In [10]: next(iter(mm_corpus))
Out[10]: 
[(1851, 0.022264811360803896),
 (14120, 0.0360096687954834),
...
 (498611, 0.033867502003783745)]

Last night I added some memory usage logging via the tracemalloc module. Every time the worker calls reset(), I take a snapshot and call snapshot.statistics('lineno'), then print out the top ten results. Here is how the results look for one worker after running for a few hours:

2019-01-24 03:07:40,067 - INFO - resetting worker #0
2019-01-24 03:07:42,245 - DEBUG - [ Top 10 ]
2019-01-24 03:07:42,245 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=22.4 GiB, count=40, average=572 MiB
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:309: size=8011 MiB, count=14, average=572 MiB
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680846, average=63 B
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=109 KiB, count=210, average=532 B
2019-01-24 03:07:42,246 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:255: size=48.4 KiB, count=40, average=1240 B
2019-01-24 03:07:42,246 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=37.7 KiB, count=536, average=72 B
2019-01-24 03:07:42,246 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/scipy/linalg/decomp_svd.py:129: size=34.5 KiB, count=49, average=720 B
2019-01-24 03:07:42,246 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=31.9 KiB, count=583, average=56 B
2019-01-24 03:07:42,246 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=22.1 KiB, count=238, average=95 B
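
For reference, the instrumentation described above might look roughly like this (a sketch; the hook into the worker's reset() and the logger name are assumptions):

import tracemalloc

tracemalloc.start()  # called once, at worker start-up

def log_top_allocations(logger, limit=10):
    # Take a snapshot of current Python allocations and log the biggest sources.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    logger.debug("[ Top %d ]", limit)
    for stat in top_stats[:limit]:
        logger.debug(stat)

# ...and inside the worker's reset(): log_top_allocations(logger)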

Here are lines 252-256 of lsimodel.py:

        if self.u is None:
            # we are empty => result of merge is the other projection, whatever it is
            self.u = other.u.copy()
            self.s = other.s.copy()
            return

I am wondering if perhaps the other object is not getting garbage collected for some strange reason. I will try to explicitly del it before the return and see if this helps. Any other ideas are most welcome!

P.S. I have read elsewhere that, while classical memory leaks are not possible in pure Python, they can occur in Python applications that use C extensions, or memory can grow unexpectedly (technically not a leak) due to things like circular references. See, for example: http://tech.labs.oliverwyman.com/blog/2008/11/14/tracing-python-memory-leaks/ https://blogitplus.com/post/finding-and-fixing-memory-leaks-in-python-pid6351

piskvorky commented 5 years ago

@robguinness and this is with MmCorpus, right?

Because while the iteration output for MmCorpus is correct, the output for ShardedCorpus is broken. I don't understand how ShardedCorpus even worked for you, that should have been throwing exceptions right away. Weird.

If you're seeing the leak with MmCorpus too, the only way forward is to debug in detail. Does the issue happen even in serial mode (no distributed workers)?

Or try an older version such as 0.13.4 (LSI hasn't changed significantly in years), to rule out hidden errors due to some recent refactoring.

robguinness commented 5 years ago

The above output was using ShardedCorpus with all ~7 million docs, but I saw the same behavior using MmCorpus with a smaller number of docs (since the full set won't fit into memory).

That's really strange about ShardedCorpus, because we have used it in our code for the last few years without seeing this particular issue.

Does the issue happen even in serial mode (no distributed workers)?

In serial mode, there are no workers, right, so I wouldn't see quite the same signature. But I can do some tests to see if the memory of the main process progressively grows.

I can try downgrading gensim, too, but I'm really starting to suspect that the problem is somewhere below gensim...e.g. numpy, scipy, BLAS. You really mean 0.13.4 of gensim? ...that is quite old indeed! ;-)

piskvorky commented 5 years ago

MmCorpus doesn't load the full document set into memory, so that's no reason to use ShardedCorpus.

I'd propose you avoid ShardedCorpus, not sure what's going on there… maybe some recent refactoring… I'm not familiar with that module. But I'd expect our unit tests to catch such a fundamental problem, strange. Either way, can you try training with MmCorpus?

@robguinness How much gain do you see from using distributed LSI? How many workers / how many hours, compared to a simple serial LsiModel(mm_corpus)?

And yes, 0.13.4 is old… but LSI/SVD is even older :-) Let's start by ditching ShardedCorpus first though.

robguinness commented 5 years ago

Here is some output from a run using MmCorpus (still with gensim 3.6.0, I'll try downgrading next). The process has been running for about an hour, and you can see that the number of objects and the overall size increase with each job (see lsimodel.py:254):

2019-01-24 14:55:36,804 - INFO - running /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsi_worker.py
2019-01-24 14:55:36,883 - INFO - gensim.lsi_worker.3dc919 registered with nameserver (URI 'PYRO:gensim.lsi_worker.3dc919@127.0.0.1:36877')
2019-01-24 14:56:03,635 - INFO - initializing worker #7
2019-01-24 14:56:03,651 - INFO - using serial LSI version on this node
2019-01-24 14:56:21,790 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 14:56:21,791 - DEBUG - acquired lock 'lock_update' for reset
2019-01-24 14:56:21,791 - INFO - resetting worker #7
2019-01-24 14:56:24,161 - DEBUG - [ Top 10 ]
2019-01-24 14:56:24,162 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680849, average=63 B
2019-01-24 14:56:24,162 - DEBUG - <frozen importlib._bootstrap_external>:525: size=64.3 KiB, count=737, average=89 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=17.7 KiB, count=181, average=100 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:348: size=3696 B, count=7, average=528 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:552: size=3264 B, count=6, average=544 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:885: size=2720 B, count=5, average=544 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:500: size=2664 B, count=57, average=47 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:365: size=2640 B, count=5, average=528 B
2019-01-24 14:56:24,163 - DEBUG - /usr/local/lib/python3.7/threading.py:917: size=2376 B, count=5, average=475 B
2019-01-24 14:56:24,163 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/naming.py:30: size=2318 B, count=12, average=193 B
2019-01-24 14:56:24,305 - DEBUG - releasing lock 'lock_update' for reset
2019-01-24 14:58:01,482 - INFO - worker #7 received job #0
2019-01-24 14:58:01,483 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-24 14:58:01,483 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-24 14:58:01,483 - INFO - updating model with new documents
2019-01-24 14:58:01,483 - INFO - using 100 extra samples and 2 power iterations
2019-01-24 14:58:01,483 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-24 14:58:02,027 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-24 14:58:04,532 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 14:59:00,483 - DEBUG - running 2 power iterations
2019-01-24 14:59:04,533 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:00:04,186 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:01:02,732 - INFO - 2nd phase: running dense svd on (400, 2000) matrix
2019-01-24 15:01:03,850 - INFO - computing the final decomposition
2019-01-24 15:01:03,850 - INFO - keeping 300 factors (discarding 10.982% of energy spectrum)
2019-01-24 15:01:15,395 - INFO - processed sparse job of 2000 documents
2019-01-24 15:01:15,398 - DEBUG - releasing lock 'lock_update' for processjob
2019-01-24 15:11:58,263 - DEBUG - acquiring lock 'lock_update' for getstate
2019-01-24 15:11:58,314 - DEBUG - acquired lock 'lock_update' for getstate
2019-01-24 15:11:58,314 - INFO - worker #7 returning its state after 1 jobs
2019-01-24 15:11:58,314 - DEBUG - releasing lock 'lock_update' for getstate
2019-01-24 15:11:59,704 - INFO - worker #7 stopping asking for jobs
2019-01-24 15:13:31,763 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 15:13:31,764 - INFO - resetting worker #7
2019-01-24 15:13:34,022 - DEBUG - [ Top 10 ]
2019-01-24 15:13:34,022 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=1144 MiB, count=2, average=572 MiB
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680847, average=63 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=70.6 KiB, count=130, average=556 B
2019-01-24 15:13:34,023 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 15:13:34,023 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=32.3 KiB, count=460, average=72 B
2019-01-24 15:13:34,023 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=25.3 KiB, count=463, average=56 B
2019-01-24 15:13:34,023 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=19.2 KiB, count=196, average=100 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/message.py:96: size=12.3 KiB, count=135, average=93 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsi_worker.py:129: size=11.4 KiB, count=30, average=387 B
2019-01-24 15:13:34,023 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:425: size=6696 B, count=45, average=149 B
2019-01-24 15:13:34,178 - DEBUG - releasing lock 'lock_update' for reset
2019-01-24 15:14:15,130 - INFO - worker #7 received job #1
2019-01-24 15:14:15,130 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-24 15:14:15,130 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-24 15:14:15,130 - INFO - updating model with new documents
2019-01-24 15:14:15,131 - INFO - using 100 extra samples and 2 power iterations
2019-01-24 15:14:15,131 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-24 15:14:15,616 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-24 15:14:18,037 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:15:12,591 - DEBUG - running 2 power iterations
2019-01-24 15:15:16,596 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:16:18,061 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:17:15,299 - INFO - 2nd phase: running dense svd on (400, 2000) matrix
2019-01-24 15:17:16,504 - INFO - computing the final decomposition
2019-01-24 15:17:16,504 - INFO - keeping 300 factors (discarding 11.928% of energy spectrum)
2019-01-24 15:17:29,885 - INFO - processed sparse job of 2000 documents
2019-01-24 15:17:29,889 - DEBUG - releasing lock 'lock_update' for processjob
2019-01-24 15:28:41,883 - DEBUG - acquiring lock 'lock_update' for getstate
2019-01-24 15:28:41,884 - DEBUG - acquired lock 'lock_update' for getstate
2019-01-24 15:28:41,884 - INFO - worker #7 returning its state after 2 jobs
2019-01-24 15:28:41,884 - DEBUG - releasing lock 'lock_update' for getstate
2019-01-24 15:28:43,161 - INFO - worker #7 stopping asking for jobs
2019-01-24 15:30:14,545 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 15:30:14,545 - DEBUG - acquired lock 'lock_update' for reset
2019-01-24 15:30:14,546 - INFO - resetting worker #7
2019-01-24 15:30:16,773 - DEBUG - [ Top 10 ]
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=2289 MiB, count=4, average=572 MiB
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680847, average=63 B
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=78.9 KiB, count=175, average=462 B
2019-01-24 15:30:16,774 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 15:30:16,774 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=41.0 KiB, count=583, average=72 B
2019-01-24 15:30:16,774 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:452: size=37.6 KiB, count=339, average=114 B
2019-01-24 15:30:16,774 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=34.4 KiB, count=629, average=56 B
2019-01-24 15:30:16,775 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/socketutil.py:151: size=22.9 KiB, count=39, average=600 B
2019-01-24 15:30:16,775 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=18.7 KiB, count=187, average=102 B
2019-01-24 15:30:16,775 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/message.py:96: size=18.7 KiB, count=214, average=89 B
2019-01-24 15:30:16,914 - DEBUG - releasing lock 'lock_update' for reset
2019-01-24 15:31:26,012 - INFO - worker #7 received job #2
2019-01-24 15:31:26,013 - DEBUG - acquiring lock 'lock_update' for processjob
2019-01-24 15:31:26,013 - DEBUG - acquired lock 'lock_update' for processjob
2019-01-24 15:31:26,013 - INFO - updating model with new documents
2019-01-24 15:31:26,013 - INFO - using 100 extra samples and 2 power iterations
2019-01-24 15:31:26,013 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-24 15:31:26,506 - INFO - orthonormalizing (500000, 400) action matrix
2019-01-24 15:31:29,215 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:32:28,062 - DEBUG - running 2 power iterations
2019-01-24 15:32:32,380 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:33:32,141 - DEBUG - computing QR of (500000, 400) dense matrix
2019-01-24 15:34:28,524 - INFO - 2nd phase: running dense svd on (400, 2000) matrix
2019-01-24 15:34:29,620 - INFO - computing the final decomposition
2019-01-24 15:34:29,621 - INFO - keeping 300 factors (discarding 11.348% of energy spectrum)
2019-01-24 15:34:42,370 - INFO - processed sparse job of 2000 documents
2019-01-24 15:34:42,374 - DEBUG - releasing lock 'lock_update' for processjob
2019-01-24 15:45:23,253 - DEBUG - acquiring lock 'lock_update' for getstate
2019-01-24 15:45:23,253 - DEBUG - acquired lock 'lock_update' for getstate
2019-01-24 15:45:23,253 - INFO - worker #7 returning its state after 3 jobs
2019-01-24 15:45:23,254 - DEBUG - releasing lock 'lock_update' for getstate
2019-01-24 15:45:24,522 - INFO - worker #7 stopping asking for jobs
2019-01-24 15:46:56,122 - DEBUG - acquiring lock 'lock_update' for reset
2019-01-24 15:46:56,122 - DEBUG - acquired lock 'lock_update' for reset
2019-01-24 15:46:56,122 - INFO - resetting worker #7
2019-01-24 15:46:58,404 - DEBUG - [ Top 10 ]
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:254: size=3433 MiB, count=6, average=572 MiB
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/util.py:448: size=102 MiB, count=1680847, average=63 B
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:185: size=85.2 KiB, count=155, average=563 B
2019-01-24 15:46:58,405 - DEBUG - <frozen importlib._bootstrap_external>:525: size=63.5 KiB, count=730, average=89 B
2019-01-24 15:46:58,405 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:65: size=40.8 KiB, count=580, average=72 B
2019-01-24 15:46:58,405 - DEBUG - /usr/local/lib/python3.7/tracemalloc.py:185: size=34.4 KiB, count=629, average=56 B
2019-01-24 15:46:58,405 - DEBUG - /usr/local/lib/python3.7/abc.py:143: size=18.7 KiB, count=187, average=102 B
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/message.py:96: size=12.4 KiB, count=140, average=91 B
2019-01-24 15:46:58,405 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/Pyro4/core.py:425: size=7968 B, count=52, average=153 B
2019-01-24 15:46:58,406 - DEBUG - /data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/lsimodel.py:255: size=7440 B, count=6, average=1240 B
2019-01-24 15:46:58,537 - DEBUG - releasing lock 'lock_update' for reset

If I let it run for a couple more hours, I'm quite sure it will keep increasing. I will kill it now though and try downgrading to 0.13.4 to see if this issue goes away.

How much gain do you see from using distributed LSI? How many workers / how many hours, compared to a simple serial LsiModel(mm_corpus)?

Quite a bit of gain, but I don't have exact numbers handy. In this run I am using 8 workers, but depending on the other parameters (BOW dimension, number of topics, etc.), I have used as few as 4 and as many as 16.

robguinness commented 5 years ago

FYI, downgrading to gensim 0.13.4 also causes downgrades to numpy-1.16.0 and scipy-1.2.0. Just a note that if the problem goes away, it might be hard to trace down the cause, since we won't know if it was a bug in the more recent releases of these packages.

robguinness commented 5 years ago

I got an error when trying to run under gensim 0.13.4:

[2019-01-24 16:52:57,336] [13890] [root] [INFO] {main:312} Start - generating model
[2019-01-24 16:52:57,336] [13890] [gensim.utils] [INFO] {load:247} loading Phraser object from /data/r_and_d/teqmine-ai/data/phraser
[2019-01-24 16:52:57,336] [13890] [smart_open.smart_open_lib] [DEBUG] {smart_open:176} {'uri': '/data/r_and_d/teqmine-ai/data/phraser', 'mode': 'rb', 'kw': {}}
Traceback (most recent call last):
  File "scripts/main_model_create.py", line 337, in <module>
    main()
  File "scripts/main_model_create.py", line 314, in main
    phraser = get_phraser()
  File "scripts/main_model_create.py", line 118, in get_phraser
    phraser = Phraser.load(phraser_path)
  File "/data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/utils.py", line 251, in load
    obj = unpickle(fname)
  File "/data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/utils.py", line 916, in unpickle
    return _pickle.load(f, encoding='latin1')
AttributeError: Can't get attribute 'original_scorer' on <module 'gensim.models.phrases' from '/data/r_and_d/teqmine-ai/.venv/lib/python3.7/site-packages/gensim/models/phrases.py'>

It seems that the phraser that was built with gensim 3.6.0 is not compatible with 0.13.4. I would have to regenerate it, which for the whole 7 million documents will take a while. So please give me a few days.

horpto commented 5 years ago

@robguinness I think the problem is not in gensim itself, but in libatlas. Of course, copying a big array causes large allocations, but tracemalloc shows Python allocations only, so if the memory leak is in another library we will not see it.

robguinness commented 5 years ago

Of course, copying a big array causes large allocations, but tracemalloc shows Python allocations only, so if the memory leak is in another library we will not see it.

@horpto I'm not sure I completely follow you. The memory usage of Python is growing in the worker, as can be seen with tracemalloc in the output above, which is unexpected behavior since the worker should be flushing all the data it holds after each job. It may not be a "memory leak" in the classical sense, but it still causes the Python memory usage to grow with each worker job.

piskvorky commented 5 years ago

What do these tracemalloc numbers mean? If it's just a cumulative sum of memory allocated at a particular line, that may be perfectly OK as long as the memory gets released later. Does an increasing number really imply a leak?

What does top / RAM monitor show for RSS RAM when training with MmCorpus in serial mode?

@robguinness Before downgrading, I'd urge you to try training with MmCorpus (it's streamed, fast and doesn't load all docs into RAM). No ShardedCorpus. Especially if you can reproduce the issue in serial LSI mode too, that simplifies testing and replication tremendously.

Phraser seems unrelated to LSI, where is that coming from? Let's try to get to a minimal example, removing the moving pieces unrelated to the problem. Just the input (MmCorpus) + training (LsiModel), please.
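
Something like this stripped-down script might be what's meant (a sketch; the path and parameters are illustrative):

import logging

from gensim.corpora import MmCorpus
from gensim.models import LsiModel

logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s", level=logging.DEBUG)

# MmCorpus streams documents from disk, so the full corpus never sits in RAM.
mm_corpus = MmCorpus("/data/tmp/corpus.mm")

# Serial LSI: no dispatcher, no workers, same training math as distributed mode.
lsi_model = LsiModel(mm_corpus, num_topics=300, chunksize=5000)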

horpto commented 5 years ago

@robguinness ok, you're right. I've tested tracemalloc on this example:

import tracemalloc as tm

def test():
  return  [i*i for i in range(1_000_000)]

tm.start()

snap1 = tm.take_snapshot()

a=test()

snap2 = tm.take_snapshot()
top_stats = snap2.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)

print("DIFF")
top_stats = snap2.compare_to(snap1, 'lineno')
for stat in top_stats[:10]:
    print(stat)

a=test()

snap3 = tm.take_snapshot()
top_stats = snap3.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)

print("DIFF")
top_stats = snap3.compare_to(snap2, 'lineno')
for stat in top_stats[:10]:
    print(stat)

print("DIFF2")
top_stats = snap3.compare_to(snap1, 'lineno')
for stat in top_stats[:10]:
    print(stat)

horpto commented 5 years ago

If you turn off distributed mode, does the memory issue persist? (I think yes, it does.) Then one suspicious part of the code, for me, is qr_destroy. I'd mock it, for example, to just return an array with random data, without any real calculations.
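
One possible way to do that in serial mode (a sketch; it assumes gensim.matutils.qr_destroy takes a one-element list holding the input matrix and returns a (Q, R) pair, so check against your gensim version first):

import numpy as np

from gensim import matutils

def fake_qr_destroy(la):
    # Stand-in for matutils.qr_destroy: skip the real QR and hand back random
    # matrices of plausible shapes, to see whether memory still grows.
    m, n = la[0].shape
    del la[0]  # mimic the destructive behaviour of the original
    return np.random.rand(m, n), np.random.rand(n, n)

# A monkeypatch like this only affects the current process, hence serial mode.
matutils.qr_destroy = fake_qr_destroy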

robguinness commented 5 years ago

I'm out sick today, so I'll probably get back to troubleshooting on Monday. One small comment for @piskvorky...you are right, Phraser is completely unrelated to this. We were unnecessarily loading the serialized Phraser object in our code. I can just get rid of that line completely.

Thanks for the help and suggestions from everyone.

robguinness commented 5 years ago

Hi all, I have tentatively tracked the memory leak down to scipy/numpy. I'm using an older release of both libraries now, and the problem seems to have gone away. I will try to reproduce it and narrow down the cause more conclusively, but at the moment we need to get this model built, so the machine is unavailable.

piskvorky commented 5 years ago

No problem. Make sure to replace ShardedCorpus with MmCorpus in any case.