Open · aflueckiger opened this issue 6 years ago
The complexity (and thus memory requirements) of the algorithm scales with the number of nodes in the graph. For a corpus similar to yours, the number of nodes is dominated by the number of word-types. One pragmatic way to substantially reduce this number is to impose a minimum number of occurrences for each word-type to be included in the graph. This amounts to filtering the large number of low-frequency words which should not contain much information in any case.
I added an optional argument n_min to make_graph (e.g. make_graph(texts, n_min=10)) in order to filter out many of the word types (you can choose other values for n_min). This should allow you to analyze the corpus with 20GB of memory.
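For illustration, the usage would look roughly like this (a sketch, assuming the sbmtm wrapper from this repository and `texts` as a list of tokenized documents):

```python
# Minimal sketch: build the word-document graph keeping only word-types
# that occur at least 10 times, then fit the model.
# `texts` is assumed to be a list of tokenized documents (list of lists of strings).
from sbmtm import sbmtm

model = sbmtm()
model.make_graph(texts, n_min=10)
model.fit()
```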
I am looking into other options to reduce memory usage in the future.
Best, Martin
Thanks for the quick answer and the code amendment. However, reducing the size within reason doesn't help unless I stick to toy models. More specifically, I had already filtered out words so drastically by myself that the remaining corpus is no longer of any practical use. I set extreme thresholds for the term-document frequency (min_df=2000, max_df=0.7), which gives me the following numbers.
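To give a concrete picture, this kind of filtering can be sketched with scikit-learn's CountVectorizer (an illustration, not my exact pipeline):

```python
# Illustrative sketch of document-frequency filtering with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    min_df=2000,  # keep only terms appearing in at least 2000 documents
    max_df=0.7,   # drop terms appearing in more than 70% of documents
)
X = vectorizer.fit_transform(raw_documents)  # raw_documents: list of strings
print(X.shape)  # (n_docs, n_remaining_types)
```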
I wonder how you processed the Twitter corpus listed in your paper (docs: 10,000, types: 12,258, words: 196,625). Something on my side doesn't seem to be in line with these results. Hence the following questions: Are the edges (i.e. tokens) really not significant for the complexity (they are an order of magnitude higher here)? Otherwise, I can only think of something not working properly in the current Docker build, unless you used a powerful server for these computations. Am I missing something?
You are right, the memory also depends on the number of edges (word tokens). Therefore, for a very large corpus, you would probably need to run the code on a cluster.
However, the Twitter corpus we used in the paper takes <~2GB of memory, so this should be no problem with 20GB of memory. I tried to construct a corpus similar in size to the one you described by taking 10,000 books from Project Gutenberg. Setting a minimum threshold of 10-100 occurrences for each word yields a manageable corpus that I can run on my laptop.
So I am not completely sure where your problem stems from, since I haven't used the Docker build. Perhaps you can find additional information in the graph-tool GitLab issues: https://git.skewed.de/count0/graph-tool/issues
Beyond all this, Tiago (graph-tool) suggested trying the following to reduce the memory usage (not yet implemented in the topic-model wrapper):
The main reason why minimize_nested_blockmodel_dl() uses a lot of RAM is that it keeps several simultaneous copies of the global state while searching for the optimal number of groups. A much more economical approach is to use a single state the whole time. You do this by initializing a single NestedBlockState, where each node initially belongs to its own group. You then equilibrate an MCMC by repeatedly calling:
state.multiflip_mcmc_sweep(niter=n)
with n=10 or 100 or so. This attempts to merge and split groups, so it gives good results. In practice, minimize_nested_blockmodel_dl() still gives better results in many cases, but they are comparable. With this approach the memory use is far lower (a factor of 3 to 4, or so). (Remember that using weighted graphs, via the 'eweight' parameter, will also help reduce memory use.)
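In code, this would look roughly as follows (a sketch, not yet part of the wrapper; `g` and `ecount` are placeholders for the word-document graph and its token-count edge property map, and argument details may need adjusting to your graph-tool version):

```python
# Rough sketch of the single-state approach described above.
# `g`: word-document graph; `ecount`: edge property map with token counts (placeholders).
import numpy as np
import graph_tool.all as gt

state = gt.NestedBlockState(g, state_args=dict(eweight=ecount))

# Equilibrate with merge-split moves using a single nested state, instead of
# keeping several copies of the global state; beta=np.inf makes the sweeps greedy.
for _ in range(100):
    state.multiflip_mcmc_sweep(niter=10, beta=np.inf)

print(state.entropy())  # description length of the resulting hierarchical partition
```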
I really appreciate your support. I installed graph-tool locally and did some further experiments. The RAM consumption is still extremely high and seems to differ from your observations. Do I understand correctly that you successfully processed 10,000 books on a laptop just by filtering out word types occurring fewer than 100 times? That must have resulted in a much bigger corpus than the one I am trying to model.
I managed to process an extremely filtered corpus successfully (min_df=1000, max_df=0.6); however, it still takes 17GB of RAM. So something still doesn't add up.
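For comparison, one way to report peak memory consistently (a sketch; ru_maxrss is in kilobytes on Linux, bytes on macOS):

```python
# Read off peak memory of the current process after fitting.
import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024**2:.1f} GB")  # Linux: ru_maxrss is in kilobytes
```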
Currently, I don't have the expertise to readily extend the code along the lines of your additional suggestions, as I work primarily in computational linguistics and computational social science. I would be happy to test such an implementation, though.
Moreover, could you share the script of your experiment (including the corpus loading, from nltk?) to make sure we are on the same page? :)
Here is a notebook in which I analyze 1,000 books from Project Gutenberg with a minimum threshold n_min=100 (so only words that occur at least 100 times are included in the corpus). This takes about 1GB and runs for a few minutes. I guess this would easily work with 10,000 documents, but I didn't want to wait.
I am looking into other methods to further reduce memory usage (along the lines of what I wrote above) and hope to be able to include some of these features at some point.
Let me know if you have further questions.
Attachment: dev_20181017_demo-for-PG-corpus.html.tar.gz
The corpus you tested with is quite different from mine. In your example, only 1,000 documents are considered, each with a maximum length of 1,000 tokens, which results in a relatively small corpus. Moreover, a minimum term frequency of 100 is very restrictive given that the corpus has only 1,000 documents (a word needs to occur in every 10th document on average, so smaller "topics" disappear). For practical applications, these requirements are not appropriate.
Modeling bigger corpora may not be possible with the current implementation unless one has access to a cluster. The algorithm used has a complexity of O(V ln² V), which explains the memory issue with the bigger corpus: https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference.minimize.minimize_nested_blockmodel_dl
I only see the following options:
Sampling only a subset of the words in each document would jeopardize the explanatory power, so this is not an option for my purposes.
Do you have other suggestions? Thank you.
I wonder whether there is scope to implement an alternative inference algorithm for hSBM such as the variational approach of Park and Bader (https://arxiv.org/pdf/1711.05150.pdf)
With a recent graph-tool version (2.43 or above) the memory requirements should be significantly improved (as well as speed). Please try again!
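A quick way to check which version is installed:

```python
# Verify that the installed graph-tool is 2.43 or newer.
import graph_tool

print(graph_tool.__version__)
```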
Thanks for this new intriguing method. A few toy models showed interesting and consistent patterns, so I would like to use it on a real corpus. When applying the method to a medium-sized corpus, however, I run into memory issues (20GB RAM + 20GB swap file). The properties of the corpus are as follows:
docs (unfiltered): 7897
tokens (unfiltered): 9273368
types (unfiltered): 67485
Following your paper, the algorithm should be scalable. Yet even when drastically reducing the size of the graph by filtering out nodes (docs + words) as well as edges (tokens), I still encounter the same issues.
What's the exact issue? Is there a way to make it more scalable?
Edit: I use the graph-tool Docker container.