liu-yushan / TLogic


Memory Usage for Datasets GDELT and WIKI #2

Closed JuliaGast closed 9 months ago

JuliaGast commented 2 years ago

Hello, I was trying to run TLogic on the datasets GDELT (Leetaru and Schrodt, 2013) and WIKI (Leblay and Chekol, 2018), which can be found e.g. here: https://github.com/INK-USC/RE-Net/tree/master/data.

|                   | GDELT   | WIKI   |
|-------------------|---------|--------|
| num_triples_train | 1734399 | 539286 |
| num_triples_valid | 238765  | 67538  |
| num_triples_test  | 305241  | 63110  |
| num_timesteps     | 2975    | 231    |

For both datasets, I have issues when running apply.py. Either I run into memory issues when using multiple processes (e.g., >1.5 TB for 10 processes on WIKI with window size 200), or into runtime issues (e.g., no results after more than 20 days with 1 process on WIKI and window size 200). I am able to get results for WIKI with 8 processes and window size 10.

Have you tried running TLogic on these datasets?

Is there any other parameter that I could set to improve memory usage (besides using fewer processes, which leads to very long runtimes, or very small window sizes, which potentially lead to lower scores)?

This question is especially important when I want to try multi-step prediction (i.e., w = -1), because then I cannot set the window size and will thus always run into memory issues.

Do you know any solution to this problem? Or are these datasets simply not suited for TLogic?

Kind regards and thank you for your reply,
Julia

liu-yushan commented 2 years ago

Hi Julia,

We did not try running TLogic on these two datasets. Besides running multiple processes and decreasing the window size, you could check the performance for shorter rules (lengths 1 and 2) or decrease the minimum number of candidates (which is probably already rather small if you did not change the default value).
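If you want to try shorter rules without re-learning them, a minimal sketch for filtering an existing rules file before passing it to apply.py could look like the following. This assumes the rules JSON produced by learn.py is a dict mapping each head relation to a list of rule dicts with a "body_rels" field; the file names are placeholders.

```python
import json

# Placeholder file names; adjust to your own rule files.
with open("rules.json") as f:
    rules = json.load(f)  # dict: head relation -> list of rule dicts

# Keep only rules of length 1 or 2 (assuming "body_rels" holds the body relations).
short_rules = {
    head_rel: [rule for rule in rule_list if len(rule["body_rels"]) <= 2]
    for head_rel, rule_list in rules.items()
}
# Drop head relations that are left without any rules.
short_rules = {k: v for k, v in short_rules.items() if v}

with open("rules_len_1_2.json", "w") as f:
    json.dump(short_rules, f)
```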

Since TLogic finds all possible body matchings, there might be scalability issues with large (or dense) datasets, so that direct application might not be feasible. One solution could be to apply only the top-k rules or to filter the rules based on a minimum confidence or minimum body support (see apply.py line 42). Another solution could be to take a subset (e.g., through sampling) of the matching edges (this requires modifications to get_window_edges, match_body_relations, or match_body_relations_complete in rule_application.py).
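A rough sketch of both ideas is given below. The rule-dict keys ("conf", "body_supp"), the thresholds, and the sampling helper are assumptions for illustration; they would have to be adapted to the actual rules format and wired into rule_application.py.

```python
import json
import numpy as np

MIN_CONF = 0.05       # assumed thresholds; tune for your dataset
MIN_BODY_SUPP = 10
TOP_K_RULES = 50

with open("rules.json") as f:  # placeholder file name
    rules = json.load(f)       # dict: head relation -> list of rule dicts

filtered = {}
for head_rel, rule_list in rules.items():
    # Keep only sufficiently confident / well-supported rules ...
    kept = [
        r for r in rule_list
        if r["conf"] >= MIN_CONF and r["body_supp"] >= MIN_BODY_SUPP
    ]
    # ... and at most the top-k rules per head relation (sorted by confidence).
    kept = sorted(kept, key=lambda r: r["conf"], reverse=True)[:TOP_K_RULES]
    if kept:
        filtered[head_rel] = kept

with open("rules_filtered.json", "w") as f:
    json.dump(filtered, f)


def sample_edges(edges, max_edges=100_000, seed=12):
    """Downsample an array of candidate edges (rows of quadruples).

    Stand-alone illustration of the sampling idea; inside TLogic it would
    have to be called from get_window_edges or match_body_relations.
    """
    if len(edges) <= max_edges:
        return edges
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(edges), size=max_edges, replace=False)
    return edges[idx]
```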

For multi-step prediction, I believe that you can also set a fixed time window, possibly increasing the time window with each prediction step.
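One way to implement that (purely a sketch; the window sizes are assumptions, and the commented-out usage with get_window_edges is hypothetical) would be to grow the window with every prediction step:

```python
BASE_WINDOW = 200   # assumed starting window size (in timesteps)
INCREMENT = 24      # assumed growth per prediction step

def window_for_step(step, base=BASE_WINDOW, increment=INCREMENT):
    """Return the time window to use for a given multi-step prediction step."""
    return base + step * increment

# Hypothetical usage: restrict the edges considered at each prediction step
# instead of using the unbounded window (w = -1).
# for step, query_ts in enumerate(prediction_timesteps):
#     window = window_for_step(step)
#     edges = get_window_edges(all_data, query_ts, learn_edges, window)
```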