srlearn / rnlp

Relational NLP: Convert text into relational facts.
https://rnlp.readthedocs.io/en/latest/
GNU General Public License v3.0

Parallelism for makeIdentifiers #11

Open hayesall opened 6 years ago

hayesall commented 6 years ago

A large amount of the running time tends to be spent in parse.makeIdentifiers(), which is essentially a triple-nested for loop over blocks, sentences, and words.
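For reference, the loop shape described above can be sketched roughly as follows. This is a minimal sketch and not the actual body of parse.makeIdentifiers(): the function name `make_identifiers_serial`, the nested-list layout of `blocks`, and the tuple format of the emitted facts are all assumptions made for illustration.

```python
def make_identifiers_serial(blocks):
    """Walk blocks -> sentences -> words, emitting one ID tuple per word.

    Assumed (hypothetical) layout: `blocks` is a list of blocks, each block
    is a list of sentences, and each sentence is a list of word strings.
    """
    facts = []
    block_id = 1
    for block in blocks:
        for sentence_id, sentence in enumerate(block, start=1):
            for word_id, word in enumerate(sentence, start=1):
                facts.append((block_id, sentence_id, word_id, word))
        # The block counter advances at the end of the outer loop.
        block_id += 1
    return facts


# Tiny hypothetical corpus: two blocks, three sentences, four words.
blocks = [
    [["the", "cat"], ["sat"]],
    [["hello"]],
]
facts = make_identifiers_serial(blocks)
```

Every block is handled one after another on a single core, so the total time grows with the total number of words in the corpus.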

Previously this was "resolved" by wrapping the outer loop with tqdm to estimate how long the process would take. This did not make anything faster, but a progress bar likely would make someone feel better about the situation.


joblib may be a viable way to execute the outer loop in parallel:

from joblib import Parallel, delayed
from tqdm import tqdm

def foo(block, blockID):
    """
    :param block: The current block to be processed (list of lists).
    :param blockID: Index of the current block (int).
    """
    return [blockID]

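# Placeholder data; in practice this is the list of blocks generated earlier.
Blocks = list(range(5000))
# Each block is sent to a worker along with its index; tqdm tracks dispatched tasks.
facts = Parallel(n_jobs=-1)(delayed(foo)(block, i) for i, block in enumerate(tqdm(Blocks)))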

In the short example above, "Blocks" would in reality be the list of blocks generated earlier. foo(block, blockID) would be similar to the current parse.makeIdentifiers() method, except that blockID is passed in as a parameter instead of being a counter that increments at the end of the outer loop.
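A slightly fuller sketch of that idea is below, assuming each block is a list of sentences and each sentence is a list of words. The names `identifiers_for_block` and `make_identifiers_parallel`, and the tuple format of the facts, are hypothetical and not taken from the rnlp code base. The point is only that the per-block function depends on nothing but its arguments, so joblib can run the blocks independently.

```python
from joblib import Parallel, delayed


def identifiers_for_block(block, block_id):
    """Return ID tuples for one block, using only the arguments passed in."""
    return [
        (block_id, sentence_id, word_id, word)
        for sentence_id, sentence in enumerate(block, start=1)
        for word_id, word in enumerate(sentence, start=1)
    ]


def make_identifiers_parallel(blocks, n_jobs=-1):
    """Process each block as a separate joblib task, then flatten the results."""
    per_block = Parallel(n_jobs=n_jobs)(
        delayed(identifiers_for_block)(block, block_id)
        for block_id, block in enumerate(blocks, start=1)
    )
    # Parallel returns results in input order, so flattening them gives the
    # same sequence a serial loop over the blocks would produce.
    return [fact for block_facts in per_block for fact in block_facts]


# Tiny hypothetical corpus: two blocks, three sentences, four words.
blocks = [
    [["the", "cat"], ["sat"]],
    [["hello"]],
]
facts = make_identifiers_parallel(blocks, n_jobs=2)
```

Because the block ID comes from enumerate() rather than a shared counter, no state has to be synchronized between workers. For very small blocks, the cost of dispatching tasks may outweigh the gain, so it is probably worth measuring on a real corpus.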

hayesall commented 6 years ago

Current progress is on batflyer/rnlp (parallel). I did a short round of testing to estimate the sort of performance gains that we might expect, graphed below.

Both plots were produced from the same corpus, and all runs were on my local machine.

[Figure: time_vs_cores]