Open hayesall opened 6 years ago
Current progress is on batflyer/rnlp (parallel). I did a short round of testing to estimate the sort of performance gains that we might expect, graphed below.
Both plots come from tests on the same corpus, run on my local machine.
[Plots: `blockSize=1` and `blockSize=2`]
A large amount of the running time tends to be spent in `parse.makeIdentifiers()`, which is essentially a triple-nested `for` loop over blocks, sentences, and words.

Previously this was "resolved" by wrapping the outer loop with `tqdm` to estimate how long the process would take. This did not actually change anything but likely would make someone feel better about the situation.

`joblib` may be a viable way to execute the outer loop in parallel, with a call that maps `foo(block, blockID)` over "Blocks". Here "Blocks" would in reality be the list of blocks generated earlier. `foo(block, blockID)` would be something similar to the current `parse.makeIdentifiers()` method, but with `blockID` passed as a parameter rather than kept as an integer that increments at the end of the outer loop.
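
A minimal sketch of that idea, assuming each block is a list of sentences and each sentence is a list of words. The names `make_block_identifiers` and `make_identifiers_parallel` are hypothetical stand-ins (the first plays the role of `foo`), and the tuples returned here are placeholders, not what `parse.makeIdentifiers()` actually produces:

```python
from joblib import Parallel, delayed


def make_block_identifiers(block, block_id):
    """Hypothetical per-block worker (the `foo(block, blockID)` above).

    Assumes `block` is a list of sentences, each a list of words.
    """
    facts = []
    for sentence_id, sentence in enumerate(block, start=1):
        for word_id, word in enumerate(sentence, start=1):
            # Placeholder output: the real method would build its own identifiers.
            facts.append((block_id, sentence_id, word_id, word))
    return facts


def make_identifiers_parallel(blocks, n_jobs=-1):
    # One task per block. enumerate() supplies block_id up front, so no
    # counter has to be incremented at the end of a shared outer loop.
    results = Parallel(n_jobs=n_jobs)(
        delayed(make_block_identifiers)(block, block_id)
        for block_id, block in enumerate(blocks, start=1)
    )
    # Parallel returns results in submission order, so block order is kept.
    return [fact for block_facts in results for fact in block_facts]
```

Because each task receives its own `block_id`, the workers share no mutable state. Whether this beats the sequential loop probably depends on block size, since very small blocks may be dominated by process start-up and pickling overhead.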