def parallelize_preprocess(func, iterator, processes, progress_bar=False):
    iterator = tqdm(iterator) if progress_bar else iterator
    if processes <= 1:
        return map(func, iterator)
    return Parallel(n_jobs=processes)(delayed(func)(line) for line in iterator)
We used the truecaser for some of our corpora with >8M segments. There are some issues when training a truecaser for larger corpora:
`joblib.Parallel` causes a huge memory footprint even when used with a single process, i.e. >32GB of memory for our 8M corpus. In our particular case we fixed the problem by using `map` instead of `Parallel` for single processes in this function: https://github.com/alvations/sacremoses/blob/f3780b392368ba09106098354aca706f8476cdb6/sacremoses/util.py#L169-L171
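
To illustrate why the `map` path can be lighter, here is a minimal stdlib-only sketch. It assumes that part of the footprint comes from results being collected all at once: `Parallel(...)(...)` returns a complete list by default, while `map` yields results one at a time. The names `count_calls`, `segments` and `tokenize` are hypothetical stand-ins, not part of sacremoses, and this does not measure or reproduce the actual memory usage we saw.

```python
def count_calls(func):
    """Wrap func so we can see how many inputs have been processed so far."""
    def wrapper(x):
        wrapper.calls += 1
        return func(x)
    wrapper.calls = 0
    return wrapper


def segments(n):
    """Hypothetical stand-in for a large corpus: yields one segment at a time."""
    for i in range(n):
        yield "segment %d" % i


# Hypothetical per-line function; in the truecaser this would be the real preprocessing.
tokenize = count_calls(str.split)

# map() is lazy: nothing is processed until the caller iterates.
lazy = map(tokenize, segments(1000))
calls_before = tokenize.calls        # no segment has been touched yet
first = next(lazy)                   # processes exactly one segment
calls_after_first = tokenize.calls

# An eagerly built list keeps every result in memory at the same time,
# which is the shape of what Parallel(...)(...) hands back by default.
eager = [tokenize(s) for s in segments(1000)]
```

With the lazy version, a caller that consumes results in a loop only ever holds the current item, so memory stays roughly flat regardless of corpus size.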