Truecaser crashes for large corpora (>8M segments)

pypae commented 5 years ago

We used the truecaser for some of our corpora with >8M segments. There are some issues when training a truecaser for larger corpora:

Using joblib.Parallel causes a huge memory footprint even with a single process, e.g. >32GB of memory for our 8M-segment corpus.
The training never seems to finish (we cancelled it after 24h), even though the progress bar completes after about 20 minutes.

In our particular case we fixed the problem by using map instead of Parallel for single processes in this function:

https://github.com/alvations/sacremoses/blob/f3780b392368ba09106098354aca706f8476cdb6/sacremoses/util.py#L169-L171

    from joblib import Parallel, delayed
    from tqdm import tqdm

    def parallelize_preprocess(func, iterator, processes, progress_bar=False):
        iterator = tqdm(iterator) if progress_bar else iterator
        if processes <= 1:
            # Single process: plain (lazy) map, bypassing joblib entirely.
            return map(func, iterator)
        return Parallel(n_jobs=processes)(delayed(func)(line) for line in iterator)
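
One side effect of this change is worth keeping in mind: in Python 3, map returns a lazy iterator, whereas calling Parallel(...) on a generator returns a fully materialized list. The stdlib-only sketch below illustrates the lazy behaviour. The names fake_tokenize, lines and calls are hypothetical stand-ins for illustration, not part of sacremoses.

```python
from itertools import islice

calls = []  # records which lines have actually been processed

def fake_tokenize(line):
    """Hypothetical stand-in for the per-line preprocessing function."""
    calls.append(line)
    return line.split()

def lines():
    """Generator standing in for a large corpus read line by line."""
    for i in range(1_000_000):
        yield f"segment number {i}"

# map builds a lazy iterator: no line has been processed at this point.
lazy = map(fake_tokenize, lines())
processed_before_consuming = len(calls)

# Pulling three results processes exactly three lines, not the whole corpus.
first_three = list(islice(lazy, 3))
processed_after_consuming = len(calls)
```

Because the result is an iterator, any caller that needs to index into it or iterate over it twice would have to wrap it in list(...) first, which brings back the cost of holding all results in memory.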

alvations commented 5 years ago

Thanks @Patdue for reporting this! Let me dig into joblib to see how they handle large files =)

alvations commented 5 years ago

Resolved, cf. #59

hplt-project / sacremoses

Truecaser crashes for large corpora (>8M segments) #55