hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Truecaser crashes for large corpora (>8M segments) #55

Closed pypae closed 5 years ago

pypae commented 5 years ago

We used the truecaser on some of our corpora with more than 8M segments and ran into issues when training it on corpora of that size.

In our particular case, we fixed the problem by using the built-in map instead of Parallel when only a single process is requested, in this function:

https://github.com/alvations/sacremoses/blob/f3780b392368ba09106098354aca706f8476cdb6/sacremoses/util.py#L169-L171

 def parallelize_preprocess(func, iterator, processes, progress_bar=False):
     iterator = tqdm(iterator) if progress_bar else iterator
     if processes <= 1:
         return map(func, iterator)
     return Parallel(n_jobs=processes)(delayed(func)(line) for line in iterator)

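
To make the single-process path easier to try in isolation, here is a minimal sketch of the same idea. The names `parallelize_preprocess_sketch` and `count_tokens` are hypothetical and not part of sacremoses, the progress bar is left out, and the deferred joblib import is an assumption made only so that the single-process path runs without joblib installed. It shows that `map` returns a lazy iterator, so results are produced one at a time as the caller consumes them.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallelize_preprocess_sketch(
    func: Callable[[T], R],
    iterator: Iterable[T],
    processes: int,
) -> Iterable[R]:
    """Hypothetical variant of the function above, without the progress bar."""
    if processes <= 1:
        # map() is lazy: each result is computed only when the caller asks
        # for it, so no worker pool is started for a single process.
        return map(func, iterator)
    # Assumption of this sketch: joblib is imported only on the
    # multi-process path.
    from joblib import Parallel, delayed

    return Parallel(n_jobs=processes)(delayed(func)(line) for line in iterator)


def count_tokens(line: str) -> int:
    """Hypothetical per-line function: count whitespace-separated tokens."""
    return len(line.split())


# A generator stands in for a large corpus that is streamed line by line.
lines = (f"segment number {i}" for i in range(5))
results = parallelize_preprocess_sketch(count_tokens, lines, processes=1)
total = sum(results)  # consumes the lazy iterator one item at a time
```

Note that with `processes <= 1` the return value is an iterator rather than a list, so callers that need to index into the results or iterate twice would have to wrap it in `list(...)` first.
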
alvations commented 5 years ago

Thanks @Patdue for reporting this! Let me dig into joblib to see how they handle large files =)

alvations commented 5 years ago

Resolved, cf. #59