jfilter / clean-text

🧹 Python package for text cleaning
946 stars 78 forks

Add multiprocessing #20

Open jdvala opened 2 years ago

jdvala commented 2 years ago

Cleaning text can be very time-consuming when the number of texts is large, so it would be really good if clean-text provided built-in multiprocessing.

It could be really simple: you provide a flag, and there is an option to pass a list of texts instead of a single text.

What do you think?

jdvala commented 2 years ago

I can help you do that if you agree.

jfilter commented 2 years ago

Hey @jdvala, this is a good idea. I would suggest using Python's multiprocessing, e.g. with a pool. What's your opinion on this?
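A minimal sketch of the pool idea, using only the standard library. `simple_clean` is a hypothetical stand-in for the real cleaning function, and `clean_many` is an illustrative name, not part of clean-text:

```python
import re
from multiprocessing import Pool


def simple_clean(text: str) -> str:
    """Hypothetical stand-in for the real cleaner: collapse whitespace, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_many(texts, processes=None):
    """Clean a list of texts in worker processes; results keep the input order."""
    if not texts:
        return []
    # processes=None lets Pool use os.cpu_count() workers.
    with Pool(processes=processes) as pool:
        return pool.map(simple_clean, texts)


if __name__ == "__main__":
    # The guard is needed on platforms that start workers with "spawn".
    print(clean_many(["  Hello   World ", "FOO\tbar"], processes=2))
```

For small inputs the cost of starting processes and pickling the texts will likely outweigh the gain, so this only pays off for large batches.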

jdvala commented 2 years ago

Hi @jfilter, I have a few questions that I would like to discuss before starting to implement this. If we enable multiprocessing, we need a list of texts and not just a single text; currently the clean function only accepts str.

Secondly, if a single text is large enough, then breaking it up and parallelizing the pieces also makes sense.
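One way the single-large-text case might look, as a rough sketch. It assumes the cleaning steps never need to look across a line break, which may not hold for every cleaning option. `split_into_chunks` and `clean_large_text` are hypothetical names; `map_fn` defaults to the built-in `map`, and a pool's `map` method could be passed instead to run the chunks in parallel:

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 10_000) -> List[str]:
    """Split text at line boundaries into chunks of at most roughly max_chars.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def clean_large_text(
    text: str,
    clean_fn: Callable[[str], str],
    map_fn=map,
    max_chars: int = 10_000,
) -> str:
    """Clean each chunk independently and join the results in order."""
    chunks = split_into_chunks(text, max_chars)
    return "".join(map_fn(clean_fn, chunks))
```

Because the chunks are joined in their original order, the result matches cleaning the whole text at once only when `clean_fn` treats each line independently.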

At this point I am unsure which of the two we should implement first.

jfilter commented 2 years ago

Hey @jdvala, in my opinion, the clean function should also accept a list of texts and then return a list of processed texts.

Then, we need a new parameter, e.g. n_jobs, to specify the maximum number of parallel jobs. This is how joblib does it. We may also use joblib to do the multiprocessing. Or take a look at https://github.com/Slimmer-AI/mpire since working with Python's multiprocessing feels clunky.
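A sketch of what such an interface could look like. This is not the actual clean-text API: `_clean_one` is a hypothetical stand-in for the existing single-text logic, and the standard library's `ProcessPoolExecutor` is used here in place of joblib or mpire so the example has no extra dependencies. It borrows joblib's convention that `n_jobs=1` means no parallelism and `n_jobs=-1` means all CPUs, but simplifies by rejecting other non-positive values:

```python
import os
import re
from concurrent.futures import ProcessPoolExecutor
from typing import List, Union


def _clean_one(text: str) -> str:
    """Hypothetical stand-in for the existing single-text cleaning logic."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean(text: Union[str, List[str]], n_jobs: int = 1) -> Union[str, List[str]]:
    """Clean one text, or a list of texts using up to n_jobs processes."""
    if n_jobs != -1 and n_jobs < 1:
        raise ValueError("n_jobs must be a positive integer or -1")
    if isinstance(text, str):
        # A single string keeps the current behaviour: str in, str out.
        return _clean_one(text)
    texts = list(text)
    if n_jobs == 1 or len(texts) < 2:
        # Serial path: avoids process start-up cost entirely.
        return [_clean_one(t) for t in texts]
    workers = (os.cpu_count() or 1) if n_jobs == -1 else n_jobs
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in input order.
        return list(executor.map(_clean_one, texts))
```

Keeping `n_jobs=1` as the default means existing callers that pass a single string see no change in behaviour. With joblib, the process-pool branch could instead be written with `Parallel(n_jobs=n_jobs)` and `delayed`.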