Open jdvala opened 2 years ago
I can help you do that if you agree.
Hey @jdvala, this is a good idea. I would suggest using Python's multiprocessing, e.g. with a pool. What's your opinion on this?
Hi @jfilter, I have a few questions that I would like to discuss before starting to implement this. If we enable multiprocessing, we need to accept a list of texts and not just a single text; currently the clean function only accepts a str. So do we add a new function, or do we make changes to the clean function itself? I would lean toward not changing clean, since people have gotten used to its current signature and changing it would break existing code; so, in my opinion, we add a clean_parallel function which calls the clean function.
Secondly, if a single text is large enough, then splitting it into chunks and processing those in parallel also makes sense.
At this point I am confused as to which we should implement first.
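To make the wrapper idea concrete, here is a minimal sketch of what a clean_parallel function calling clean could look like. The name clean_parallel comes from the discussion above; the clean body here is a stand-in for illustration only, since the real cleantext.clean takes many options:

```python
from multiprocessing import Pool


def clean(text: str) -> str:
    """Stand-in for cleantext.clean (illustrative only)."""
    return " ".join(text.lower().split())


def clean_parallel(texts, processes=None):
    """Apply clean() to a list of texts across a process pool.

    Keeps the existing clean() signature untouched, as discussed above.
    processes=None lets multiprocessing use all available cores.
    """
    with Pool(processes=processes) as pool:
        return pool.map(clean, texts)


if __name__ == "__main__":
    print(clean_parallel(["  Hello   WORLD ", "Foo\tBar"]))
    # → ['hello world', 'foo bar']
```

Because clean is defined at module level, it is picklable and can be dispatched to worker processes by Pool.map.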
Hey @jdvala, in my opinion, the clean function should also accept a list of texts and then return a list of processed texts.
Then we need a new parameter, e.g. n_jobs, to specify the maximum number of parallel jobs. This is how joblib does it. We may also use joblib to do the multiprocessing, or take a look at https://github.com/Slimmer-AI/mpire, since working with Python's multiprocessing directly feels clunky.
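A rough sketch of that n_jobs-style API, using only the standard library but mirroring joblib's convention that n_jobs=-1 means "use all cores" and n_jobs=1 means "run serially". The function name clean_texts and its internals are illustrative assumptions, not the library's actual API:

```python
import os
from multiprocessing import Pool


def clean(text: str) -> str:
    """Stand-in for cleantext.clean (illustrative only)."""
    return " ".join(text.lower().split())


def clean_texts(texts, n_jobs=1):
    """Clean a list of texts, with a joblib-style n_jobs parameter.

    n_jobs=1 runs serially (no pool overhead for small inputs);
    n_jobs=-1 uses all available cores, as in joblib.
    """
    if n_jobs == 1:
        return [clean(t) for t in texts]
    processes = os.cpu_count() if n_jobs == -1 else n_jobs
    with Pool(processes=processes) as pool:
        return pool.map(clean, texts)
```

The serial fast path matters in practice: spawning a pool for a handful of short strings usually costs more than the cleaning itself.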
Given that cleaning text can sometimes be a very time-consuming task when the number of texts is huge, it would be really good if clean-text provided built-in multiprocessing.
It could be really simple: provide a flag, and add an option to pass in a list of texts instead of a single text.
What do you think?
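The flag-based proposal could be sketched as a single clean function that accepts either a string or a list, with an opt-in parallel flag. The parameter names parallel and processes are hypothetical, and _clean_one stands in for the real cleaning logic:

```python
from multiprocessing import Pool


def _clean_one(text: str) -> str:
    """Stand-in for the real cleaning logic in cleantext.clean."""
    return " ".join(text.lower().split())


def clean(text, parallel=False, processes=None):
    """Accept a single string or a list of strings.

    A single string keeps the current behaviour. A list is cleaned
    item by item, optionally across a process pool when parallel=True.
    (parallel/processes are illustrative names, not the library's API.)
    """
    if isinstance(text, str):
        return _clean_one(text)
    if parallel:
        with Pool(processes=processes) as pool:
            return pool.map(_clean_one, text)
    return [_clean_one(t) for t in text]
```

The upside is a single entry point; the downside, raised earlier in the thread, is that the return type now depends on the input type, which a separate clean_parallel would avoid.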