ENuge opened 8 years ago
Without having looked too deeply into the code, I think it could be possible to split the calls inside extract_from_dir(..) to extract_from_file(..) into multiple processes, one per filename (somewhere near here: https://github.com/python-babel/babel/blob/master/babel/messages/extract.py#L143).
Poked at the extraction code, wanted to get a sense of how much time is spent scanning and extracting strings vs actually writing them to the new file.
Hypothesis (and hope): scanning/extracting takes significantly longer, and that part of the process can be parallelized. We can't (easily) have concurrent writes to the same output file.
Results:
Total time extracting: 159.926753521 secs
Total time writing: 0.707601547241 secs
Code I ran: http://pastebin.com/5uSC15MN
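For context, a measurement like this can be taken with a small timing helper; the sketch below is illustrative (not the pastebin code), and the commented usage against babel's `extract_from_dir`/`write_po` is an assumption about how the two phases were separated.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Illustrative usage (babel API names assumed, phases timed separately):
# messages, extract_secs = timed(lambda: list(extract_from_dir("src/")))
# _, write_secs = timed(write_po, pot_file, catalog)
```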
Thanks for the numbers, looks like an interesting case for optimisation. Once our workflow is a bit smoother, I will take a look at it.
A large repo I am working with currently takes ~30 seconds to extract using xgettext (https://www.gnu.org/savannah-checkouts/gnu/gettext/manual/html_node/xgettext-Invocation.html). The same scan with pybabel extract takes ~2 min 49 secs.
The extraction process is, in both cases, blacklist-based, with the same directories blacklisted. That is, we scan the entire repo for new messages, skipping paths matched by a list of `[ignore: foo/**.py]` blocks.
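For reference, a blacklist-based Babel mapping file of the kind described might look like the following (the paths are made up; only the `[ignore: ...]` / `[python: ...]` block syntax is Babel's):

```ini
# babel.cfg -- scan everything, minus blacklisted trees
[ignore: foo/**.py]
[ignore: vendor/**]
[python: **.py]
```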