python-babel / babel

The official repository for Babel, the Python Internationalization Library
http://babel.pocoo.org/
BSD 3-Clause "New" or "Revised" License

Message extraction is slow for sufficiently large repos (consider parallelizing) #253

Open ENuge opened 8 years ago

ENuge commented 8 years ago

A large repo I am working with currently takes ~30 seconds to scan for messages using xgettext (https://www.gnu.org/savannah-checkouts/gnu/gettext/manual/html_node/xgettext-Invocation.html). The same scan with pybabel extract takes ~2 min 49 secs.

The extraction process is, in both cases, blacklist-based, with the same directories blacklisted. That is, we scan the entire repo for new messages except those in a list of [ignore: foo/**.py] blocks.

ENuge commented 8 years ago

Without having looked too deeply into the code, I think it could be possible to split the calls inside extract_from_dir(..) to extract_from_file(..) into multiple processes, one per filename (somewhere near here: https://github.com/python-babel/babel/blob/master/babel/messages/extract.py#L143).
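A minimal sketch of that idea, using only the standard library. `extract_one` is a hypothetical stand-in for the per-file extraction call (it only regex-matches `_("...")` calls) and is not Babel's API; in Babel the worker would presumably wrap extract_from_file(..) instead. The names `extract_parallel` and the `executor` parameter are also assumptions made for illustration.

```python
import re
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Iterable, List, Optional, Tuple

# Hypothetical stand-in extractor: matches _("...") or _('...') on one line.
_MSG_RE = re.compile(r"""\b_\(\s*(["'])(.*?)\1\s*\)""")

Message = Tuple[str, int, str]  # (filename, line number, message id)


def extract_one(filename: str) -> List[Message]:
    """Scan a single file and return the messages found in it."""
    found = []
    with open(filename, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            for match in _MSG_RE.finditer(line):
                found.append((filename, lineno, match.group(2)))
    return found


def extract_parallel(filenames: Iterable[str],
                     executor: Optional[Executor] = None) -> List[Message]:
    """Fan per-file extraction out to workers; collect results in the parent."""
    names = list(filenames)
    if executor is not None:
        # Caller-supplied executor (thread or process based).
        per_file = list(executor.map(extract_one, names))
    else:
        # Default: one pool of worker processes, files handed out in chunks.
        with ProcessPoolExecutor() as pool:
            per_file = list(pool.map(extract_one, names, chunksize=16))
    # map() preserves input order, so the merged output is deterministic.
    return [msg for messages in per_file for msg in messages]
```

Only the scanning runs in the workers; the parent process gathers the results and stays the single writer of the output file. When relying on the default process pool, call `extract_parallel` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.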

ENuge commented 8 years ago

I poked at the extraction code to get a sense of how much time is spent scanning and extracting strings versus actually writing them to the new file.

Hypothesis (and hope): scanning/extracting takes significantly longer. This part of the process we can parallelize. We can't (easily) have concurrent writes to the same output file.

Results:
Total time extracting: 159.926753521 secs
Total time writing: 0.707601547241 secs

Code I ran: http://pastebin.com/5uSC15MN
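For a rough sense of what parallelizing could buy, here is a back-of-the-envelope estimate based on the two timings above. It assumes extraction scales perfectly across worker processes and that writing stays serial; process startup, result pickling, and disk contention are ignored, so treat the numbers as an optimistic bound and not a measurement. `estimated_total` is a hypothetical helper.

```python
EXTRACT_SECS = 159.926753521  # measured above: time spent scanning/extracting
WRITE_SECS = 0.707601547241   # measured above: time spent writing the output


def estimated_total(workers: int) -> float:
    """Ideal wall-clock time if only the extraction phase is parallelized."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return EXTRACT_SECS / workers + WRITE_SECS


if __name__ == "__main__":
    serial = estimated_total(1)
    for n in (1, 2, 4, 8):
        total = estimated_total(n)
        print(f"{n} workers: ~{total:.1f} s (speedup ~{serial / total:.1f}x)")
```

Because writing accounts for well under 1% of the total, the serial part barely limits the achievable speedup at realistic worker counts.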

etanol commented 8 years ago

Thanks for the numbers, this looks like an interesting case for optimisation. Once our workflow is a bit smoother, I will take a look at it.