python-babel / babel

The official repository for Babel, the Python Internationalization Library
http://babel.pocoo.org/
BSD 3-Clause "New" or "Revised" License

Sharing message IDs between catalogs #571

Open Changaco opened 6 years ago

Changaco commented 6 years ago

We use Babel's read_po() function to load our webapp's translations, and we've realized that doing it this way stores the source strings (a.k.a. message IDs) once per catalog in memory, whereas they could be stored only once since they're common to all the PO files. With large numbers of messages and catalogs this can result in significant RAM consumption.

We fixed this inefficiency by creating the following share_source_strings function:

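def share_source_strings(catalog, shared_strings):
    """Share message IDs between catalogs to save memory.

    `shared_strings` maps each message ID to the one canonical object
    that every catalog should reference.
    """
    if not shared_strings:
        # First catalog: its IDs become the canonical objects.
        shared_strings.update((m.id, m.id) for m in catalog)
        return
    # Iterate over a copy because the catalog is modified in the loop.
    for m in list(catalog):
        if not m.id:
            # Skip entries with an empty ID, such as the catalog header.
            continue
        if m.id in shared_strings:
            # Point the message at the canonical ID, then re-insert it so
            # that the catalog's internal key uses the shared object too.
            m.id = shared_strings[m.id]
            # Pass the context so that messages with a msgctxt are handled.
            catalog.delete(m.id, m.context)
            catalog[m.id] = m
        else:
            shared_strings[m.id] = m.id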

and calling it after each read_po():

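from babel.messages.pofile import read_po

source_strings = {}
for f in po_files:
    # read_po() expects a file-like object.
    catalog = read_po(f)
    share_source_strings(catalog, source_strings)
    ...
# The pool is only needed while loading, so drop it afterwards.
del source_strings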
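
The underlying idea can be shown without Babel. The following is a minimal sketch using plain dicts as stand-ins for catalogs; `share_keys` and the sample translations are hypothetical names made up for this illustration and are not part of Babel.

```python
def share_keys(mapping, pool):
    """Return a copy of `mapping` whose keys are the canonical objects in `pool`.

    `pool` maps each key to the first equal object seen; later equal keys
    are replaced by that object, so only one copy stays referenced.
    """
    return {pool.setdefault(key, key): value for key, value in mapping.items()}


# Build equal-but-separate key strings at runtime, roughly as happens when
# each PO file is parsed on its own.
fr = {"".join(["Hello, ", "world!"]): "Bonjour, le monde !"}
de = {"".join(["Hello, ", "world!"]): "Hallo, Welt!"}

pool = {}
fr = share_keys(fr, pool)
de = share_keys(de, pool)
del pool  # only needed while loading

fr_key = next(iter(fr))
de_key = next(iter(de))
print(fr_key is de_key)
```

After sharing, both dicts reference the same key object, which is what lets the duplicate copies be garbage-collected.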

Maybe a similar mechanism could be integrated into Babel so that memory usage would be optimized by default? If not, then a note could be added in the documentation about how to optimize the memory footprint of catalogs.

akx commented 6 years ago

Interesting! Do you have any benchmarks as to how much memory this is actually saving?

Also, dunno if it'd help, but there's the sys.intern() function too.

Changaco commented 6 years ago

Currently our app has 29 catalogs containing 1179 message IDs each, and get_size() tells me that the keys of one catalog use 306kB of memory, so sharing them saves 8.58MB.
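
The get_size() helper is not shown in this thread. As a rough stand-in, the sketch below sums sys.getsizeof() over the keys of a mapping and then estimates the savings as the per-catalog cost times the number of catalogs minus one (one copy has to stay). Both `keys_size` and `estimated_savings` are hypothetical helpers, and the result is only an approximation of real memory use.

```python
import sys


def keys_size(mapping):
    """Approximate the bytes used by the key objects of `mapping`.

    Tuples are counted together with their elements, since a key may be
    a tuple of strings.
    """
    total = 0
    for key in mapping:
        total += sys.getsizeof(key)
        if isinstance(key, tuple):
            total += sum(sys.getsizeof(part) for part in key)
    return total


def estimated_savings(per_catalog_bytes, n_catalogs):
    """Bytes saved when all catalogs share a single copy of the keys."""
    return per_catalog_bytes * (n_catalogs - 1)


# Figures quoted above: about 306 kB of keys per catalog, 29 catalogs.
print(estimated_savings(306_000, 29))
```

With the rounded 306 kB figure this gives about 8.57 MB, in line with the number quoted above.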

sys.intern() is a great suggestion; as a bonus, it could speed up message lookups. It must have a small memory cost though, since CPython needs to keep track of interned strings, whereas our code above throws away the source_strings dictionary as soon as we've loaded all the PO files.

Changaco commented 6 years ago

I've just tried intern(). It's not a drop-in replacement because it's limited to strings, whereas message IDs can be tuples of strings.

Edit: moreover, intern() only supports one type of string: str, which is a byte string on Python 2 and a Unicode string on Python 3.
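
One possible way around the tuple limitation, sketched here for Python 3 only, is to intern each string inside a tuple and keep a small dict to deduplicate the tuples themselves. `intern_message_id` and `shared_tuples` are hypothetical names for this illustration, not something Babel provides, and the sketch assumes IDs are plain str objects or tuples of them.

```python
import sys


def intern_message_id(msg_id, shared_tuples):
    """Return a canonical object for `msg_id`.

    Plain strings go through sys.intern(). Tuples of strings have each
    element interned and are then deduplicated through `shared_tuples`,
    because sys.intern() itself does not accept tuples.
    """
    if isinstance(msg_id, tuple):
        interned = tuple(sys.intern(part) for part in msg_id)
        return shared_tuples.setdefault(interned, interned)
    return sys.intern(msg_id)


shared_tuples = {}
# Strings built at runtime, so they start out as separate objects.
first = intern_message_id(("".join(["one ", "apple"]), "apples"), shared_tuples)
second = intern_message_id(("".join(["one ", "apple"]), "apples"), shared_tuples)
print(first is second)
```

The dict for tuples can be discarded after loading, like source_strings above, while the interned strings remain tracked by the interpreter.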