masaccio / numbers-parser

Python module for parsing Apple Numbers .numbers files
MIT License
201 stars 14 forks source link

_NumbersModel.table_string_key degrades performance #45

Closed takaakifuji closed 1 year ago

takaakifuji commented 1 year ago

Thanks for maintaining the useful parser!

https://github.com/masaccio/numbers-parser/blob/8dcb81db6fd055ed1b6cfb8b4527435a0de227b6/src/numbers_parser/model.py#L226

When handling a gigantic table, the line where generates a dictionary on every call extremely degrades the performance. Also the line key = max(string_lookup.values()) + 1 seems very expensive as well.

class TableStringsStore(object):
    def __init__(self, entries):
        self.entries = entries
        self.lookup = {x.string: x.key for x in entries}
        self.max_value = max(self.lookup.values()) if self.lookup else 0
    def append_entry(self, entry):
        self.entries.append(entry)
        self.lookup[entry.string] = entry.key
        self.max_value = max(self.max_value, entry.key)
    def clear_field_container(self):
        clear_field_container(self.entries)
        self.lookup = {}
        self.max_value = 0

This is just a sketch, but incrementally updating the string lookup dict and the max value via the wrapper like above did drastically improve the performance when saving a file. Annotating table_string_key() with @lru_cache also looked effective.

masaccio commented 1 year ago

Many thanks for spotting that. I've refactored how I manage datalists which are used in various places and for the strings I should now be efficiently cacheing. There are other performance issues I am aware of in saving files but I've not done profiling on that yet.

Do keep the feedback coming - I'm very happy to see someone using the save functions.

You should see the improvements in 3.9.3.

takaakifuji commented 1 year ago

I'm very looking forward to see the improvement and the upcoming release! Another low-hanging fruit I found was to utilize cache in iwafile.find_references (called from containers.ObjectStore.update_dirty_objects). For me the saving feature is fascinating because I think Numbers is a nice tool to present tabular data less cluttered than Excel. I appreciate your effort and keep up the good work!

masaccio commented 1 year ago

Yes update_dirty_objects is not overly smart. It is also overly greedy in computing new dependencies rather than tracking what's actually dirty.