Performance after refactor

robertbuecker commented 3 years ago

Hi Alister,

I've noticed that after the refactor in commit 05ac4282 the parsing performance dropped by orders of magnitude (from 0.4 s to 74 s for a rather big data file).

After line profiling, the clear offender are the very atomic add_line and add_newline methods in TextBuffer which add single lines to the buffer to be read into the DataFrames.

Bypassing the TextBuffer object altogether by replacing the parser._parse_loop_data method with its pre-refactor equivalent (i.e., keeping the text buffer a plain local variable) restores the performance, with the actual pd.read_csv requiring the most time. This is consistent with my previous experience with parsing complex (non-.star) text formats that putting too much abstraction around performance-critical parts might not be a good idea.

For now I'm going to stick to the old version, please let me know if I can be of any help.

Robert

robertbuecker commented 3 years ago

As a side note, a marginally faster (arguably more elgant way) would be to use the StringIO as buffer itself, like this:

    def _parse_loop_data(self) -> pd.DataFrame:

        with StringIO() as buffer:
            while self.crawler.current_line_number <= self.n_lines:
                if self.crawler.current_line.startswith('data_'):
                    break
                buffer.write(self.crawler.current_line + '\n')
                self.crawler.increment_line_number()

            buffer.seek(0)
            df = pd.read_csv(buffer, delim_whitespace=True, header=None, comment='#')

        return df

alisterburt commented 3 years ago

Hey Robert,

Big thanks for the heads up and for taking the time to dig a little deeper, it's really much appreciated!

This is frustrating, I was quite happy with the TextBuffer and TextCrawler objects :laughing: - I like your StringIO implementation though, I wasn't exactly sure what was in memory and what was being written to disk with that which is why I went down the 'create your own' object route, should've spent time reading up

This highlights a few things

tests should be added for performance to enable detection of this type of problem during testing
the TextBuffer object should be scrapped at least for loop blocks, probably for everything (or we end up with multiple ways of doing things within the same object which feels bleh)

Do you have any experience with testing for performance related things like this? It's not something I've had to do before

Is this performance hit specifically an implementation issue?

    def add_newline(self):
        self.buffer += '\n'

    def add_line(self, line):
        self.buffer += f'{line}'
        self.add_newline()

Would removing the add_newline method and replacing add_line with

    def add_line(self, line):
        self.buffer = f'{self.buffer}{line}\n'

this seems to indicate that f strings should be faster? The other methods such as add comment could be refactored in a similar way

I'd really appreciate some help knocking this back into shape if you have time? I'm in PhD writing mode at the minute and although sometimes these things are a welcome distraction I should definitely try to focus, at least during the working day :octopus:

alisterburt commented 3 years ago

I've also yanked 0.4.0 from PyPi to stop others from running into the same issue, this means that the API in the readme is no longer valid... I'll hope nobody complains for now and hopefully if anyone has problems they'll open an issue

robertbuecker commented 3 years ago

Hi Alister, I've tried your suggestion already earlier (as first sanity check). Bypassing the add_newline function roughly halves the runtime, but the main issue is not solved (indeed f-strings are fast, though). My guess is that the very granular calls require making copies at various points, but I'm not deep enough into these things... TBH, my suggesting is to double-check whether pure container classes like TextBuffer and TextCrawler are really needed. It's a lot of boilerplate code without improving readability a lot, especially for such a focused tool. I can look into it a bit, but also will have to make sure not to get carried away too much from my actual data analysis (rabbit hole). Anyway - good luck with thesis writing!

alisterburt commented 3 years ago

Really interesting, looks like a local variable is indeed the way to go then, the more you know!

Thanks, and I hope your data aren't giving you too much trouble!

alisterburt commented 3 years ago

@brisvag implemented a nice solution using deque which I didn't know about and which is faster than 0.3.3, really nice!

This let's us keep the crawler/buffer objects and the performance

teamtomo / starfile

Performance after refactor #11