Closed: robertbuecker closed this issue 3 years ago
As a side note, a marginally faster (and arguably more elegant) way would be to use the StringIO as the buffer itself, like this:

```python
def _parse_loop_data(self) -> pd.DataFrame:
    with StringIO() as buffer:
        while self.crawler.current_line_number <= self.n_lines:
            if self.crawler.current_line.startswith('data_'):
                break
            buffer.write(self.crawler.current_line + '\n')
            self.crawler.increment_line_number()
        buffer.seek(0)
        df = pd.read_csv(buffer, delim_whitespace=True, header=None, comment='#')
    return df
```
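For readers without the surrounding parser class: here is a self-contained sketch of the same idea, with made-up loop-block lines standing in for what the crawler would yield (the `lines` list is purely illustrative, not from the library):

```python
from io import StringIO

import pandas as pd

# Hypothetical loop-block lines, as a crawler might yield them
lines = [
    '1.0  2.0  3.0',
    '4.0  5.0  6.0',
    '# a comment line',
    '7.0  8.0  9.0',
]

with StringIO() as buffer:
    for line in lines:
        buffer.write(line + '\n')  # append directly into the in-memory buffer
    buffer.seek(0)                 # rewind before handing it to pandas
    df = pd.read_csv(buffer, delim_whitespace=True, header=None, comment='#')

print(df.shape)  # (3, 3): the comment line is skipped
```

Because `StringIO` is file-like, pandas can consume it directly with no intermediate string copy.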
Hey Robert,
Big thanks for the heads up and for taking the time to dig a little deeper, it's really much appreciated!
This is frustrating, I was quite happy with the `TextBuffer` and `TextCrawler` objects :laughing: - I like your StringIO implementation though. I wasn't exactly sure what StringIO kept in memory and what it wrote to disk, which is why I went down the 'create your own object' route; I should've spent more time reading up!
This highlights a few things:

- The `TextBuffer` object should be scrapped, at least for loop blocks, and probably for everything (or we end up with multiple ways of doing things within the same object, which feels bleh).
- Do you have any experience with testing for performance-related things like this? It's not something I've had to do before.
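One lightweight way to test this kind of thing (a sketch of my own, not code from the thread) is a `timeit` micro-benchmark comparing the buffering strategies side by side:

```python
import timeit
from io import StringIO

LINES = ['some star file line'] * 10_000

def concat_str():
    # repeated str += can copy the growing buffer on each append
    buffer = ''
    for line in LINES:
        buffer += line + '\n'
    return buffer

def join_list():
    # accumulate nothing: one join at the end builds the string once
    return '\n'.join(LINES) + '\n'

def write_stringio():
    # StringIO appends into an in-memory buffer without full copies
    buffer = StringIO()
    for line in LINES:
        buffer.write(line + '\n')
    return buffer.getvalue()

for fn in (concat_str, join_list, write_stringio):
    t = timeit.timeit(fn, number=50)
    print(f'{fn.__name__}: {t:.3f} s')
```

All three produce the identical string, so the timings isolate the cost of the buffering strategy itself. (Note that CPython sometimes optimizes `str +=` on local variables in place, which is one reason the old local-variable version performed fine.)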
Is this performance hit specifically an implementation issue?
```python
def add_newline(self):
    self.buffer += '\n'

def add_line(self, line):
    self.buffer += f'{line}'
    self.add_newline()
```
Would removing the `add_newline` method and replacing `add_line` with

```python
def add_line(self, line):
    self.buffer = f'{self.buffer}{line}\n'
```

help? This seems to indicate that f-strings should be faster. The other methods, such as `add_comment`, could be refactored in a similar way.
I'd really appreciate some help knocking this back into shape if you have time? I'm in PhD writing mode at the minute and although sometimes these things are a welcome distraction I should definitely try to focus, at least during the working day :octopus:
I've also yanked 0.4.0 from PyPI to stop others from running into the same issue. This means the API in the readme is no longer valid... I hope nobody complains for now, and if anyone has problems, hopefully they'll open an issue.
Hi Alister,
I've tried your suggestion already, as a first sanity check. Bypassing the `add_newline` function roughly halves the runtime, but the main issue is not solved (f-strings are indeed fast, though). My guess is that the very granular calls require making copies at various points, but I'm not deep enough into these things... TBH, my suggestion is to double-check whether pure container classes like `TextBuffer` and `TextCrawler` are really needed. It's a lot of boilerplate code without improving readability much, especially for such a focused tool. I can look into it a bit, but I also have to make sure not to get carried away too far from my actual data analysis (rabbit hole).
Anyway - good luck with thesis writing!
Really interesting, looks like a local variable is indeed the way to go then, the more you know!
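The pre-refactor "plain local variable" approach that restored the performance presumably looked something like this (a hypothetical reconstruction, not the actual `starfile` code; `parse_loop_data` and its input are illustrative):

```python
from io import StringIO

import pandas as pd

def parse_loop_data(lines):
    """Hypothetical stand-in for the pre-refactor _parse_loop_data:
    accumulate loop lines in a plain local string, then parse once."""
    loop_text = ''
    for line in lines:
        if line.startswith('data_'):  # the next data block starts here
            break
        loop_text += line + '\n'      # += on a local str is cheap in CPython
    return pd.read_csv(StringIO(loop_text), delim_whitespace=True,
                       header=None, comment='#')

df = parse_loop_data(['1 2 3', '4 5 6', 'data_next_block', '9 9 9'])
print(df.shape)  # (2, 3): parsing stops at the next data block
```

The key difference from the `TextBuffer` version is that the string never crosses an attribute-access/method-call boundary on every line, so the interpreter can keep the append fast.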
Thanks, and I hope your data aren't giving you too much trouble!
@brisvag implemented a nice solution using `deque`, which I didn't know about and which is faster than 0.3.3, really nice! This lets us keep the crawler/buffer objects and the performance.
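The exact deque-based implementation isn't shown in this thread, but the idea is presumably along these lines (a minimal hypothetical sketch, not the actual starfile code):

```python
from collections import deque
from io import StringIO

import pandas as pd

class TextBuffer:
    """Minimal sketch of a deque-backed line buffer (hypothetical)."""

    def __init__(self):
        # appending to a deque is O(1) and never copies existing lines
        self.lines = deque()

    def add_line(self, line):
        self.lines.append(line)

    def as_str(self):
        # one join at the end instead of many incremental concatenations
        return '\n'.join(self.lines) + '\n'

buffer = TextBuffer()
for line in ['1 2 3', '4 5 6', '7 8 9']:
    buffer.add_line(line)

df = pd.read_csv(StringIO(buffer.as_str()), delim_whitespace=True, header=None)
print(df.shape)  # (3, 3)
```

This keeps the buffer object's API intact while avoiding the per-line string copies that caused the slowdown.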
Hi Alister,
I've noticed that after the refactor in commit 05ac4282 the parsing performance dropped by orders of magnitude (from 0.4 s to 74 s for a rather big data file).

After line profiling, the clear offenders are the very atomic `add_line` and `add_newline` methods in `TextBuffer`, which add single lines to the buffer to be read into the `DataFrame`s.

Bypassing the `TextBuffer` object altogether, by replacing the `parser._parse_loop_data` method with its pre-refactor equivalent (i.e., keeping the text buffer a plain local variable), restores the performance, with the actual `pd.read_csv` call requiring the most time. This is consistent with my previous experience parsing complex (non-`.star`) text formats: putting too much abstraction around performance-critical parts might not be a good idea.

For now I'm going to stick to the old version; please let me know if I can be of any help.
Robert