Closed bschollnick closed 5 years ago
Taking a look, and I don't see any obvious optimization for the .getvalue function. It's a simple ''.join(self.result), so it can't get much simpler...
That being said, if result has ~2143 entries it takes about 45 seconds to build the final document. If result has ~5559 entries, it takes more than 2 minutes...
Okay, it does not appear to be getvalue; instead it's indent.
On a 3000-row table, getvalue timings without indent: 0.0349, 0.0380, 0.02105 seconds. With indent: 83.7287, 86.50575 seconds.
So performance is dramatically slower when indent is used...
That's what I thought: it has to come from the indent function because, as you said, getvalue() just does ''.join(self.result), so it shouldn't take much longer than building the document with the tag, text, and asis functions (and you said that step was fast).
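For reference, ''.join builds the output string in a single pass, so a getvalue() implemented that way scales linearly with document size. A quick sketch (the function names here are illustrative, not Yattag internals):

```python
import timeit

# 100,000 buffered fragments, similar in spirit to self.result
fragments = ["<td>cell</td>"] * 100_000

def getvalue_style():
    # mirrors what getvalue() does: one join over the buffered pieces
    return "".join(fragments)

def naive_concat():
    # repeated += can copy the growing string on each step,
    # which is quadratic in the worst case
    out = ""
    for f in fragments:
        out += f
    return out

print(timeit.timeit(getvalue_style, number=10))
print(timeit.timeit(naive_concat, number=10))
```

Both produce the same string; join is the idiomatic way to flush a list buffer, which is why this step is fast even on large documents.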
The indent function was written in pure Python, so maybe you can find a library that indents faster. That is, if you really need to indent that table.
I guess a 3000-row table isn't intended to be read by humans as HTML anyway, so maybe you could just remove the indentation step. If you do find a better library for indentation, it would be interesting to know how fast it is on the same input you're testing with here. If the difference is really dramatic, maybe I'll look into my code to see if I can improve the indent function. Or maybe I'll just shamelessly import from their library to replace Yattag's indent function.
I understand that you might be dealing with private data here, so you probably can't link to the exact sample. But could you link an example of an HTML table that would be that slow to indent? I'm curious, because I just indented a 3000-row table in a fraction of a second.
Strange... The slowdown is very repeatable here... If I have a chance I'll see if I can either mock up a fake data set or isolate this further...
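A synthetic table along these lines (the row count, column count, and cell contents are made up, purely to match the shape described above) would reproduce the problem without exposing any private data:

```python
import random

def make_fake_table(rows=3000, cols=12):
    """Build an HTML table roughly the size discussed here (3000 rows x 12 columns)."""
    random.seed(0)  # reproducible fake data
    parts = ["<table>"]
    for r in range(rows):
        cells = "".join(
            f"<td>row{r}-col{c}-{random.randint(0, 99999)}</td>"
            for c in range(cols)
        )
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    # no whitespace between tags, like Yattag's own output
    return "".join(parts)

html = make_fake_table()
print(len(html))
```

Feeding a string like this to the indent function should show whether the slowdown depends on document size alone or on something specific to the real data.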
How large (file size) was your test file? My current file is ~2.8 MB, and I expect it to get larger...
Windows 10 (64bit), using Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32.
Machine specs: i5-6300U @ 2.4 GHz with 8 GB of RAM.
It was a smaller file (200 KB, generated with Yattag, so without any whitespace between the tags). I just checked with a 2.8 MB file and it is indeed slow (about 8 seconds). I did some profiling and I think I see where the problem comes from: on line 160 of indentation.py, the tokenizer creates a new string each time it correctly identifies a token and progresses further through the text. We should just advance an integer index instead. I will try to publish a fix soon.
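The fix described here is the classic move from re-slicing the input string to advancing an integer cursor. A minimal tokenizer sketch (not Yattag's actual code) showing both patterns:

```python
def tokenize_slicing(text):
    # Anti-pattern: text = text[end:] copies the whole remaining string on
    # every token, so total work grows quadratically with document size.
    tokens = []
    while text:
        end = text.find(">") + 1
        if end == 0:            # no '>' left, take the rest
            end = len(text)
        tokens.append(text[:end])
        text = text[end:]       # new string allocated each iteration
    return tokens

def tokenize_cursor(text):
    # Fix: keep the string intact and move an integer index instead.
    tokens = []
    pos = 0
    n = len(text)
    while pos < n:
        end = text.find(">", pos) + 1
        if end == 0:            # no more '>', take the rest
            end = n
        tokens.append(text[pos:end])
        pos = end               # O(1) progress, no copy
    return tokens

doc = "<table><tr><td>x</td></tr></table>"
assert tokenize_slicing(doc) == tokenize_cursor(doc)
```

On a 2.8 MB input the copying version re-allocates megabytes of string per token, which is consistent with the seconds-to-minutes timings reported above.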
I did what I said and it's faster now, about 8 to 10 times faster. From my few tests it's now faster at indenting than pup (an HTML command-line tool). It's also faster than xml.dom.minidom.parseString(content).toprettyxml(). I don't have time to run benchmarks against other tools right now.
You can pip install --upgrade yattag to get the new version.
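For anyone who wants to reproduce the comparison, the stdlib pretty-printer mentioned above can be run like this (note that minidom requires well-formed XML, so it may reject real-world HTML that Yattag's indent accepts):

```python
from xml.dom.minidom import parseString

content = "<table><tr><td>a</td><td>b</td></tr></table>"

# toprettyxml re-serializes the parsed tree with the given indentation
pretty = parseString(content).toprettyxml(indent="  ")
print(pretty)
```

Timing this call against yattag's indent on the same input string gives a like-for-like benchmark.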
Folks,
I'm using the following code... with a document that has several smaller tables and one extremely large table (e.g. 5341 table rows with 12 columns). It builds the document extremely fast, but when I call doc.getvalue() it takes 15 seconds with 1000 rows, about 45 seconds with 2500 rows, and I haven't been patient enough to time the 5000+ row table (at least 2+ minutes).
The above is an example of a different, smaller table; the extractor I'm writing deals with HIPAA-related content...
I'm going to start digging around in the code to see if I can find any good optimizations for Yattag, but any assistance would be appreciated. Is this a known issue? Are there any recommended workarounds?