jandrew / Spreadsheet-XLSX-Reader-LibXML

Read spreadsheet files with xlsx extensions

Do some super large file testing #14

Open jandrew opened 9 years ago

jandrew commented 9 years ago

For now, 50MB+ is a goal that shouldn't crash the reader. I probably won't add tests to the distribution; just record the results and possible techniques here.
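Something like the sketch below is the shape of harness I have in mind (untested; it assumes the ParseExcel-style read methods from the synopsis, and uses Excel::Writer::XLSX only to generate a throwaway fixture; tune the row/column counts until the saved file clears 50MB):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw( time );
use Excel::Writer::XLSX;
use Spreadsheet::XLSX::Reader::LibXML;

# Build a large fixture file (dimensions are arbitrary placeholders)
my $file        = 'big_test.xlsx';
my $writer_book = Excel::Writer::XLSX->new( $file );
my $sheet       = $writer_book->add_worksheet();
for my $row ( 0 .. 49_999 ) {
    for my $col ( 0 .. 49 ) {
        $sheet->write( $row, $col, "cell_${row}_${col}" );
    }
}
$writer_book->close();

# Time a full read of every cell
my $start    = time;
my $parser   = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( $file )
    or die $parser->error(), "\n";
my $count = 0;
for my $worksheet ( $workbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            $count++ if defined $worksheet->get_cell( $row, $col );
        }
    }
}
printf "read %d cells in %.1f seconds\n", $count, time - $start;
```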

morungos commented 9 years ago

Can speed be part of this too? I'm getting slow performance compared to Spreadsheet::XLSX. I'm handling ~2000 rows of ~50 columns, and performance seems to be progressively degrading as the run continues. Total runtime is 2+ orders of magnitude slower than Spreadsheet::XLSX (i.e., these files take hours just to read all the cells with :just_the_data). I did a little Devel::NYTProf profiling on it, and a surprisingly large proportion of the time was in ::Worksheet::get_format_position, and a fair bit in ::XMLReader::start_the_file_over, which seems an odd pattern.
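For reference, the profiling run was nothing fancy; roughly this (the driver script name is my own):

```
perl -d:NYTProf read_sheet.pl   # read_sheet.pl is my driver script
nytprofhtml                     # writes the report to ./nytprof/index.html
```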

Sadly I can't send the data: it's confidential, but I don't think it's all that unusual.

jandrew commented 9 years ago

Stewart,

Are your large files primarily numeric or text based? I would probably solve the speed issue in two different ways depending on the data you are reading.

::XMLReader::start_the_file_over is called a lot because of the non-sequential nature of the metadata stores. Solving this one requires reading the whole file into memory as a Perl structure and then querying it, rather than parsing it in real time. This was an intentional initial design choice: as I mention in the issue statement above, I have had RAM issues with large sheets for both Spreadsheet::XLSX and Spreadsheet::ParseXLSX. Long term I plan to support big-and-slow as well as smaller-and-faster parsing. Ultimately, based on your available machine RAM, big and not crashing will almost certainly require slow. Yes, I know that Excel files shouldn't be built like that, but I run into them in the wild all the time.

On the other hand, I did do some work for the release of version v0.38.2 that cached the shared strings file in a JIT fashion so it is only read once. This was made the default for the flag :just_the_data in version v0.38.6. I will add additional caching to this request as well.
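Stripped of the module's plumbing, the JIT caching amounts to this pattern (illustrative only, not the actual internals; the package and attribute names are made up):

```perl
package JITCache;    # hypothetical name, for illustration only
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;
    # 'loader' stands in for whatever expensive call pulls one shared
    # string out of the xml (a file re-read in the worst case)
    return bless { loader => $args{loader}, cache => {} }, $class;
}

sub get {
    my ( $self, $position ) = @_;
    # parse each position at most once; every later request for the
    # same position is a hash lookup instead of an xml re-read
    $self->{cache}{$position} //= $self->{loader}->( $position );
    return $self->{cache}{$position};
}

1;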

::Worksheet::get_format_position is called each time the coerced value (from unformatted) is requested (as it is with :just_the_data). If you are parsing text only, or are OK with raw numbers, I will add another flag, :just_raw_data, in v0.38.16 that avoids this call. However, this method is also the next candidate (#69) for caching, and you are not the only one with a speed request. (It won't be ready for v0.38.16, though.)
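The caching itself would be the same memo trick again, keyed by style rather than position, since every cell sharing a style id coerces the same way. A sketch under assumed names (I haven't settled the real signature):

```perl
my %format_cache;

sub cached_format_position {
    my ( $self, $style_id ) = @_;    # $style_id is a hypothetical key
    # resolve each distinct style once per sheet instead of once per cell
    $format_cache{$style_id} //= $self->get_format_position( $style_id );
    return $format_cache{$style_id};
}
```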

I also seem to remember that the test file you sent me last was full of a number of empty cells that had formatting but no values, and I have made an attempt to skip those. However, I'm currently chasing a bug where some of these cells are still not being skipped in my data; if I can figure out why and start skipping them, that should provide a speed boost too. I'm hoping to roll that into v0.38.16 since I'm root-causing it now.

morungos commented 9 years ago

It's almost entirely text, but there are a fair number of dates too, and these are probably numeric underneath somewhere.

What's puzzling me is that the file isn't really all that big, certainly not 50MB. It's probably 20,000 actual data-containing cells at most.

I completely understand the large file issue: I've had this before. One trick I have used is to scan everything with a twig parser, build an index of the leaves of interest (probably cells) in string form, certainly not as DOM elements, and then re-parse them as needed, though I'd not usually use XML for this (often I've just used Perl's Storable freeze/thaw). I don't think Perl+DOM is ever friendly to memory.
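In case it's useful, that trick looks roughly like this (element names and paths assume an unzipped worksheet part; the index file name is arbitrary):

```perl
use strict;
use warnings;
use XML::Twig;
use Storable qw( store retrieve );

# Pass 1: stream the sheet, keeping each cell only as its string form
# keyed by its A1-style reference, purging as we go so memory stays flat.
my %cell_xml;
my $twig = XML::Twig->new(
    twig_handlers => {
        'c' => sub {                  # 'c' is the cell element in sheet xml
            my ( $t, $cell ) = @_;
            $cell_xml{ $cell->att('r') } = $cell->sprint;
            $t->purge;                # free everything parsed so far
        },
    },
);
$twig->parsefile( 'xl/worksheets/sheet1.xml' );   # unzipped from the xlsx

# Park the index on disk; thawing it later is far cheaper than
# re-walking the xml.
store( \%cell_xml, 'sheet1_index.storable' );

# Later: re-parse only the cell of interest, on demand.
my $index = retrieve( 'sheet1_index.storable' );
my $cell  = XML::Twig->new()->safe_parse( $index->{'B2'} );
```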

And yes, I've just come across lots of blank rows in input data too, sorry for that, but I like to leave my input data in whatever form I get it, even if it is broken. I had the exact same issue: the blank cells weren't being skipped by my old ::XLSX script, and it was leaking XML elements into the values! So I'm desperate to migrate away.

jandrew commented 9 years ago

OK, so it sounds like the flag :just_raw_data won't give you any boost since you have dates in your data, but I think I will leave it in the plan anyway since the cost is low.

I suspect that the next two fixes, #69 (caching the Styles file) and the last (currently unlisted) bug where blank-but-formatted cells are not always skipped, will give you the biggest boost. They will likely be implemented in reverse order, with #69 coming in release v0.40.2 since I am reworking the guts for that one anyway. If your sheets have a lot of blank cells, the blank-cell-skip fix should boost speed too, since it should short-circuit the depth of the file read at a much earlier point.

jandrew commented 9 years ago

I just released v0.38.16 with caching for the Styles file and row-level parsing rather than just cell-level parsing. Both should clean up some of the older cruft as well as give a long-term speed boost. As a note, the initial open time could be a bit slower as a consequence.

jandrew commented 8 years ago

While I haven't really started working on this issue, I have logged issue #93 as one observed large-file processing problem.