jmcnamara / excel-writer-xlsx

Perl module to create Excel XLSX files.
https://metacpan.org/pod/distribution/Excel-Writer-XLSX/lib/Excel/Writer/XLSX.pm
Other
100 stars 51 forks source link

Excessive memory use apparently occurs when write_string() is used with a "forced" Text cell format #266

Closed marcschwartz closed 3 years ago

marcschwartz commented 3 years ago

Hi,

Within a larger R language related package, I have a Perl sub, called use_write_string(), that is defined within a larger Perl script, where the sub is added via add_write_handler() in the parent routine:

$WorkSheet->add_write_handler(qr[\w], \&use_write_string);

which is then used in the course of calling:

$WorkSheet->write($Row, $Column, $Fld);

within the row/column nested loop over the source data structure content.

The sub is:

sub use_write_string {

  my $worksheet = shift;
  my $token     = $_[2];

  # Add text format for use by write_string()
  my $text_format = $XLSFile->add_format(num_format => '@'); 

  # use this all the time
  if ($AllText eq "TRUE") {
    return $worksheet->write_string(@_, $text_format);

  # single leading zero followed by single decimal point and numbers only
  # e.g. 0.1234
  # write as number, since Excel will keep the zero.
  # return control to write();  
  } elsif ($token =~ /^0\.[0-9]+$/) {
    return undef;

  # anything with a leading zero followed by any
  # characters, other than a single decimal as above
  # since Excel will strip the leading zero, if
  # it can be converted to a valid number  
  # e.g. 01234 (zip code), 00, 01234.1234 or other identifiers
  } elsif ($token =~ /^0.+$/) {
    return $worksheet->write_string(@_, $text_format);

  # trailing zeroes preceded
  # by any digits only, write as an integer
  # e.g. 12340, 123400, 120340
  # Return control to write();  
  } elsif ($token =~ /^[0-9]+0+$/) {
    return undef;

  # trailing zeros after a decimal point
  # which Excel will strip to an integer
  # e.g. 1234.0, 1234.00, .0  
  } elsif ($token =~ /^.*\.[0-9]*0+$/) {
    return $worksheet->write_string(@_, $text_format);

  # else return control to write();  
  } else {
    return undef;
  }
}

The above sub is used in lieu of setting keep_leading_zeros(), as this setting does not behave in a consistent manner with a variety of erstwhile numeric formats that I deal with in the source data, where that content needs to be written 'as is', as text, when written to an Excel file, rather than being converted to numeric by default, stripping needed characters from the content.

I made a recent change to the above sub, as content was being written to the Excel worksheet cells as a "General" format, versus explicitly as a "Text" format, and this was causing a problem in certain use cases for package users.

The key change was the specification within the sub of:

my $text_format = $XLSFile->add_format(num_format => '@');

and then the use of:

return $worksheet->write_string(@_, $text_format);

as seen above, instead of:

return $worksheet->write_string(@_);

where the former line now forces the relevant content to be written to "Text" formatted cells.

A user has just reported that using the new version of the R package that has no changes other than the above cell format change in the Perl script sub, that in the case of a source data structure, which has 200,000 rows and 2 columns, resulting in an Excel file that is around 3 Mb in size, that memory use by Perl has increased from a peak of around 300 Mb, to now around 2 Gb on their Linux based system, and I have replicated this now on my macOS system.

I am going to presume that I have done something wrong here, rather than this being a bug, which is resulting in this substantial relative increase in memory requirements for such a small dataset. However, I am not clear on what that might be and looking for any guidance that can help resolve the issue.

Please let me know if you need more information, if the issue is not clear.

Thanks!

jmcnamara commented 3 years ago

Hi Marc,

I'd guess that issue is that the following is creating a new format object each time the function is called:

my $text_format = $XLSFile->add_format(num_format => '@');

Excel::Writer::XLSX removes duplicate formats but only when the file is closed/saved. So the memory use will increase with each function call.

You'll need to make the variable static to the function by moving the initialization out of the function (the simplest fix) or by declaring it with "state" which would require perl 5:10.

P.S. I'm sending this from my phone, I'll check in more detail later.

John

marcschwartz commented 3 years ago

Hi John,

Thanks for your expedient reply here.

You appear to be correct. I have moved the above line of code outside of the sub, and near the initialization of $XLSFile in the main routine.

I had not considered that a persistent format object would be created with each cycle of the nested loops, allocating more memory for each new object with each cycle. Your explanation makes all the sense in the world.

With this change, the memory utilization, curiously, actually went down relative to the earlier version, using the same example code, and is around 215 Mb now.

Thanks for your quick reply and assistance here!

I will close the issue with this comment.

jmcnamara commented 3 years ago

Good stuff Marc. I'm glad that worked. Coincidentally, I have been using R a bit recently. Although not to generate any Excel files. :-)

marcschwartz commented 3 years ago

Hi John,

Glad to hear that you are using R.

I am a biostatistician, involved in clinical research, and have been using R for around 20 years, as it is my primary analytic tool.

The R package in question is WriteXLS:

https://cran.r-project.org/web/packages/WriteXLS/index.html

I created it some years ago, when I was running on Fedora Linux, to enable cross-platform exports of R data frames to Excel files, which was a request by numerous clients and it has evolved over time.

I will be submitting what will become version 6.3.0 yet today, with the bug fix, and it should begin to show up on the CRAN network in the next few days.

Thanks again for your expedient help here!

jmcnamara commented 3 years ago

That's good. I'm happy to see Excel::Writer::XLSX getting some use in R. Thanks for the work on that.

I also wrote a C version of Excel::Writer::XLSX called libxlsxwriter that someone wrapped into a similar R library: https://docs.ropensci.org/writexl/

I also wrote a Python version called XlsxWriter which I helped integrate into Python Pandas. Pandas has a option to expose the underlying workbook and worksheets that it is using so users can take advantage of features such as Charts, Conditional Formatting and Tables. See for example: Working with Python Pandas and XlsxWriter. That probably isn't feasible, or at least easy, in R but you might be interested in it.

Anyway, let me know if you need any help with your R library in future.

marcschwartz commented 3 years ago

Hi John,

Thanks for the additional information. I am aware of some of the other implementations in R for reading and/or writing Excel files, and most depend upon Python or Java, as an alternative to Perl. There is also the so-called Tidyverse implementation, which I believe depends upon C/C++, and tends to be popular with RStudio users, which is one of the more popular add-on R GUIs.

Much of the differentiation in the R Excel related implementations comes down to the exporting side of the flow, and the Excel functionality that is supported, including the formatting of cells, the manipulation of Excel objects and similar.

As my focus has been narrowly on supporting the cross-platform exporting of R data frames to Excel files, primarily for the portability of R datasets to non R users, I have stayed away from supporting unrelated functionality, like more extensive cell formatting, graphics, tables, Excel formulae, etc.

The XLConnect package, is one such CRAN package that does support an extended feature set, using Java:

https://cran.r-project.org/web/packages/XLConnect/index.html

There is a cross-over, to some extent, between R and Python in some circles, with some users taking advantage of both environments as may be apropos. There have been comparisons between the R Tidyverse environment and Python Pandas for data management related tasks, and a Google search reveals numerous examples. I don't use the Tidyverse functionality, given that my use of R long predates it and I have developed my own workflows. However, along with RStudio, it is more popular with newer users.

I appreciate your offer of assistance and will certainly reach out in the future!