pyexcel / pyexcel-xlsx

A wrapper library to read, manipulate and write data in xlsx and xlsm format using openpyxl
Other
114 stars 31 forks source link

file descriptors are not closed #14

Closed benoit-pierre closed 7 years ago

benoit-pierre commented 7 years ago

For example, when running the following code on Linux:

import os

import pyexcel

pyexcel.get_book_dict(file_name='test.xlsx')

fd_dir = '/proc/%u/fd' % os.getpid()
for fd_name in os.listdir(fd_dir):
    print(fd_name, '-> ', end='')
    try:
        print(os.readlink('%s/%s' % (fd_dir, fd_name)))
    except FileNotFoundError:
        print()

The last file descriptor open point to test.xlsx.

Beside the file descriptor leak, this is really problematic in Windows, as it make it impossible to concurrently modify the spreadsheet in Office when it has been read in another (still running) application.

chfw commented 7 years ago

the close() function should be called for read-only and write-only mode, which were used in this library. will run your code to test a fix.

benoit-pierre commented 7 years ago

I've been trying this, but it's not enough:

 pyexcel_xlsx/xlsx.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git i/pyexcel_xlsx/xlsx.py w/pyexcel_xlsx/xlsx.py
index ed01c1b..bf8a036 100644
--- i/pyexcel_xlsx/xlsx.py
+++ w/pyexcel_xlsx/xlsx.py
@@ -101,7 +101,7 @@ class XLSXBook(BookReader):

     def read_sheet(self, native_sheet):
         sheet = XLSXSheet(native_sheet, **self._keywords)
-        return {sheet.name: sheet.to_array()}
+        return {sheet.name: list(sheet.to_array())}

     def _load_the_excel_file(self, file_alike_object):
         self._native_book = openpyxl.load_workbook(
@@ -111,6 +111,9 @@ class XLSXBook(BookReader):
         self.skip_hidden_sheets = self._keywords.get(
             'skip_hidden_sheets', True)

+    def close(self):
+        self._native_book.close()
+        self._native_book = None

 class XLSXSheetWriter(SheetWriter):
     """

There's an issue in openpyxl, it's closing its ZipFile archive, but internally, ZipFile tracks reference counts, and the descriptor is not closed because there's still at least one reference.

benoit-pierre commented 7 years ago

This additional patch in openpyxl is necessary:

 openpyxl/worksheet/read_only.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git i/openpyxl/worksheet/read_only.py w/openpyxl/worksheet/read_only.py
index fd533569..93438328 100644
--- i/openpyxl/worksheet/read_only.py
+++ w/openpyxl/worksheet/read_only.py
@@ -89,7 +89,8 @@ class ReadOnlyWorksheet(object):
     def xml_source(self):
         """Parse xml source on demand, default to Excel archive"""
         if self._xml is None:
-            return self.parent._archive.open(self.worksheet_path)
+            from six import BytesIO
+            return BytesIO(self.parent._archive.read(self.worksheet_path))
         return self._xml

Now to see if I clean it up to make a proper PR...

chfw commented 7 years ago

The fix seems to be complex. My thought was to allow the developer to consume the data on demand so as to avoid reading the data into memory on behalf of the developer. "streaming=True" would enable on-demand feature. For this feature, the file handle should be kept open and eventually will be closed by garbage collector.

As in your changes, you have noticed that close method alone would cause a RuntimeError: Attempt to read ZIP archive that was already closed. So to cater the need to close the file handle manually, I would have to think of some alternatives.

benoit-pierre commented 7 years ago

I certainly don't expect pyexcel.get_book_dict to only read on demand, as opposed to something like iget_records. I don't think letting the garbage collector close file descriptors is a good idea, they should be closed right after I'm done with consuming whatever data it is I'm asking pyexcel for, either explicitly through a call to close on the context I'm using, or better yet with should be supported support so it's closed at the end of the block.

chfw commented 7 years ago

yes, you had your point there. the fix would be make those file handle closure explicit. with get* functions, auto closure is expected whereas iget*, the closure will be left to the developer after the generator has been consumed.

benoit-pierre commented 7 years ago

OK, for now I'll be using a patched version of pyexcel-xlsx (with the above patch) and of openpyxl (with the patch mentioned here).

chfw commented 7 years ago

I will put this fix in 0.4.0 branch, next major release. In this way, I could focus the effort on one branch instead of supporting two branches.

chfw commented 7 years ago

Inconsistent behaviour was found in this test run:

. file handle from iget_data was closed in python 3.6, pypy only but not with other python versions

. file handle from get_data was left open still in pypy .

openpyxl version is v2.4.8.