microsoft / python-in-excel

Python in Microsoft Excel
MIT License
490 stars 31 forks source link

How will you save the data of dataframe on Excel? #16

Closed tonyqus closed 10 months ago

tonyqus commented 10 months ago

This is Tony Qu from NPOI team. NPOI is one of the major Office read/write library and it's open source and free to use.

I'm somewhat worrying about this feature as I mentioned in a discussion. I'd like to know how you will save the data of a dataframe on Excel. Will there be a cache of dataframe on local? If so, is there any data size limitation for this case?

jflam commented 10 months ago

Hey Tony and thanks for your work on NPOI!

We do not persist the dataframe into the Excel workbook; it is treated as transient data in the container. Only values explicitly returned to the grid by value are persisted into the workbook, i.e., you could "spill" your dataframe into the workbook subject to existing workbook limits.

tonyqus commented 10 months ago

you could "spill" your dataframe into the workbook subject to existing workbook limits.

Can you explain this in detail? It sounds it's users' choice to spill the dataframe into a sheet. And according to xlsx specification, there is almost no limitation for the row number. (For xls, it's 65535 for each sheet).

Does it mean there will be a potential risk of huge data in Excel? I totally understand python is very strong in big data area since I used to be a big data architect. But frankly speaking, I don't think xlsx file is a good container for big data because of the file format design (xml+zip). I prefer parquet format as the intermediate data transfer format because the format is more easy to read and can somewhat avoid OOM.

You know it's a bit annoying that a few developers are complaining that NPOI (or EPPlus) will be out of memory while they are putting thousounds of rows with at least 10-20 columns into a Excel sheet. It's always a nightmare for any Excel library.

jflam commented 10 months ago

There are two ways to return data to the grid - one is "by reference" which is the default where no data is marshaled to the grid. The other is "by value" where all the values in the object (e.g., DataFrame) are copied to the grid.

We anticipate that most data is returned by reference with only the "end result" being returned by value. I don't think users will typically be dumping giant amounts of data (doesn't mean they won't though ...) into the grid as it's not really useful to manually examine tons of data in the grid.

tonyqus commented 10 months ago

Sounds not too bad. Thank you for the detailed explanation.