Open IvanHod opened 5 years ago
You can try compressing the xlsx file with an alternate compression algorithm. We use ZIP_DEFLATED, but ZIP_LZMA might give you a smaller file at the cost of slower compression. If you do find that it gives you better results, maybe we can look into adding it as a tweakable argument to the writer.
You will need to modify the constant here: https://github.com/kz26/PyExcelerate/blob/247406dc41adc7e94542bcbf04589f1e5fdf8c51/pyexcelerate/Writer.py#L45
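To get a rough feel for the trade-off before patching the writer, the compression methods can be compared on a synthetic payload with the standard-library `zipfile` module. This is only a sketch: the `zipped_size` helper, the archive member name, and the repeated-row payload are made up for illustration and are not part of PyExcelerate.

```python
import io
import zipfile


def zipped_size(payload: bytes, method: int) -> int:
    """Return the size in bytes of a one-member zip archive built with `method`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        # Hypothetical member name, chosen to resemble a worksheet part.
        zf.writestr("xl/worksheets/sheet1.xml", payload)
    return len(buf.getvalue())


# Synthetic stand-in for a sheet with many repeated inline strings.
row = b'<row><c t="inlineStr"><is><t>repeated value</t></is></c></row>'
payload = row * 50_000

sizes = {
    "ZIP_STORED": zipped_size(payload, zipfile.ZIP_STORED),
    "ZIP_DEFLATED": zipped_size(payload, zipfile.ZIP_DEFLATED),
    "ZIP_LZMA": zipped_size(payload, zipfile.ZIP_LZMA),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Real sheets compress less uniformly than this payload, so the numbers only indicate the direction of the difference, not what you will see on your file.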
After zipping the same file with ZIP_LZMA, the xlsx file is 36 MB, but compression takes 18 minutes, so it is not a solution.
After zipping the file with ZIP_BZIP2 compression, it is not possible to open the file.
I think the problem is that PyExcelerate writes every string into the sheet as a cell value. Pandas creates a separate file, "sharedStrings.xml", which stores each distinct string only once instead of repeating duplicates. Because of this, the unzipped sheet files are 462 MB for pandas and 795 MB for PyExcelerate.
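The idea behind a shared strings table can be sketched in a few lines: each distinct string is stored once, and cells hold an index into that table. The `build_shared_strings` function below is a hypothetical illustration of the technique, not the code pandas or PyExcelerate actually uses.

```python
from typing import List, Tuple


def build_shared_strings(rows: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Deduplicate cell strings into a table and replace cells with indices."""
    index_of = {}  # maps each distinct string to its position in the table
    indexed_rows = []
    for row in rows:
        # setdefault assigns the next free index the first time a string is seen.
        indexed_rows.append([index_of.setdefault(value, len(index_of)) for value in row])
    # Dicts preserve insertion order, so the keys are already in index order.
    return list(index_of), indexed_rows


rows = [
    ["red", "green", "red"],
    ["green", "blue", "red"],
]
strings, indexed = build_shared_strings(rows)
print(strings)
print(indexed)
```

The dictionary lookup per cell is the extra work the table costs at write time; the saving is that a repeated string appears in the sheet only as a small integer.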
Yes, this is partly a deliberate optimization. When we were implementing it, we found that building that table was nontrivially expensive for large sheets, so we opted to skip it, hoping the extra size would be negligible after zip deflation. I see that's not the case though, so I'll take another look at the shared strings table and see if we can optimize it behind an option or something.
When I created Excel files using PyExcelerate and pandas, the pandas xlsx file was 13 MB smaller (48 MB for PyExcelerate and 35 MB for pandas). Can I reduce the size of the PyExcelerate file? The table had 100,000 rows and 70 columns.