Open buhtz opened 2 months ago
Investigations are welcome, but I think you will have to dig into this yourself. We focus on other file formats like Parquet these days and would generally recommend those to users.
I wouldn't ask that question if I hadn't tried to dig into the pandas code myself before.
I just wanted to warn you that this is probably not a priority for us
Thank you for clarifying that, Patrick. As a maintainer I have always had good experiences with communicating such things (the priorities of the project and the maintainers) unsolicited and transparently. This eases communication and increases empathy on both sides.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The following code has 4 functions. One of them creates a sample data frame and zip-pickles it to the current working directory. The other three are different variants of unpickling that file again.
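A minimal sketch of what such a script could look like, not the exact code from this report. The three unpickle function names are the ones used in this issue. `create_zip_pickle()`, the file name `sample.pickle.zip` and the contents of the data frame are assumptions made for illustration only.

```python
import io
import zipfile

import pandas as pd

# Hypothetical file name; the path used in the original script is not known.
ZIP_PATH = "sample.pickle.zip"


def create_zip_pickle(path=ZIP_PATH, rows=1000):
    """Create a sample data frame and zip-pickle it (name and shape are assumptions)."""
    df = pd.DataFrame({"a": range(rows), "b": [str(i) for i in range(rows)]})
    # pandas infers zip compression from the ".zip" suffix of the path.
    df.to_pickle(path)
    return df


def unpickle_via_pandas(path=ZIP_PATH):
    """Let pandas open and decompress the zip file itself."""
    return pd.read_pickle(path)


def unpickle_zip_filehandle(path=ZIP_PATH):
    """Open the archive member and hand its file handle to pandas."""
    with zipfile.ZipFile(path) as zf:
        with zf.open(zf.namelist()[0]) as handle:
            return pd.read_pickle(handle)


def unpickle_from_memory(path=ZIP_PATH):
    """Decompress the whole archive member into memory, then unpickle from the buffer."""
    with zipfile.ZipFile(path) as zf:
        buffer = io.BytesIO(zf.read(zf.namelist()[0]))
    return pd.read_pickle(buffer)
```

To compare the variants, each function can be wrapped with `time.perf_counter()` calls; the timing difference will depend on the size and content of the data frame.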
Installed Versions
Prior Performance
I am assuming this is not a bug because pandas is "old" in its best meaning and well developed. There must be a good reason for this behavior.
Unpickling a data frame from a zip file is very slow (1 min 51 sec in my example) compared to unzipping the pickle file into memory using an `io.BytesIO()` object and passing that to `pandas.read_pickle()` (6 sec in my example).
In the example code below, the function `unpickle_from_memory()` demonstrates the fast way. The slower ones are `unpickle_via_pandas()` and `unpickle_zip_filehandle()`. The latter might be an example of how pandas works internally with that zip file.
Here is the output from the script:
My question is: why is it that way? Wouldn't pandas be faster and more efficient if it used the method demonstrated in `unpickle_from_memory()`?