Closed · mucmch closed this 1 year ago
Hi there,
Parquet files use columnar storage, and ParquetViewer converts the data into a row-based structure, so performance will inevitably degrade as the number of columns grows. I don't think there's any way to avoid that.
A few things you can try:
For other kinds of optimizations I can take a look, but could you share a sample file or two so there's something to work with?
Hey @mucmch, any chance you could share a sample file? I could take a look and see whether there's any way to increase the performance further.
Hi mukunku,
thanks for your feedback and the follow-up. Sorry, I have been quite busy. Please find a sample parquet file attached.
It has 100 rows and 5000 columns (7'391 KB). In Python, I can read it in under 0.36 s. In ParquetViewer, the analysis stage took 10 s or even froze my system, and the loading stage took another 2 minutes. Memory usage stayed low, but a single CPU core ran at 100%.
Since I only wanted a quick preview in Windows by clicking on a parquet file, I have now found another efficient solution for my case.
1) A simple Python script that reads the parquet file and prints the essentials:

import pandas as pd
import sys

file = sys.argv[1]          # path to the parquet file, passed by the file association
df = pd.read_parquet(file)
print(df)                   # data preview
df.info()                   # column dtypes and memory usage (info() prints itself; wrapping it in print() would add a stray "None")
input()                     # keep the console window open
2) A simple file association in the cmd prompt (as admin):
assoc .parquet=parquetfile
ftype parquetfile="<pathtoyourpython>\python.exe" "<pathtoyourscriptfile>\parquet_win_preview.py" "%1" "%2" %*
@mucmch Can you try the latest v2.6.0 beta release please? https://github.com/mukunku/ParquetViewer/releases/tag/v2.6.0.1
I tried loading that test file with this version and it loads the entire file in less than 5 seconds. I can see this by hovering my cursor over the "Loaded:" text in the bottom right corner.
I see comparable performance with the new version on my side. I also tested some other files, with decent results for most files < 100 MB. Thank you a lot for your effort @mukunku!
But I still should not push it too far... e.g. a highly compressed 30 MB parquet file that expands to 3 GB when fully loaded (an 8'650 x 46'250 table) was still loading after 20 min...
Thanks for testing out the latest release @mucmch . Glad it's working at least a little bit better.
Any chance you could share a file like you described? It's really hard for me to optimize without having a sample file.
Please find a test file attached. Dimensions [10000 x 40000]; 3.25 s reading time in Python, vs. ParquetViewer taking 20 s to load the fields and 5+ min to load the data (I did not wait for it to finish). Test2.zip
So unfortunately, getting the application to support data with tens of thousands of columns isn't going to be possible, like the sample file you shared with 40k columns. I tried to see if I could somehow get it to work, but WinForms simply can't handle that many columns.
I did add some other improvements to handle files like these more gracefully, so I'm going to mark this ticket as "Won't fix" and close it out. I will detail the changes I made below, along with my rationale for why I think this is good enough:
Parquet Viewer Version 2.5.1.0
Where was the parquet file created? pyarrow
Describe the bug: I am working with parquet files with many columns (e.g. 30'000). There is seemingly no efficient way to limit the columns; a "deselect all" option would be very helpful. Even for small parquet files with many columns (e.g. 3 MB, 239 rows, 3643 columns), the data load takes a long time (3-5 min, massive CPU usage, low memory consumption). For larger files (e.g. 12 MB, 239 rows, about 10'000 columns), the whole system already freezes during the file-analysis stage.
Is there any way to work with files with many columns?