mukunku / ParquetViewer

Simple Windows desktop application for viewing & querying Apache Parquet files
GNU General Public License v3.0

[BUG] Handling Files with many columns #69

Closed: mucmch closed this issue 1 year ago

mucmch commented 1 year ago

Parquet Viewer Version 2.5.1.0

Where was the parquet file created? pyarrow

Describe the bug I am working with parquet files with many columns (e.g. 30'000). There is seemingly no efficient way to limit the columns - a deselect-all option would be very helpful. Even for small parquet files with many columns (e.g. 3 MB, 239 rows, 3'643 columns), loading the data takes a long time (3-5 minutes, heavy CPU usage, low memory consumption). For larger files (e.g. 12 MB, 239 rows, about 10'000 columns), the whole system freezes during the file analysis stage.

Is there any way to work with files with many columns?

mukunku commented 1 year ago

Hi there,

Because Parquet files use columnar storage and we're converting the data into a row-based structure, the reality is that performance is going to get worse the more columns there are. I don't think there's any way to avoid that.
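
As a rough illustration (outside ParquetViewer, and with a hypothetical filename): the columnar layout means a Python reader can pull just the columns it needs, so the cost scales with the selected columns rather than the total column count:

```python
import pyarrow.parquet as pq

# Read the schema alone; no column data is loaded yet
schema = pq.read_schema("wide.parquet")
print(f"{len(schema.names)} columns total")

# Only the selected column chunks are read and decoded
table = pq.read_table("wide.parquet", columns=schema.names[:10])
print(table.to_pandas().head())
```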

A few things you can try:

I can take a look at other kinds of optimizations, but could you share a sample file or two so there's something to work with?

mukunku commented 1 year ago

Hey @mucmch, any chance you could share a sample file? I could take a look to see if there's any way to improve performance further.

mucmch commented 1 year ago

Hi mukunku,

thanks for your feedback and the follow-up. Sorry, I have been quite busy. Please find a sample parquet file attached.

It has 100 rows and 5'000 columns, 7'391 KB. In Python, I can read it in under 0.36 seconds. In ParquetViewer, the analyzing stage took 10 seconds or even caused my system to freeze, and the loading stage took another 2 minutes. Memory usage was low, but a single CPU core was at 100%.
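
For reference, a timing like that can be reproduced with a few lines of pandas (the filename here is a hypothetical stand-in for the attached sample):

```python
import time
import pandas as pd

start = time.perf_counter()
df = pd.read_parquet("test.parquet")  # hypothetical name for the attached file
elapsed = time.perf_counter() - start
print(f"Read {df.shape[0]} rows x {df.shape[1]} columns in {elapsed:.2f}s")
```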

Since I just wanted a quick file preview in Windows when clicking on a parquet file, I have now found another efficient solution for my case.

1) A simple python script to read parquet and print essentials:

```python
import pandas as pd
import sys

file = sys.argv[1]
df = pd.read_parquet(file)
print(df)
df.info()  # prints its summary itself; wrapping it in print() would also emit "None"
input()    # keep the console window open until Enter is pressed
```

2) A simple file association in the cmd prompt (as admin):

```bat
assoc .parquet=parquetfile
ftype parquetfile="<pathtoyourpython>\python.exe" "<pathtoyourscriptfile>\parquet_win_preview.py" "%1" "%2" %*
```
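
(In case the two commands are unfamiliar: assoc maps the .parquet extension to a file-type name, and ftype maps that name to the command Windows runs when such a file is opened. The input() at the end of the script is what keeps the console window open so the output stays readable.)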

Test.zip

mukunku commented 1 year ago

@mucmch Can you try the latest v2.6.0 beta release please? https://github.com/mukunku/ParquetViewer/releases/tag/v2.6.0.1

I tried loading that test file with this version and it loads the entire file in less than 5 seconds. I can see this by hovering my cursor over the "Loaded:" text in the bottom right corner: [screenshot]

mucmch commented 1 year ago

I see comparable performance with the new version on my side. I also tested some other files, which yield decent results for most files under 100 MB. Thank you a lot for your effort @mukunku!

But I still shouldn't stretch it too much... e.g. a highly compressed 30 MB parquet file that expands to 3 GB when fully loaded (an 8'650 x 46'250 table) is still loading after 20 minutes...

mukunku commented 1 year ago

Thanks for testing out the latest release @mucmch. Glad it's working at least a little bit better.

Any chance you could share a file like you described? It's really hard for me to optimize without having a sample file.

mucmch commented 1 year ago

Please find a test file attached. Dimensions [10000 x 40000]; reading time in Python: 3.25 seconds; ParquetViewer: 20 seconds to load the fields, 5+ minutes to load the data (I did not wait for it to finish).

Test2.zip

mukunku commented 1 year ago

So unfortunately, getting the application to support data with tens of thousands of columns isn't going to be possible, like the sample file you shared with 40k columns. I tried to see if I could somehow get it to work, but WinForms simply can't handle it.

I did add some other improvements to handle files like these more gracefully, so I'm going to mark this ticket as Won't Fix and close it out. Here are the changes I made, along with my rationale for why I think this is good enough:

  1. I don't think a lot of people are going to be using files with tens of thousands of columns.
  2. Considering Parquet is columnar storage, I think it's reasonable for performance to degrade when accessing data row by row.
  3. I automated the ParquetEngine setting in v2.6.0 so that files with many columns automatically utilize the multi-threaded engine (for intuition, see the sketch after this list).
  4. The field selection dialog as of v2.7.0 no longer crashes/hangs when opening files with a lot of columns. You won't be able to filter the columns, unfortunately, but at least the UI stays responsive.
  5. As of v2.6.0 the loading screen shows a progress bar, giving users some idea of when a file load will complete.
  6. As of v2.7.0 the metadata viewer no longer hangs for a long time when loading files with many columns.
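
(ParquetViewer itself is C#, so this is only an analogy for point 3, but the same idea, parallelizing the read across column chunks, is a one-flag option in pyarrow; filename hypothetical:)

```python
import pyarrow.parquet as pq

# use_threads=True (pyarrow's default) decodes column chunks in parallel,
# which is what makes very wide files tolerable to read at all
table = pq.read_table("wide_file.parquet", use_threads=True)
print(table.num_rows, "rows,", table.num_columns, "columns")
```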