mukunku / ParquetViewer

Simple Windows desktop application for viewing & querying Apache Parquet files
GNU General Public License v3.0

[BUG] Cannot open Parquet file with 2 similar column names (different case) #68

MCRE-BE closed this issue 1 year ago

MCRE-BE commented 1 year ago

Parquet Viewer Version: 2.4.2.0

Where was the parquet file created? pyarrow

Sample File: Example.zip

Describe the bug: I believe the bug comes from having two column names that are identical once lowercased. I can open the file in pyarrow/Python, but not in ParquetViewer. [Screenshot 2023-01-27 113819]

Screenshots: [Screenshot 2023-01-27 113152]

Additional context: The similar column names is a bug in my code, but should not make the program crash.

Note: This tool relies on the parquet-dotnet library for all the actual Parquet processing. So any issues where that library cannot process a parquet file will not be addressed by us. Please open a ticket on that library's repo to address such issues.
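To check whether a file is affected before opening it, the column names can be grouped by their lowercased form. The following is a minimal sketch using only the Python standard library; the function name `find_case_collisions` and the sample column names are hypothetical and are not taken from the attached Example.zip.

```python
from collections import defaultdict


def find_case_collisions(column_names):
    """Return groups of column names that are identical once lowercased."""
    groups = defaultdict(list)
    for name in column_names:
        groups[name.lower()].append(name)
    # Only groups holding more than one original name are collisions.
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical column names for illustration only.
columns = ["Id", "Price", "price", "Date"]
print(find_case_collisions(columns))  # prints [['Price', 'price']]
```

With pyarrow, the names could be read from the file's schema, for example via `pyarrow.parquet.read_schema(path).names`, and passed to this function.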

mukunku commented 1 year ago

I gave this a shot, but it turns out DataTables are case-insensitive when it comes to column names. So it's not possible to show two fields whose names differ only by case.

For now I've added logic to gracefully exclude duplicate fields from the output. It's not ideal but at least the utility won't crash when opening such files.
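The idea of excluding duplicates can be sketched as follows. This is not ParquetViewer's actual code (the app is written in C#); it is a Python illustration that assumes the first occurrence of a name wins and later case-insensitive duplicates are skipped and reported. The function name `drop_case_insensitive_duplicates` is hypothetical.

```python
def drop_case_insensitive_duplicates(column_names):
    """Split names into those kept and those skipped as duplicates.

    Assumption: the first column seen for a given lowercased name is kept.
    """
    seen = set()      # lowercased names that are already taken
    kept = []         # columns that would be shown
    skipped = []      # columns excluded from the output
    for name in column_names:
        key = name.lower()
        if key in seen:
            skipped.append(name)
        else:
            seen.add(key)
            kept.append(name)
    return kept, skipped


# Hypothetical column names for illustration only.
kept, skipped = drop_case_insensitive_duplicates(["Id", "Price", "price"])
print(kept)     # prints ['Id', 'Price']
print(skipped)  # prints ['price']
```

Returning the skipped names separately makes it possible to tell the user which fields were left out instead of dropping them silently.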

Give it a shot here if you get the chance: https://github.com/mukunku/ParquetViewer/releases/tag/v2.5.1

I'll leave this ticket open since the original issue hasn't been solved and it should be possible, albeit difficult, to handle case-sensitive field names.

MCRE-BE commented 1 year ago

So the issue is not with you but with the underlying library you are using to parse Parquet files? I can open a bug report there.

I'll test the fix, but indeed it's a workaround...

mukunku commented 1 year ago

@MCRE-BE The issue is with the data structure the app uses to store the data in memory. It doesn't support multiple columns whose names differ only by case because it's built to be case-insensitive.

In your original bug report you mentioned:

"The similar column names is a bug in my code, but should not make the program crash."

Is this a legitimate use case for your workflow, or was it a mistake and you don't normally have column names that differ only in casing?

If this isn't a normal use case, maybe gracefully warning the user about the problem is a sufficient solution here: [image]

MCRE-BE commented 1 year ago

For me it was a mistake. So for me it's a sufficient solution, but might not be for others 🙄 But thanks for the fix 😄

I guess you can't easily rename the columns (for example, by appending a suffix like _x)? That's how pandas solves the issue in its DataFrames.

mukunku commented 1 year ago

Appending a suffix might be the only way to handle these but it's not straightforward. Might not be worth investing time if it's such a rare use-case. Let's see if anyone else needs this kind of support. If demand increases I can take a look.
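For reference, the suffix approach could look roughly like the sketch below. It is written in Python rather than the app's C#, and both the function name `make_unique_case_insensitive` and the numeric `_1`, `_2` suffix format are assumptions made for illustration (the `_x` style mentioned above would work the same way). One reason this is less simple than it looks: a suffixed name can itself collide with an existing column, so each candidate has to be checked again.

```python
def make_unique_case_insensitive(column_names):
    """Rename columns so that no two names are equal when lowercased.

    Assumption: later duplicates get a numeric suffix (_1, _2, ...),
    and the first occurrence keeps its original name.
    """
    used = set()   # lowercased names that are already taken
    result = []
    for name in column_names:
        candidate = name
        counter = 1
        # Keep trying suffixes until the candidate no longer collides,
        # including collisions with previously generated names.
        while candidate.lower() in used:
            candidate = f"{name}_{counter}"
            counter += 1
        used.add(candidate.lower())
        result.append(candidate)
    return result


# Hypothetical column names for illustration only.
print(make_unique_case_insensitive(["Price", "price", "PRICE", "price_1"]))
# prints ['Price', 'price_1', 'PRICE_2', 'price_1_1']
```

A real implementation would also need to remember the mapping back to the original field names so that the data can still be read from the file.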