mukunku / ParquetViewer

Simple Windows desktop application for viewing & querying Apache Parquet files
GNU General Public License v3.0

[BUG] Unable to display data exported from Oracle database #81

Closed dvesic closed 10 months ago

dvesic commented 1 year ago

Parquet Viewer Version: 2.7.1.0

Where was the parquet file created? Python, using the pandas and fastparquet libraries

Sample File Sample file attached.

Example.zip

Describe the bug Try to open the file. If you select only the first column, it will open fine. If you select all columns, the second one will cause a problem and no data will be displayed.

Screenshots Attached screenshot.

[Image: parq-viewer-2 7 1 0-bug-screenshot]

Additional context Original column definitions from the Oracle database:

Limit type              NOT NULL  VARCHAR2(20)
Limit period in days              NUMBER

mukunku commented 10 months ago

It looks like the parquet-dotnet library we use doesn't support your file.

The quickest way to get it resolved would be to open a ticket in their repo: https://github.com/aloneguid/parquet-dotnet/issues

I'll see if I can take a look and fix it on my end but can't promise anything on timing.

mukunku commented 10 months ago

@dvesic So I figured out why your file can't be opened. It appears to me to be malformed; there's an extra byte of data in the data page for some reason that's throwing off the parquet-dotnet library.

I created a release with a patch that you can download here: ParquetViewer_PR81_v0.zip

I built this release from a fork I made of the parquet-dotnet library. However, I don't think this solution is correct, assuming the file really is malformed, so I won't be adding a fix for this to any of the main releases.

You can use the patched ParquetViewer I shared above for your own files and hopefully this bug will get fixed in future versions of Oracle or parquet-dotnet.

Also, thanks for sharing your bug and a test file.

cc: @aloneguid

mukunku commented 10 months ago

I'm going to close out this issue for now. But please feel free to re-open if you want to discuss further.

dvesic commented 9 months ago

Thank you very much for the patch - I appreciate it.

mukunku commented 9 months ago

Hey @dvesic,

I came across this bug: https://github.com/dask/fastparquet/issues/855

I noticed the file you shared had this in its metadata:

   "CreatedBy": "fastparquet-python version 2023.4.0 (build 0)",

They seem to have fixed an issue with string byte array sizes which is very similar to the behavior I was observing when reviewing your file.

I wonder if your Oracle export setup can be updated to use fastparquet version 2023.8.0 instead of 2023.4.0. If you get the chance to test it with the newer version, please let me know whether the issue is fixed when using the latest regular release.
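
For anyone checking their own files, the comparison described above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_fastparquet_version` and `predates_suggested_version` are made up for this example, the `CreatedBy` string is assumed to follow the format quoted above, and 2023.8.0 is used as the threshold only because it is the version suggested here, not because it is confirmed to be the first release with the fix.

```python
import re

# Assumption: 2023.8.0 is the version suggested in this thread; the exact
# release that contains the fastparquet fix is not confirmed here.
SUGGESTED_VERSION = (2023, 8, 0)


def parse_fastparquet_version(created_by):
    """Return the version as a tuple of ints, or None if the string
    does not look like it was written by fastparquet."""
    match = re.search(r"fastparquet-python version (\d+)\.(\d+)\.(\d+)", created_by)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def predates_suggested_version(created_by):
    """True if the file was written by a fastparquet older than SUGGESTED_VERSION."""
    version = parse_fastparquet_version(created_by)
    return version is not None and version < SUGGESTED_VERSION


# The CreatedBy value taken from the sample file's metadata.
created_by = "fastparquet-python version 2023.4.0 (build 0)"
print(parse_fastparquet_version(created_by))
print(predates_suggested_version(created_by))
```

A file whose `CreatedBy` string comes from another writer makes `parse_fastparquet_version` return `None`, so it is never flagged.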