mukunku / ParquetViewer

Simple Windows desktop application for viewing & querying Apache Parquet files
GNU General Public License v3.0

[BUG] Doesn't work with multiple row groups #27

Closed: felipepessoto closed this issue 3 years ago

felipepessoto commented 3 years ago

Parquet Viewer Version: 2.2

Where was the parquet file created? C#

Sample File: Test.zip

Describe the bug: In the UtilityMethods class, this line contains a bug after the first call:

[screenshot of the line in question; image not preserved]

For example, suppose the file has two row groups with 2 rows each. On the first call the check evaluates as if (rowIndex=0 >= readRecords=2), which is OK. On the next call it evaluates as if (rowIndex=2 >= readRecords=2), and the loop breaks. The check only passes when the second row group is bigger than the first, and even then the code is buggy, since it skips rows.
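The original line is only visible in the screenshot, so the following is a hypothetical reconstruction of the behavior described above, written in Python rather than the project's C#. The function names and the list-of-lists stand-in for row groups are assumptions made for illustration; this is not the actual ParquetViewer or parquet-dotnet code. It shows how comparing a running, file-wide row index against the size of a single row group stops early or skips rows, and one possible way to avoid that.

```python
def read_rows_buggy(row_groups):
    """Hypothetical reconstruction of the bug: row_index counts rows across
    the whole file, but it is compared with (and used to index into) the
    size of the current row group only."""
    rows = []
    row_index = 0  # running total across ALL row groups
    for group in row_groups:
        read_records = len(group)  # rows in THIS row group only
        if row_index >= read_records:
            break  # second group of equal size: 2 >= 2, so reading stops
        for i in range(row_index, read_records):
            rows.append(group[i])  # starts mid-group, skipping earlier rows
            row_index += 1
    return rows


def read_rows_fixed(row_groups, offset=0, count=None):
    """One possible fix: iterate every row of every group and apply the
    offset and row-count limit against the file-wide index."""
    rows = []
    global_index = 0
    for group in row_groups:
        for row in group:
            if count is not None and len(rows) >= count:
                return rows  # row-count limit holds across row groups
            if global_index >= offset:
                rows.append(row)
            global_index += 1
    return rows


equal_groups = [["a", "b"], ["c", "d"]]
bigger_second = [["a", "b"], ["c", "d", "e"]]

print(read_rows_buggy(equal_groups))   # stops after the first group
print(read_rows_buggy(bigger_second))  # reads on, but skips "c" and "d"
print(read_rows_fixed(bigger_second))  # reads every row
```

The fixed variant also takes an offset and a count so that the limit applies to the file as a whole instead of restarting with each row group.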

After fixing this issue, I also found another problem, where the row count is not respected after the first row group:

[screenshot; image not preserved]

Screenshots: [image not preserved]

Note: This tool relies on the parquet-dotnet library for all the actual Parquet processing. So any issues where that library cannot process a parquet file will not be addressed by us. Please open a ticket on that library's repo to address such issues.

mukunku commented 3 years ago

Thanks for the contribution, @felipepessoto. I've merged it to v2.3. Sorry it took so long, but it's hard to make time.

felipepessoto commented 3 years ago

Thanks, @mukunku. Will you cherry-pick it to main?

mukunku commented 3 years ago

There's one issue with your proposed changes: they don't calculate the total number of records in the file correctly. While working on it, I realized the Thrift metadata already has the record count in it, so I made some additional changes to utilize that.
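To illustrate the idea, here is a minimal Python sketch. It assumes only what the Parquet format defines: the file-level Thrift metadata carries a num_rows field, and each row group's metadata carries its own num_rows. The FileMeta and RowGroupMeta classes below are hypothetical stand-ins for those structures, not the parquet-dotnet API, and this is not the change that was actually merged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupMeta:
    """Hypothetical stand-in for a row group's metadata entry."""
    num_rows: int


@dataclass
class FileMeta:
    """Hypothetical stand-in for the file-level Thrift metadata."""
    num_rows: int
    row_groups: List[RowGroupMeta]


def total_rows(meta: FileMeta) -> int:
    # The file-level count is available without reading any row data.
    return meta.num_rows


def total_rows_from_groups(meta: FileMeta) -> int:
    # Cross-check: summing the per-group counts should give the same total.
    return sum(group.num_rows for group in meta.row_groups)


meta = FileMeta(num_rows=4, row_groups=[RowGroupMeta(2), RowGroupMeta(2)])
print(total_rows(meta), total_rows_from_groups(meta))
```

Reading the count from metadata avoids counting rows while iterating, which is where a per-row-group counter can go wrong.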

It's possible to cherry-pick both commits to master, but I want to make sure my change won't also break something, hence the beta release. After a few weeks I'll merge all of it to master.

Also, if you're working with large files, give the new multi-threaded engine a try. I'm interested in whether it's stable, because I saw a significant performance increase for large files (hundreds of columns and millions of rows).