twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Alternative implementation to DeprecatedParquetInputFormat with fix #1937

Closed tlazaro closed 3 years ago

tlazaro commented 3 years ago

When combining N Parquet files, the first record of files 2 to N gets skipped while the last record from the previous file is returned instead. This means losing some records while others get duplicated, quite bad.

This was fixed a month ago in https://github.com/apache/parquet-mr/pull/844 but we would need to update the dependencies.

Should we do this approach or work towards updating deps?

tlazaro commented 3 years ago

This fix is not available in 1.11.0 and I guess would be released in 1.12.0 eventually which was just been crated as RC https://github.com/apache/parquet-mr/releases

johnynek commented 3 years ago

Wow. Amazing such a bug could live so long on just not be found.

It seems at some stage code was moved or copied to this repo. To be clear: do we need to get the published 1.12 to get the full fix?

johnynek commented 3 years ago

Would it be too hard to add a test that shows this bug and verify that this change fixes it?

tlazaro commented 3 years ago

Wow. Amazing such a bug could live so long on just not be found.

We had a PR up with a fix in 2018 but never followed through, just picked it up last week and now noticing it was fixed... :(

This PR is not needed if they release 1.12.0 and we update version. BTW we did not copy code over to this repo, I copied the class and replaced usage to avoid the update.

Regarding tests, their PR has tests, I could bring them over.

This PR is dirty though, it's the wrong way of fixing it but would get it quickly... we may just do it to a twitter branch and not develop?

johnynek commented 3 years ago

I think this would hit anyone using scalding so it is kinder to do it here.

Can you copy the test case over and add an issue to revert when a fix is released in upstream?

tlazaro commented 3 years ago

I updated ScaldingDeprecatedParquetInputFormat to match exactly what is on apache-parquet repository, (I had previously started from a 1.9.0 version). File hasn't changed since 2 years ago, except for this very recent fix.

Brought over the unit tests to validate this class and manually verified that with the removal of the one liner with the fix the unit test would fail.