twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

partial port of PARQUET-1191 - handle thrift binary correctly. #1872

Closed thomaschow closed 5 years ago

thomaschow commented 6 years ago

This ports over part of https://github.com/apache/parquet-mr/compare/master...nandorKollar:PARQUET-1191 where a check is added to ScroogeStructConverter when mapping to Thrift String types whether the field is a string or binary (byte array). This allows the parquet-thrift ThriftSchemaConverter to generate the correct parquet schema (distinguishing between Strings, represented as UTF-8 annotated BINARY https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string, or just plain BINARY) https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java#L332 NOTE: this requires a version bump of parquet

r? @ianoc @johnynek

CLAassistant commented 6 years ago

CLA assistant check
All committers have signed the CLA.

johnynek commented 6 years ago

Thanks for the PR!

There is a test that fails that looks related to this: https://travis-ci.org/twitter/scalding/jobs/428467866#L9274

It may be that the test needs to change, but can you take a look?

thomaschow commented 6 years ago

sure, @johnynek do you have any thoughts on whether this is the correct approach? Why have we ported over the ScroogeStructConverter at all?

johnynek commented 6 years ago

This seems fine. I think the original PR updates the test to add the UTF annotation.

I think we should do that here and we should be okay.

thomaschow commented 6 years ago

@johnynek I'm getting some incompatibilities when generating thrift, something like: [error] /Users/tchow/personal/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_scala/test/TestUnion.scala:226: overriding method containedValue in trait ThriftUnion of type ()SecondMap.this.ContainedType; [error] method containedValue has incompatible type [error] def containedValue(): SecondMapAlias = second_map

This is unrelated to my changes, looks like just need to figure out the correct library versions. Can you advise on how to resolve this?

johnynek commented 6 years ago

Can you find the minimal version of scrooge that can pass the tests and work here?

also @ttim can you give any comments on this bug and if twitter is still using this exact code (or do you have a copy of it internally)?

dieu commented 6 years ago

@thomaschow fix for scalding-parquet module, will dig more tomorrow.

diff --git a/build.sbt b/build.sbt
index cd5d8cb..34881af 100644
--- a/build.sbt
+++ b/build.sbt
@@ -407,7 +407,7 @@ lazy val scaldingParquet = module("parquet").settings(
       exclude("com.twitter.elephantbird", "elephant-bird-pig")
       exclude("com.twitter.elephantbird", "elephant-bird-core"),
     "org.scala-lang" % "scala-compiler" % scalaVersion.value,
-    "org.apache.thrift" % "libthrift" % "0.7.0",
+    "org.apache.thrift" % "libthrift" % thriftVersion,
     "org.slf4j" % "slf4j-api" % slf4jVersion,
     "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided",
     "org.scala-lang" % "scala-reflect" % scalaVersion.value,
diff --git a/scalding-parquet/src/test/scala/com/twitter/scalding/parquet/ParquetSourcesTests.scala b/scalding-parquet/src/test/scala/com/twitter/scalding/parquet/ParquetSourcesTests.scala
index 82689a3..db74f2d 100644
--- a/scalding-parquet/src/test/scala/com/twitter/scalding/parquet/ParquetSourcesTests.scala
+++ b/scalding-parquet/src/test/scala/com/twitter/scalding/parquet/ParquetSourcesTests.scala
@@ -169,7 +169,7 @@ class MockTBase extends TBase[MockTBase, TFieldIdEnum] {
   override def isSet(p1: TFieldIdEnum): Boolean = false
   override def getFieldValue(p1: TFieldIdEnum): AnyRef = null
   override def setFieldValue(p1: TFieldIdEnum, p2: scala.Any): Unit = ()
-  override def deepCopy(): TBase[MockTBase, TFieldIdEnum] = null
+  override def deepCopy(): MockTBase = null
   override def clear(): Unit = ()
   override def compareTo(o: MockTBase): Int = 0
 }
dieu commented 6 years ago
diff --git a/project/plugins.sbt b/project/plugins.sbt
index 5131951..4c535e9 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -11,7 +11,7 @@ addSbtPlugin("com.eed3si9n"       % "sbt-unidoc"          % "0.3.3")
 addSbtPlugin("com.fortysevendeg"  % "sbt-microsites"      % "0.3.3")
 addSbtPlugin("com.github.gseitz"  % "sbt-release"         % "1.0.0")
 addSbtPlugin("com.jsuereth"       % "sbt-pgp"             % "1.0.0")
-addSbtPlugin("com.twitter"        %% "scrooge-sbt-plugin" % "4.14.0")
+addSbtPlugin("com.twitter"        %% "scrooge-sbt-plugin" % "18.9.0")
 addSbtPlugin("com.typesafe"       % "sbt-mima-plugin"     % "0.1.14")
 addSbtPlugin("com.typesafe.sbt"   % "sbt-ghpages"         % "0.5.4")
 addSbtPlugin("com.typesafe.sbt"   % "sbt-git"             % "0.6.2")

so, also @thomaschow you need upgrade scrooge pluing.

oscar-stripe commented 5 years ago

looks like it doesn't compile, does this compile locally for you @tchow-stripe ?


[warn] 1248 warnings found
[warn] bootstrap class path not set in conjunction with -source 1.6
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/UnionV2.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.UnionV2 is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class UnionV2 extends TUnion<UnionV2, UnionV2._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/UnionV2.java:227: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/UnionV2.java:274: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/UnionV2.java:297: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/UnionV2.java:302: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/TestUnion.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.TestUnion is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class TestUnion extends TUnion<TestUnion, TestUnion._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/TestUnion.java:207: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/TestUnion.java:243: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/TestUnion.java:261: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/TestUnion.java:266: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionOfStructs.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.compat.UnionOfStructs is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class UnionOfStructs extends TUnion<UnionOfStructs, UnionOfStructs._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionOfStructs.java:227: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionOfStructs.java:274: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionOfStructs.java:297: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionOfStructs.java:302: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedNestedUnion.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.compat.NestedNestedUnion is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class NestedNestedUnion extends TUnion<NestedNestedUnion, NestedNestedUnion._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedNestedUnion.java:207: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedNestedUnion.java:243: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedNestedUnion.java:261: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedNestedUnion.java:266: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedUnion.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.compat.NestedUnion is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class NestedUnion extends TUnion<NestedUnion, NestedUnion._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedUnion.java:227: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedUnion.java:274: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedUnion.java:297: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/NestedUnion.java:302: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV2.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.compat.UnionV2 is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class UnionV2 extends TUnion<UnionV2, UnionV2._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV2.java:227: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV2.java:274: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV2.java:297: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV2.java:302: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionStructUnion.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.compat.UnionStructUnion is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class UnionStructUnion extends TUnion<UnionStructUnion, UnionStructUnion._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionStructUnion.java:227: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionStructUnion.java:274: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionStructUnion.java:297: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionStructUnion.java:302: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV1.java:29: com.twitter.scalding.parquet.scrooge.thrift_java.test.compat.UnionV1 is not abstract and does not override abstract method writeValue(org.apache.thrift.protocol.TProtocol) in org.apache.thrift.TUnion
[error] public class UnionV1 extends TUnion<UnionV1, UnionV1._Fields> {
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV1.java:207: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV1.java:243: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV1.java:261: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] /home/travis/build/twitter/scalding/scalding-parquet-scrooge-fixtures/target/scala-2.11/src_managed/test/thrift/com/twitter/scalding/parquet/scrooge/thrift_java/test/compat/UnionV1.java:266: method does not override or implement a method from a supertype
[error]   @java.lang.Override
[error] (scalding-parquet-scrooge-fixtures/test:compileIncremental) javac returned nonzero exit code
[error] Total time: 95 s, completed Nov 5, 2018 6:58:01 PM``
oscar-stripe commented 5 years ago

cc @thomaschow

johnynek commented 5 years ago

@thomaschow do you want to work on this before we publish a new version which is about to happen?

tchow-stripe commented 5 years ago

sure, let me try to make some progress on it today and I"ll have an update by tomorrow.

Thoams On Tue, Dec 4, 2018 at 12:34 PM P. Oscar Boykin notifications@github.com wrote:

@thomaschow https://github.com/thomaschow do you want to work on this before we publish a new version which is about to happen?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/twitter/scalding/pull/1872#issuecomment-444247954, or mute the thread https://github.com/notifications/unsubscribe-auth/AYI1j43sUnsngOF7tUZ7jiSih1b8ry89ks5u1txWgaJpZM4WonZL .

On Tue, Dec 4, 2018 at 12:34 PM P. Oscar Boykin notifications@github.com wrote:

@thomaschow https://github.com/thomaschow do you want to work on this before we publish a new version which is about to happen?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/twitter/scalding/pull/1872#issuecomment-444247954, or mute the thread https://github.com/notifications/unsubscribe-auth/AYI1j43sUnsngOF7tUZ7jiSih1b8ry89ks5u1txWgaJpZM4WonZL .

thomaschow commented 5 years ago

@johnynek is there a way to re-run a specific subuild of the ci?

thomaschow commented 5 years ago

I'd like some help with understanding the code gen. The problem here is that whatever's doing the code gen (I believe scrooge) doesn't support binary types. If someone could advise on how to debug I can adjust the unit tests appropriately.

thomaschow commented 5 years ago

will time out by eod cc @dieu @ttim

johnynek commented 5 years ago

@thomaschow scrooge support binary, I'm pretty sure. ByteBuffer is the type it generates, I think.

In any case, what specifically is the question? The CI is green.

thomaschow commented 5 years ago

the StringAndBinary that gets generated doesn't have the FieldMetadata set correctly (ie binary is set to false). I've bypassed scrooge and instead used the inline ScroogeStructConverter.

ianoc commented 5 years ago

Do you have something that fails or shows the issue? its possible the field metadata is broken, afaik the biggest user of it was parquet interactions... at least 3-4 years ago i believe. But at the same time I'd like to see something failing if we could? The scrooge struct converter i'm sure uses the same scrooge fields no?

thomaschow commented 5 years ago

yeah so the unit test change I made is a hack -- it's effectively a noop comparision. To fix it properly, the StringAndBinary class needs to be autogen'ed properly with the binary boolean set. If I revert the unit test this will break for that reason, and it doesn't seem addressable in this repo

ianoc commented 5 years ago

which is broken? I don't think we can commit it with the test actively broken in this way -- commenting them out would be better. This makes them sort of look like real tests but they do nothing.

if its just one type is broken can we just disable that test?

thomaschow commented 5 years ago

if its just one type is broken can we just disable that test?

I'm happy to do this and leave a comment.

johnynek commented 5 years ago

I'm sorry I missed this.

If we can't reproduce this in a test, why do we think we fixed the bug? It seems credible, but if we don't have a test that fails without the patch, I'm skeptical we have made any improvements.

thomaschow commented 5 years ago

but if we don't have a test that fails without the patch, I'm skeptical we have made any improvements.

From my understanding of the code, the correct fix and testing is elsewhere (possibly in scrooge).

ianoc commented 5 years ago

So i took a little look at this -- with these in future, always good to paste the actual error into the PR for folks, the error is:

[error] Test com.twitter.scalding.parquet.scrooge.ScroogeStructConverterTest.testBinary failed: expected:<message ParquetSchema {
[error]   required binary s (UTF8) = 1;
[error]   required binary b (UTF8) = 2;
[error] }
[error] > but was:<message ParquetSchema {
[error]   required binary s (UTF8) = 1;
[error]   required binary b = 2;
[error] }

Concretely:

System.out.println("structFromThriftSchemaConverter");
      System.out.println(toParquetSchema(structFromThriftSchemaConverter));
      System.out.println("structFromScroogeSchemaConverter");
      System.out.println(toParquetSchema(structFromScroogeSchemaConverter));

Yields:

structFromThriftSchemaConverter
message ParquetSchema {
  required binary s (UTF8) = 1;
  required binary b (UTF8) = 2;
}

structFromScroogeSchemaConverter
message ParquetSchema {
  required binary s (UTF8) = 1;
  required binary b = 2;
}

So from my reading of that, there is nothing wrong with the scrooge code in play here. It looks like the code added is having the expected effect (though we need some tests here to cover the change). The change your migrating looks like it should be in the version of parquet we are using, but it appears at least to be broken/doesn't work.

thomaschow commented 5 years ago

@ianoc sry thanks for the tips on etiquette here, and appreciate the digging. What would you recommend here? I've bumped parquet to the latest published version, and matched the libthrift version.

thomaschow commented 5 years ago

but the bug here is because we just inline the ScroogeStructConverter (which would otherwise be coming from parquet) and that has diverged.

ianoc commented 5 years ago

i'm not really following that -- based on the above printouts it looks like the TBase/JavaThrift converter paths is producing the wrong result based on what your aiming to get out of it?

As a first step i'd put this test on hold and write some direct tests for the behavior you expect to change here, since right now it looks like the TBase/ThriftStruct implementations don't agree, but some unit tests for each should nail that down

thomaschow commented 5 years ago

The current unit test imo already reveals the issue, hence why it's failing. The issue is in the difference between the StringAndBinary java generated classes of scalding and the main parquet project. Here's what's different (and why it's failing in scalding):

parquet-mr:

    tmpMap.put(_Fields.B, new org.apache.thrift.meta_data.FieldMetaData("b", org.apache.thrift.TFieldRequirementType.REQUIRED,
        new org.apache.thrift.meta_data.FieldValueMetaData(org.apache.thrift.protocol.TType.STRING        , true)));

scalding:

    tmpMap.put(_Fields.B, new FieldMetaData("b", TFieldRequirementType.REQUIRED,
      registerBinaryFieldValueMetaData(new FieldValueMetaData(TType.STRING), tmpSet)));

What's generated in scalding doesn't set the FieldValueMetaData correctly even though I've matched the libthrift versions to 0.9.3.

FYI this same unit test is in parquet-mr: https://github.com/apache/parquet-mr/blob/master/parquet-scrooge/src/test/java/org/apache/parquet/scrooge/ScroogeStructConverterTest.java#L81

I ran the following commands to generate these classes:

parquet-mr:

wget -nv http://archive.apache.org/dist/thrift/0.9.3/thrift-0.9.3.tar.gz
tar xzf thrift-0.9.3.tar.gz
cd thrift-0.9.3
chmod +x ./configure
./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang --without-php --without-nodejs
LC_ALL=C mvn -pl parquet-thrift -am clean compile

scalding:


./sbt -java-home /usr
scalding-parquet-scrooge/test
ianoc commented 5 years ago

@thomaschow a direct tests here would be that we seem to be mentioning java code gen vs scala code gen. Which feels to imply our testing isn't localized that we are testing our change directly. We are testing comparsions between java and scala. Which confuses the issue. Without your change do all the tests pass clean with the upgrade libraries? If they do then we should add a test using the scala/scrooge libraries + parquet (no java thrift) only and make a test that fails with the StringAndBinary data. Once that test fails we can then make it pass. looking to see if there is a bug in our code for the java handling/generation in scrooge we can then think about separately -- ultimately we might be ok to ignore that if we can test the changes we need here directly and show its a code gen issue for java. But if we disable this test we just aren't testing your change at all

thomaschow commented 5 years ago

@ianoc thanks for the advice -- so the tests pass with just the library version bumps and without my change, so I've added a test that bypasses java vs scala and just compares at the parquet level. Let me know if this is sufficient.

thomaschow commented 5 years ago

@ianoc any feedback on the current test?

without my change, it fails with:

[error] Test com.twitter.scalding.parquet.scrooge.ScroogeStructConverterTest.testScroogeBinary failed: expected:<message ParquetSchema {
[error]   required binary s (UTF8) = 1;
[error]   required binary b = 2;
[error] }
[error] > but was:<message ParquetSchema {
[error]   required binary s (UTF8) = 1;
[error]   required binary b (UTF8) = 2;
[error] }
[error] >, took 0.002 sec
[info] ScalaTest
[info] Run completed in 35 seconds, 390 milliseconds.
[info] Total number of tests run: 39
[info] Suites: completed 4, aborted 0
[info] Tests: succeeded 39, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[error] Failed: Total 70, Failed 1, Errors 0, Passed 69
[error] Failed tests:
[error]     com.twitter.scalding.parquet.scrooge.ScroogeStructConverterTest
[error] (scalding-parquet-scrooge/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 45 s, completed Dec 11, 2018 10:51:33 AM
thomaschow commented 5 years ago

cc @johnynek I'd like to merge this in so we can pick it up with the next scalding version -- does this test look sufficient to you?

johnynek commented 5 years ago

Thanks for showing the test that fails without the changes!