twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Remove `TypedPipe#narrowOrdering` #1856

Closed ttim closed 6 years ago

ttim commented 6 years ago

While in general Ordering can be considered contravariant, for OrderedSerialization it's not true - OrderedSerialization is invariant.

I've tried to remove Ordering narrowing and turned out it doesn't affect most call sites in Twitter repo. With this PR only 6 files weren't compiling but were simply fixable.

I think it worth removing this narrowing to make this code safer.

CLAassistant commented 6 years ago

CLA assistant check
All committers have signed the CLA.

johnynek commented 6 years ago

Can we do this: let’s add a test that fails with the old code, then do this.

My claim is no such test exists. Yes OrderedSerialization is invariant (as is Ordering) but the only method we need is compare, which is contravariant.

Put another way: this loses the contract that these sorting methods don’t change the elements of the TypedPipe, just the order. If the elements don’t change, just the order, the type shouldn’t change. Now the type does change after a sorting operation and that seems incorrect to me.

It is a subtle issue. But casts are not always incorrect. We don’t like them because we are bypassing the compilers ability to check for us. My claim is that this is safe 100% of the time, but scala by happenstance chose not to make Ordering contravariant (in fact implicit resolution is arguably broken for contravariant typeclasses and so that may have something to do with it).

Already 0.18 is very hard to adopt. Last I heard twitter still is not all the develop branch.

I’m not crazy about making it harder.

ttim commented 6 years ago

@johnynek here is a test which fails during reduce phase with exception Caused by: java.lang.ClassCastException: com.twitter.scalding.OrderedSerializationAndDistinctJob$User cannot be cast to com.twitter.scalding.OrderedSerializationAndDistinctJob$Student:

object OrderedSerializationAndDistinctJob {
  class User(val id: Long)

  implicit val ord: OrderedSerialization[User] =
    OrderedSerialization.viaTransform[User, Long](_.id, new User(_))(RequiredBinaryComparators.orderedSerialization[Long])

  class Student(override val id: Long, val name: String) extends User(id)
}

class OrderedSerializationAndDistinctJob(args: Args) extends Job(args) {
  import OrderedSerializationAndDistinctJob._

  TypedPipe
    .from(0 to 100)
    .map(id => new Student(id, "name"))
    .distinct
    .map(_.name)
    .write(TypedText.tsv[String]("output"))
}

class OrderedSerializationAndDistinctJobTest extends WordSpec with Matchers {
  "Distinct should work" should {
    JobTest(new OrderedSerializationAndDistinctJob(_))
      .sink[String](TypedText.tsv[String]("output")) { outputBuffer =>
        val data = outputBuffer.toList
        assert(data.size == 100, "should be 100 elements")
        assert(data.forall { _ == "name" }, "all elements should be name")
      }
      .runHadoop.finish()
  }
}

I'm not putting this test as part of PR because with this PR this test doesn't compile. I know it's kinda artificial use case because I've implemented OrderedSerialization on my own which people don't do, so I'm not really picking on this PR. But issue is that OrderedSerialization is conceptually invariant - it also serialize and deserialize things.

I think my biggest point here - it was literally 6 places over twitter repo to fix it so I don't think it actually gonna add pain for upgrades while it can lead to code which compiles, don't have asInstanceOf and still fails at runtime.

johnynek commented 6 years ago

okay, yeah, i see the issue.

We reflectively match on OrderedSerialization when used on a key. Easy to forget about that. So, distinct/distinctBy is not actually safe.

Can we move distinctBy here: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/TypedPipe.scala#L209

to simplify the type signature (so there is only one T not U >: T).

ttim commented 6 years ago

@johnynek made it for distinct. I think it can break for sortedTake as well if we delegate comparison of elements to cascading through OrderedSerialization and comparison of binary data. I think it worth to fix as well - it was only 1 broken place in whole twitter's repo.

johnynek commented 6 years ago

👍