Closed ttim closed 6 years ago
Can we do this: let’s add a test that fails with the old code, then do this.
My claim is no such test exists. Yes OrderedSerialization is invariant (as is Ordering) but the only method we need is compare, which is contravariant.
Put another way: this loses the contract that these sorting methods don’t change the elements of the TypedPipe, just the order. If the elements don’t change, just the order, the type shouldn’t change. Now the type does change after a sorting operation and that seems incorrect to me.
It is a subtle issue. But casts are not always incorrect. We don’t like them because we are bypassing the compilers ability to check for us. My claim is that this is safe 100% of the time, but scala by happenstance chose not to make Ordering contravariant (in fact implicit resolution is arguably broken for contravariant typeclasses and so that may have something to do with it).
Already 0.18 is very hard to adopt. Last I heard twitter still is not all the develop branch.
I’m not crazy about making it harder.
@johnynek here is a test which fails during reduce phase with exception Caused by: java.lang.ClassCastException: com.twitter.scalding.OrderedSerializationAndDistinctJob$User cannot be cast to com.twitter.scalding.OrderedSerializationAndDistinctJob$Student
:
object OrderedSerializationAndDistinctJob {
class User(val id: Long)
implicit val ord: OrderedSerialization[User] =
OrderedSerialization.viaTransform[User, Long](_.id, new User(_))(RequiredBinaryComparators.orderedSerialization[Long])
class Student(override val id: Long, val name: String) extends User(id)
}
class OrderedSerializationAndDistinctJob(args: Args) extends Job(args) {
import OrderedSerializationAndDistinctJob._
TypedPipe
.from(0 to 100)
.map(id => new Student(id, "name"))
.distinct
.map(_.name)
.write(TypedText.tsv[String]("output"))
}
class OrderedSerializationAndDistinctJobTest extends WordSpec with Matchers {
"Distinct should work" should {
JobTest(new OrderedSerializationAndDistinctJob(_))
.sink[String](TypedText.tsv[String]("output")) { outputBuffer =>
val data = outputBuffer.toList
assert(data.size == 100, "should be 100 elements")
assert(data.forall { _ == "name" }, "all elements should be name")
}
.runHadoop.finish()
}
}
I'm not putting this test as part of PR because with this PR this test doesn't compile.
I know it's kinda artificial use case because I've implemented OrderedSerialization
on my own which people don't do, so I'm not really picking on this PR. But issue is that OrderedSerialization
is conceptually invariant - it also serialize and deserialize things.
I think my biggest point here - it was literally 6 places over twitter repo to fix it so I don't think it actually gonna add pain for upgrades while it can lead to code which compiles, don't have asInstanceOf
and still fails at runtime.
okay, yeah, i see the issue.
We reflectively match on OrderedSerialization when used on a key. Easy to forget about that. So, distinct/distinctBy is not actually safe.
Can we move distinctBy here: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/TypedPipe.scala#L209
to simplify the type signature (so there is only one T
not U >: T
).
@johnynek made it for distinct
. I think it can break for sortedTake
as well if we delegate comparison of elements to cascading through OrderedSerialization
and comparison of binary data.
I think it worth to fix as well - it was only 1 broken place in whole twitter's repo.
👍
While in general
Ordering
can be considered contravariant, forOrderedSerialization
it's not true -OrderedSerialization
is invariant.I've tried to remove
Ordering
narrowing and turned out it doesn't affect most call sites in Twitter repo. With this PR only 6 files weren't compiling but were simply fixable.I think it worth removing this narrowing to make this code safer.