twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.48k stars, 703 forks

Improve spark backend a bit, add some write tests #1902

Closed. johnynek closed this 5 years ago

johnynek commented 5 years ago

This is based on some work with @stephbian.

We are trying to use spark-scalding for an internal library that compiles to Scalding, rather than making that library compile to Spark directly.

We do three things:

  1. make the persist level configurable
  2. add support for TextLine and WritableSequenceFile sources out of the box
  3. add tests for those sources
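For readers unfamiliar with what a configurable persist level means on the Spark side, here is a minimal sketch. It is not the code from this PR: the config key `example.persist.level` and the object name are hypothetical, and it assumes Spark is on the classpath. Only `StorageLevel.fromString`, `RDD.persist`, and `RDD.getStorageLevel` are real Spark calls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistLevelSketch {
  // Hypothetical config key, chosen for illustration only (not the key used by this PR).
  val PersistKey = "example.persist.level"

  // Read the storage level from a config map, falling back to MEMORY_AND_DISK.
  // StorageLevel.fromString throws IllegalArgumentException on an unknown name.
  def storageLevel(conf: Map[String, String]): StorageLevel =
    conf.get(PersistKey)
      .map(StorageLevel.fromString)
      .getOrElse(StorageLevel.MEMORY_AND_DISK)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-level-sketch")
      .getOrCreate()
    try {
      val level = storageLevel(Map(PersistKey -> "DISK_ONLY"))
      // Persist an intermediate RDD at the configured level.
      val doubled = spark.sparkContext.parallelize(1 to 100).map(_ * 2).persist(level)
      println(s"storage level matches: ${doubled.getStorageLevel == level}")
      println(s"count: ${doubled.count()}")
    } finally {
      spark.stop()
    }
  }
}
```

Keeping the level in configuration means a job that runs out of memory can be switched to a disk-backed level without a code change.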

I didn't discover any bugs, but the pattern probably wasn't clear to people. It would be nice to add more built-in types. I will try to make a follow-up using CSV/TSV, which requires a bit more work (since Scalding and Spark both need typeclasses to describe the types).
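To illustrate why CSV/TSV needs a typeclass describing the row type, here is a small, dependency-free sketch. The `TsvRow` trait and its instance are hypothetical names invented for this example; they are not the typeclasses Scalding or Spark actually use, only the general shape of the idea.

```scala
import scala.util.Try

// Hypothetical typeclass: describes how a value of type T maps to and from TSV fields.
trait TsvRow[T] {
  def toFields(t: T): List[String]
  def fromFields(fields: List[String]): Option[T]
}

object TsvRow {
  // Example instance for (Int, String) rows.
  implicit val intStringRow: TsvRow[(Int, String)] = new TsvRow[(Int, String)] {
    def toFields(t: (Int, String)): List[String] = List(t._1.toString, t._2)
    def fromFields(fields: List[String]): Option[(Int, String)] = fields match {
      case n :: s :: Nil => Try(n.toInt).toOption.map(i => (i, s))
      case _             => None
    }
  }

  // Serialize one value as a tab-separated line.
  def write[T](t: T)(implicit row: TsvRow[T]): String =
    row.toFields(t).mkString("\t")

  // Parse one tab-separated line; the -1 limit keeps trailing empty fields.
  def read[T](line: String)(implicit row: TsvRow[T]): Option[T] =
    row.fromFields(line.split("\t", -1).toList)
}
```

With an instance in scope, `TsvRow.write((1, "a"))` produces a tab-separated line and `TsvRow.read[(Int, String)]` parses it back. A line-oriented source such as TextLine needs no such description because every row is just a string.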

johnynek commented 5 years ago

@ttim @dieu can you all take a look?

stephbian commented 5 years ago

Nice tests. I'm excited to see whether switching the persist mode will fix the maxResultSize exceptions I'm seeing.

johnynek commented 5 years ago

Thanks for the review, @stephbian! I'll publish an internal version tomorrow and we can see if the persist changes are suitable for us.