twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Installation/getting started instructions outdated? #535

Open johnmaxwelliv opened 11 years ago

johnmaxwelliv commented 11 years ago

Hi,

I've been trying to install & run Scalding, but I've been running into some issues... it looks as though some of the installation directions may be outdated. I could go through the wiki and try to fix things myself, but I'd prefer it if someone more knowledgeable about the project did instead, as I don't want to add any incorrect information.

Working from https://github.com/twitter/scalding/wiki/Getting-Started I ran into this error:

Johns-MacBook-Pro :: ~/tmp » git clone git@github.com:twitter/scalding.git -b develop
Cloning into 'scalding'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

However, git clone https://github.com/twitter/scalding.git did work... I assume GitHub has changed its permissions scheme somehow so that the old clone command no longer works.

It also looks as though the installation instructions for scala & sbt may be out of date. The Getting Started page mentions both sbt 0.11 and sbt 0.12. This page seems to get me Scala 2.8.1, but my impression from looking at project/Build.scala is that that's not the preferred version (the development branch seems to want 2.9.3 and the master branch seems to want 2.9.2). In general, the information in "Using Scalding with other versions of Scala" seemed quite outdated; I had a hard time finding the source files/lines of code it was referring to.

The WordCountJob example from https://github.com/twitter/scalding/wiki/Getting-Started doesn't seem to work either... I get this error (w/ scala 2.9.3):

WordCountJob.scala:5: error: invalid escape character
    .flatMap('line -> 'word) { line : String => line.split("\s+") }
                                                             ^
one error found

(Bits of code from this example are also used elsewhere on the page... overall, seems likely the entire page could use a revamping.)
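The error above is consistent with how Scala 2.x treats backslashes in ordinary string literals: `\s` is not a recognized escape sequence, so the regex has to be written either with a doubled backslash or inside a triple-quoted literal. A minimal sketch using only the standard library (the object and method names are made up for illustration and are not taken from the Scalding sources):

```scala
// Hypothetical helper object, for illustration only.
object WordSplit {
  // Doubled backslash: the compiler turns "\\s+" into the regex \s+
  def tokenize(line: String): Array[String] = line.split("\\s+")

  // Triple-quoted literal: backslashes are taken verbatim, no escaping needed
  def tokenizeRaw(line: String): Array[String] = line.split("""\s+""")

  def main(args: Array[String]): Unit = {
    val line = "hello   big\tworld"
    println(tokenize(line).mkString(","))    // prints "hello,big,world"
    println(tokenizeRaw(line).mkString(",")) // prints "hello,big,world"
  }
}
```

Either spelling should be a drop-in replacement for the `"\s+"` argument in the wiki example.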

A couple other suggestions to make scalding easier to install and increase adoption:

mslinn commented 7 years ago

I've been looking at Scalding for only 2 minutes, quickly scanning the README, and some issues regarding the first code example jumped out at me:

TypedPipe.from(TextLine(args("input")))
    .flatMap { line => tokenize(line) }
    .groupBy { word => word } // use each word for a key
    .size // in each group, get the size
    .write(TypedText.tsv[(String, Long)](args("output")))

  1. The groupBy would be more idiomatic if written as: groupBy(identity).
  2. The size does not do what the comment suggests. Clearly this code example never worked because the next line, with the write, would result in a compiler error.

I found a similar method in the examples directory, which looks more reasonable:

class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.split("\\s+") }
    .map { word => (word, 1L) }
    .sumByKey
    // The compiler will enforce the type coming out of the sumByKey is the same as the type we have for our sink
    .write(TypedTsv[(String, Long)](args("output")))
}

johnynek commented 7 years ago

Can you explain why the comment is wrong? (size does work that way, as I understand the comment.) Also, can you post the compiler error?

I don't see the error.

johnynek commented 7 years ago

It is confusing to a lot of people who start with scalding that the methods on grouped things apply to each group:

see: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/KeyedList.scala#L45

where almost all the methods on groups are defined.

mslinn commented 7 years ago

@johnynek I probably posted too quickly, after only a brief glance at the docs. Should know better.

I just found Intro to Scalding Jobs on the wiki. Perhaps a link in the README from the WordCountJob source code to the walkthrough might be helpful.

After reading the walkthrough, I realized that TypedPipe.from returns a stream. This size method is unlike the Scala collections method of the same name in that it does not return a single Int. Instead, this size is a combinator that deals with streams, which makes sense given the nature of the library. I think that is what your most recent post was trying to tell me.
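The difference can be sketched with plain Scala collections. This is only an analogy for the shape of the result, not Scalding code: Scalding's grouped size runs as a distributed job, and the names below are hypothetical.

```scala
// Plain-collections analogy, assuming nothing about Scalding internals.
object PerGroupSize {
  // size on an ungrouped collection collapses everything into one Int
  def totalSize(words: List[String]): Int = words.size

  // Grouping first and then taking the size of each group yields one
  // (key, count) pair per distinct word, which is the shape the README
  // example writes out as (String, Long)
  def sizePerGroup(words: List[String]): Map[String, Long] =
    words.groupBy(identity).map { case (word, occurrences) =>
      word -> occurrences.size.toLong
    }

  def main(args: Array[String]): Unit = {
    val words = List("a", "b", "a", "c", "a")
    println(totalSize(words))
    println(sizePerGroup(words))
  }
}
```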

I expect that my comment about using identity is probably correct. I'll need to actually run the example Scalding code to know for sure, and that is not something I know how to do yet.

Have not yet found an explanation for why two versions of WordCountJob exist.

oscar-stripe commented 7 years ago

Yes, you are correct that we could have written groupBy(identity). As for the two ways to write it, that is a minor concern: there are often multiple ways to express a computation. It's actually hard for me to say why you'd do one or the other here. groupBy gives you access to more operations, while sumByKey gives you the common idiom of doing some Semigroup reduction for each value.
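As a rough illustration of why the two versions agree, here is the same pair of formulations on an in-memory list. It uses only the standard library, with Long addition standing in for the Semigroup reduction; the object and method names are invented for this sketch and nothing here is Scalding API.

```scala
// Standard-library sketch of the two word-count formulations.
object TwoWordCounts {
  // README style: group identical words, then take each group's size
  def countBySize(words: List[String]): Map[String, Long] =
    words.groupBy(identity).map { case (word, group) =>
      word -> group.size.toLong
    }

  // examples/ style: pair each word with 1L, then add the values per key
  def countBySum(words: List[String]): Map[String, Long] =
    words.map(word => (word, 1L)).foldLeft(Map.empty[String, Long]) {
      case (acc, (word, n)) => acc.updated(word, acc.getOrElse(word, 0L) + n)
    }

  def main(args: Array[String]): Unit = {
    val words = List("to", "be", "or", "not", "to", "be")
    println(countBySize(words) == countBySum(words)) // prints "true"
  }
}
```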