tweag / sparkle

Haskell on Apache Spark.
BSD 3-Clause "New" or "Revised" License

How to operate on `RDD Row`? #151

Closed bwbaugh closed 2 years ago

bwbaugh commented 5 years ago

As a beginner, how is one expected to operate on `RDD Row` values with this library?

Some things that I’ve tried.

Keying by a string column with `Spark.getString`:

v <- Spark.keyBy (Spark.closure $ static (Spark.getString 3)) myrdd :: IO (Spark.PairRDD Text Spark.Row)
        • Couldn't match type ‘IO Text’ with ‘Text’
          Expected type: IO (Spark.PairRDD Text Spark.Row)
            Actual type: IO (Spark.PairRDD (IO Text) Spark.Row)
        • In a stmt of a 'do' block:
                   v <- Spark.keyBy
                          (Spark.closure $ static (Spark.getString 3)) myrdd ::
                          IO (Spark.PairRDD Text Spark.Row)
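
Reading the error, the key type comes out as `IO Text` because the getter's result lives in `IO`, so `keyBy` pairs each row with an unexecuted action rather than with a `Text`. The sketch below reproduces only that type mismatch in plain Haskell, without sparkle. `Row`, `getString`, `keyBy`, and `keyByM` here are hypothetical stand-ins defined locally, with the shape of `getString` assumed from the error message; this is not the library's API and not a fix for the sparkle call itself.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import qualified Data.Text.IO as TIO

-- Hypothetical stand-in for Spark.Row: a plain list of column values.
newtype Row = Row [Text]

-- Assumed shape of the getter, inferred from the error: the result is in IO.
getString :: Int -> Row -> IO Text
getString i (Row cols) = pure (cols !! i)

-- Pure analogue of keyBy, with a list standing in for an RDD.
keyBy :: (v -> k) -> [v] -> [(k, v)]
keyBy f = map (\v -> (f v, v))

-- Effectful variant: runs the getter on each element to obtain plain keys.
keyByM :: Applicative m => (v -> m k) -> [v] -> m [(k, v)]
keyByM f = traverse (\v -> fmap (\k -> (k, v)) (f v))

main :: IO ()
main = do
  let rows = [Row ["a", "b", "c", "d"], Row ["e", "f", "g", "h"]]
      -- The keys here are IO actions, mirroring PairRDD (IO Text) Row.
      unevaluated = keyBy (getString 3) rows :: [(IO Text, Row)]
  -- Running each action recovers Text keys.
  keys <- traverse fst unevaluated
  mapM_ TIO.putStrLn keys
  -- The effectful variant yields Text keys directly.
  keyed <- keyByM (getString 3) rows
  mapM_ (TIO.putStrLn . fst) keyed
```

In this local model the pure `keyBy` can only ever produce `IO Text` keys from an `IO`-returning getter; getting `Text` keys requires something that actually runs the action per element, which is what `keyByM` does here.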

Trying a constant that ignores the input row for testing purposes:

v <- Spark.keyBy (Spark.closure $ static (\row -> "test" )) myrdd :: IO (Spark.PairRDD Text Spark.Row)
        • No instance for (Spark.Static
                             (jvm-0.4.2:Language.Java.Reify Spark.Row))
            arising from a use of ‘Spark.keyBy’
        • In a stmt of a 'do' block:
                   v <- Spark.keyBy
                          (Spark.closure $ static (\ row -> "test")) myrdd ::
                          IO (Spark.PairRDD Text Spark.Row)

Calling `keyBy` from inline Java instead:

v <- [java| $myrdd.keyBy(x -> x.getString(3)) |] :: IO (Spark.PairRDD Text Spark.Row)
error: cannot find symbol
      symbol:   method getString(int)
      location: variable x of type Object

Are we not supposed to use `RDD Row`, and instead use tuples such as `RDD (Text, Text, Double, Text)` (as issue-124 might suggest)? If so, is there a simple example of creating, getting, or using an `Encoder` for the `as` function? Is this even possible? I am having some difficulty following the example in apps/dataframe, which also creates a specialized `Tuple2`: https://github.com/tweag/sparkle/blob/6086b54b1a0240d53c20fac5e61d7f88eec1aeca/apps/dataframe/Main.hs#L29

facundominguez commented 5 years ago

There is an old issue discussing using rows: https://github.com/tweag/sparkle/issues/93

I'm afraid there is no definitive way to use them yet. As you learn sparkle, I'm sure you'll find multiple spots like this one that can be improved.

facundominguez commented 2 years ago

Closing this due to inactivity, but please feel free to reopen.