Frameless is a Scala library for working with Spark using more expressive types. It consists of the following modules:
- frameless-dataset for a more strongly typed Dataset/DataFrame API
- frameless-ml for a more strongly typed Spark ML API based on frameless-dataset
- frameless-cats for using Spark's RDD API with cats

Note that while Frameless is still getting off the ground, it is very possible that breaking changes will be made for at least the next few versions.
The Frameless project and contributors support the Typelevel Code of Conduct and want all its associated channels (e.g. GitHub, Discord) to be a safe and friendly environment for contributing and learning.
The compatible versions of Spark and cats are as follows:
Frameless | Spark | Cats | Cats-Effect | Scala |
---|---|---|---|---|
0.16.0 | 3.5.0 / 3.4.0 / 3.3.0 | 2.x | 3.x | 2.12 / 2.13 |
0.15.0 | 3.4.0 / 3.3.0 / 3.2.2 | 2.x | 3.x | 2.12 / 2.13 |
0.14.1 | 3.4.0 / 3.3.0 / 3.2.2 | 2.x | 3.x | 2.12 / 2.13 |
0.14.0 | 3.3.0 / 3.2.2 / 3.1.3 | 2.x | 3.x | 2.12 / 2.13 |
0.13.0 | 3.3.0 / 3.2.2 / 3.1.3 | 2.x | 3.x | 2.12 / 2.13 |
0.12.0 | 3.2.1 / 3.1.3 / 3.0.3 | 2.x | 3.x | 2.12 / 2.13 |
0.11.1 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 |
0.11.0* | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 |
0.10.1 | 3.1.0 | 2.x | 2.x | 2.12 |
0.9.0 | 3.0.0 | 1.x | 1.x | 2.12 |
0.8.0 | 2.4.0 | 1.x | 1.x | 2.11 / 2.12 |
0.7.0 | 2.3.1 | 1.x | 1.x | 2.11 |
0.6.1 | 2.3.0 | 1.x | 0.8 | 2.11 |
0.5.2 | 2.2.1 | 1.x | 0.8 | 2.11 |
0.4.1 | 2.2.0 | 1.x | 0.8 | 2.11 |
0.4.0 | 2.2.0 | 1.0.0-IF | 0.4 | 2.11 |
* The artifacts published for 0.11.0 against Spark 3.1.2 and 3.0.1 are broken.
Starting with 0.11, we introduced Spark cross-published artifacts: the suffix -spark{major}{minor} is added to artifacts that are released for the previous Spark version(s).

Artifact name examples:
- frameless-dataset (the latest Spark dependency)
- frameless-dataset-spark33 (Spark 3.3.x dependency)
- frameless-dataset-spark32 (Spark 3.2.x dependency)

Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0.
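The artifact naming scheme above translates into a build.sbt entry roughly as follows. This is a minimal sketch: the organization and version placeholder are the ones used elsewhere in this README, and the choice of the spark33 suffix is only an example, so check that the Frameless release you pick actually publishes that suffix.

```scala
// build.sbt fragment (sketch)
val framelessVersion = "<latest version>"

// The suffixed artifact targets Spark 3.3.x instead of the latest supported Spark.
libraryDependencies += "org.typelevel" %% "frameless-dataset-spark33" % framelessVersion
```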
The only dependency of the frameless-dataset module is on shapeless 2.3.2. Therefore, depending on frameless-dataset adds minimal overhead to your Spark application's jar. Only the frameless-cats module depends on cats and cats-effect, so if you prefer to work just with Datasets and not with RDDs, you may choose not to depend on frameless-cats.
Frameless intentionally does not have a compile dependency on Spark. This essentially allows you to use any version of Frameless with any version of Spark. The table above simply lists the versions of Spark we officially compile and test Frameless with; other versions may work as well.
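Because Spark is not pulled in as a compile dependency, a project typically declares its own Spark dependency next to Frameless. The fragment below is a sketch under assumptions: the Spark version shown is only an illustration taken from the compatibility table, and spark-sql is assumed to be the Spark module your application needs.

```scala
// build.sbt fragment (sketch)
val framelessVersion = "<latest version>"
// Assumption: replace with the Spark version your cluster or tests run against.
val sparkVersion = "3.5.0"

libraryDependencies ++= Seq(
  // Only the typed Dataset module; frameless-cats and frameless-ml are left out.
  "org.typelevel" %% "frameless-dataset" % framelessVersion,
  // Spark is declared explicitly by the application.
  "org.apache.spark" %% "spark-sql" % sparkVersion
)
```

Pairing Frameless with a Spark version outside the tested combinations may work, but it is not covered by the project's own test runs.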
Frameless introduces a new Spark API, called TypedDataset. The benefits of using TypedDataset compared to the standard Spark Dataset API are covered in the detailed comparison of TypedDataset with Spark's Dataset API.
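As a rough illustration of what the typed API looks like in practice, the sketch below creates a TypedDataset and selects one column by name. The Apartment case class, the object name, and the local Spark session settings are hypothetical choices made for this example; it assumes frameless-dataset and Spark are both on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset
import frameless.syntax._

// Hypothetical record type used only for this illustration.
final case class Apartment(city: String, surface: Int, price: Double)

object TypedDatasetSketch {
  // Selects the city column. The column name is resolved against the
  // fields of Apartment at compile time.
  def cities(apartments: TypedDataset[Apartment]): TypedDataset[String] =
    apartments.select(apartments('city))
  // Writing apartments('citi) instead is expected to fail compilation,
  // because Apartment has no field with that name.

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for experimenting.
    implicit val spark: SparkSession = SparkSession
      .builder()
      .master("local[*]")
      .appName("frameless-sketch")
      .config("spark.ui.enabled", "false")
      .getOrCreate()

    try {
      val apartments: TypedDataset[Apartment] = TypedDataset.create(
        Seq(
          Apartment("Paris", 50, 300000.0),
          Apartment("Lyon", 83, 200000.0)
        )
      )
      // collect() describes the Spark action; run() executes it.
      val names: Seq[String] = cities(apartments).collect().run()
      println(names.mkString(", "))
    } finally spark.stop()
  }
}
```

The symbol literal syntax ('city) emits a deprecation warning on Scala 2.13, so builds that treat warnings as errors may need to relax that setting for such code.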
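The 0.9.x and 0.10.x releases are compiled only against Scala 2.12.x. From 0.11.0 onward, Frameless is compiled against both Scala 2.12.x and 2.13.x (see the compatibility table above).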
To use Frameless in your project, add the following to your build.sbt file as needed:
val framelessVersion = "<latest version>"

resolvers ++= Seq(
  // for snapshot artifacts only
  "s01-oss-sonatype" at "https://s01.oss.sonatype.org/content/repositories/snapshots"
)

libraryDependencies ++= List(
  "org.typelevel" %% "frameless-dataset" % framelessVersion,
  "org.typelevel" %% "frameless-ml" % framelessVersion,
  "org.typelevel" %% "frameless-cats" % framelessVersion
)
An easy way to bootstrap a Frameless sbt project is to use the Giter8 template, either with g8 directly or through sbt new:
g8 imarios/frameless.g8
sbt new imarios/frameless.g8
Typing sbt console inside your project will bring up a shell with Frameless and all its dependencies loaded (including Spark).
Feel free to message us on our Discord channel with any issues or questions.
We require at least one sign-off (thumbs-up, +1, or similar) to merge pull requests. The current maintainers (people who can merge pull requests) are:
Frameless contains several property tests. To avoid OutOfMemoryErrors, we tune the default generator sizes. The following environment variables may be set to adjust the size of generated collections in the TypedDataSet suite:
Property | Default |
---|---|
FRAMELESS_GEN_MIN_SIZE | 0 |
FRAMELESS_GEN_SIZE_RANGE | 20 |
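
To make the role of the two variables concrete, here is a small sketch of how such settings could be read with the defaults from the table. It is not taken from the Frameless test sources: the GenSizes object and genSizeBounds helper are hypothetical, and treating the largest generated size as the minimum plus the range is an assumption based on how property-testing configurations commonly define a size range.

```scala
// Hypothetical helper, shown only to illustrate the two settings.
object GenSizes {
  // Returns (minSize, maxSize) for generated collections.
  // Assumption: maxSize = minSize + sizeRange.
  def genSizeBounds(env: Map[String, String]): (Int, Int) = {
    // Falls back to the default when the variable is unset or not an integer.
    def intOr(name: String, default: Int): Int =
      env.get(name)
        .flatMap(v => scala.util.Try(v.trim.toInt).toOption)
        .getOrElse(default)

    val minSize = intOr("FRAMELESS_GEN_MIN_SIZE", 0)
    val sizeRange = intOr("FRAMELESS_GEN_SIZE_RANGE", 20)
    (minSize, minSize + sizeRange)
  }

  // Bounds derived from the current process environment.
  val (minSize, maxSize) = genSizeBounds(sys.env)
}
```

Lowering FRAMELESS_GEN_SIZE_RANGE before running the tests would, under this reading, shrink the largest generated collections and reduce memory pressure.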
Code is provided under the Apache 2.0 license available at http://opensource.org/licenses/Apache-2.0, as well as in the LICENSE file. This is the same license that Spark uses.