twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0

Private Schema and other private methods and constructors #32

Open degloff opened 6 years ago

degloff commented 6 years ago

Is there a reason why the convenience object Schema is private?

private[timeseries] object Schema

For instance:

// Preferred, but does not compile because Schema is private[timeseries]
val tsRdd = TimeSeriesRDD.fromRDD(
  sc.parallelize(data, defaultNumPartitions),
  Schema("time" -> LongType, "id" -> IntegerType, "price" -> DoubleType)
)(isSorted = true, timeUnit = TimeUnit.NANOSECONDS)

// Workaround: build the StructType by hand
val schema = StructType(
  StructField("time", LongType) ::
    StructField("id", IntegerType) ::
    StructField("price", DoubleType) :: Nil)
val tsRdd1 = TimeSeriesRDD.fromRDD(
  sc.parallelize(data, defaultNumPartitions),
  schema
)(isSorted = true, timeUnit = TimeUnit.NANOSECONDS)
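
In the meantime, a small user-side helper can get close to the preferred syntax. This is only a minimal sketch that uses public Spark SQL classes: `SimpleSchema` and `SimpleSchemaExample` are hypothetical names, not part of Flint, and the helper does not reproduce any extra checks that Flint's own Schema object may perform.

```scala
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, LongType, StructField, StructType}

// Hypothetical helper (not part of Flint): builds a StructType from
// name -> type pairs, keeping the columns in the order they are given.
object SimpleSchema {
  def apply(columns: (String, DataType)*): StructType =
    StructType(columns.map { case (name, dataType) => StructField(name, dataType) })
}

object SimpleSchemaExample {
  def main(args: Array[String]): Unit = {
    val schema = SimpleSchema("time" -> LongType, "id" -> IntegerType, "price" -> DoubleType)
    // Print the resulting schema for a quick visual check.
    println(schema.treeString)
  }
}
```

The resulting StructType can be passed to TimeSeriesRDD.fromRDD exactly like the hand-built schema above.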

Also, some TimeSeriesRDD constructors that would be useful are private:

private[timeseries] def fromSeq(
  sc: SparkContext,
  rows: Seq[InternalRow],
  schema: StructType,
  isSorted: Boolean,
  numSlices: Int = 1
): TimeSeriesRDD

private[flint] def fromOrderedRDD(
  rdd: OrderedRDD[Long, Row],
  schema: StructType
): TimeSeriesRDD = {
  val converter = CatalystTypeConvertersWrapper.toCatalystRowConverter(schema)
  TimeSeriesRDD.fromInternalOrderedRDD(rdd.mapValues {
    case (_, row) => converter(row)
  }, schema)
}

Access to the underlying OrderedRDD would also be valuable for testing, but it is private as well:

private[flint] def orderedRdd: OrderedRDD[Long, InternalRow]

That said, exposing these may open up the implementation too much.

icexelloss commented 6 years ago

Hi,

The Schema object can probably be made public for convenience, but it shouldn't be considered a stable public API.

The private constructors are purely for internal use and can change drastically, so I prefer not to open them.