mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

Question. How to define a TypedSchemaDef for a Map[String, Any] #294

Closed · reggieperry closed this 1 year ago

reggieperry commented 1 year ago

Hi there. I have an existing case class: abstract case class Ref[+T](`type`: T, ids: DataMap), where the DataMap type is: type DataMap = collection.immutable.Map[String, Any]. (The `type` field is backticked because type is a Scala keyword.)

As things currently stand, the Any value only ever holds a String or a BigDecimal (I'm using Scala 2.12). I can define a TypedSchemaDef[Ref[A]] which looks like this:

implicit def refSchema[A](implicit
    decimalSchema: TypedSchemaDef[BigDecimal],
    stringSchema: TypedSchemaDef[String]
): TypedSchemaDef[Ref[A]] =
  SchemaDef
    .group(
      stringSchema("type"),
      SchemaDef.map(stringSchema, decimalSchema)("ids")
    )
    .typed[Ref[A]]

This works when the DataMap being encoded maps String to BigDecimal. The problem is that it can also map String to String, and in that case I haven't been able to create a proper TypedSchemaDef. I would like to encode the map values as String in all cases; the OptionalValueCodec can then do the correct thing.

Is there a way I can solve this? Thanks for your help.

mjakubowski84 commented 1 year ago

Hi! Parquet itself does not support fields of type Any. You need to specify a fixed type. So I suggest you change the model of DataMap. For example, you can have two maps: stringIds: Map[String, String] and decimalIds: Map[String, BigDecimal].
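
A minimal sketch of such a model, using hypothetical names (RefRecord, stringIds, decimalIds); for a plain case class like this, parquet4s can derive the schema and codecs automatically:

    // One homogeneous map per value type instead of Map[String, Any].
    // `type` is backticked because it is a Scala keyword.
    case class RefRecord(
      `type`: String,
      stringIds: Map[String, String],
      decimalIds: Map[String, BigDecimal]
    )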

reggieperry commented 1 year ago

Unfortunately, there’s way too much legacy code that depends on this. Can I dynamically generate the TypedSchemaDef via Ref[A] somehow? How is it that the SchemaDef I wrote actually works? I didn’t reason it out so much as I tried different things.

mjakubowski84 commented 1 year ago

The schema applies to the whole Parquet file, not to a single row. So if you write all the decimal-valued records to one file and all the string-valued ones to another (with a different schema), it will work.

However, you can expect problems later when reading files with conflicting schemas.
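
A minimal sketch of that per-file split, assuming the parquet4s 2.x writer API and hypothetical case classes StringRef and DecimalRef (one per value type); stringRefs and decimalRefs stand in for your data:

    import com.github.mjakubowski84.parquet4s.{ParquetWriter, Path}

    case class StringRef(`type`: String, ids: Map[String, String])
    case class DecimalRef(`type`: String, ids: Map[String, BigDecimal])

    // Each file's schema is resolved once from the element type,
    // so every row in a given file shares the same schema.
    ParquetWriter.of[StringRef].writeAndClose(Path("refs-strings.parquet"), stringRefs)
    ParquetWriter.of[DecimalRef].writeAndClose(Path("refs-decimals.parquet"), decimalRefs)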

reggieperry commented 1 year ago

The thing is, I wrote the encoder to always write strings, but it seems the type of the input data is checked against the output schema rather than the encoder output being validated against it. So if I change that map to use stringSchema instead of decimalSchema, it fails to compile.

normana400 commented 1 year ago

If the value of the Map[String, Any] can only be one of a finite set of possibilities (e.g. either a String or a Long), then I think the structure could feasibly be described as an Either.

Is there support for an Either structure (i.e. a map described as Map[String, Either[String, Long]])?

mjakubowski84 commented 1 year ago

Of course, there is :) But as I said before, do not insist on saving heterogeneous map values in a single collection. Partition your map into two: one for strings and one for decimals. E.g., if creating an intermediary case class is such a problem, you can encode Ref directly as a RowParquetRecord:

implicit def myEncoder[T]: OptionalValueEncoder[Ref[T]] =
  new OptionalValueEncoder[Ref[T]] {
    override def encodeNonNull(ref: Ref[T], configuration: ValueCodecConfiguration): Value =
      RowParquetRecord(
        "type"       -> [type as string],
        "stringIds"  -> MapParquetRecord([stringIds entries]),
        "decimalIds" -> MapParquetRecord([decimalIds entries])
      )
  }

And define a corresponding groupSchema.
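
A sketch of that group schema, mirroring the refSchema posted earlier in this thread but with the partitioned field names (stringIds and decimalIds match the hypothetical names in the encoder above):

    implicit def refSchema[A](implicit
        stringSchema: TypedSchemaDef[String],
        decimalSchema: TypedSchemaDef[BigDecimal]
    ): TypedSchemaDef[Ref[A]] =
      SchemaDef
        .group(
          stringSchema("type"),
          SchemaDef.map(stringSchema, stringSchema)("stringIds"),
          SchemaDef.map(stringSchema, decimalSchema)("decimalIds")
        )
        .typed[Ref[A]]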

mjakubowski84 commented 1 year ago

There's another low-level option: you can implement a custom version of MapParquetRecord that writes several types of map entries (rather than strictly one type, as is done now): https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L814

However, I do not recommend it, because it would be a non-standard approach to a map, and reading such a map would be a challenge in any existing application or framework.

normana400 commented 1 year ago

My map seems to write okay; however, when I attempt to read it in parquet tools, I get ArrowInvalid: Map keys must be provided. Is there something I need to do explicitly to add the annotation here?

    implicit def refSchema[A <: MyObject[_]](implicit stringSchema: TypedSchemaDef[String]): TypedSchemaDef[Ref[A]] =
      SchemaDef
        .group(
          stringSchema("type"),
          SchemaDef.map(stringSchema, stringSchema)("ids")
        )
        .typed[Ref[A]]