Hi!
Parquet itself does not support fields of type Any. You need to specify a fixed type, so I suggest you change the model of DataMap. For example, you can have two maps: stringIds: Map[String, String] and decimalIds: Map[String, BigDecimal].
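A minimal sketch of that reshaped model (plain Scala; Ref's other field is kept as described later in the thread, and the names are only illustrative):

    case class Ref[+T](
      `type`: T,                           // the original type field
      stringIds: Map[String, String],      // values that are really strings
      decimalIds: Map[String, BigDecimal]  // values that are really decimals
    )

With a fixed value type in each map, parquet4s can derive a schema for both fields.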
Unfortunately, there’s way too much legacy code that depends on this. Can I dynamically generate the TypedSchemaDef via Ref[A] somehow? How is it that the SchemaDef I wrote actually works? I didn’t reason it out so much as I tried different things.
The schema is for the whole Parquet file - not for a single row. So, if you write all the decimals to one file and all the strings to another (with a different schema), then it will work.
However, you can expect problems later when reading files with conflicting schemas.
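A sketch of that split-file approach, assuming a parquet4s 1.x-style ParquetWriter API (the model and path names are illustrative):

    import com.github.mjakubowski84.parquet4s.ParquetWriter

    // Two homogeneous models, so each file gets a single, consistent schema.
    case class StringRef(name: String, ids: Map[String, String])
    case class DecimalRef(name: String, ids: Map[String, BigDecimal])

    ParquetWriter.writeAndClose("refs-strings.parquet", Seq(StringRef("a", Map("k" -> "v"))))
    ParquetWriter.writeAndClose("refs-decimals.parquet", Seq(DecimalRef("b", Map("k" -> BigDecimal(1)))))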
The thing is that I wrote the encoder to always write strings, but it seems like the type of the input data is checked against the output schema, as opposed to the encoder output being validated against the schema. So if I change that map to use stringSchema instead of decimalSchema, it fails to compile.
If the value of the Map[String, Any] can be of a finite set of possibilities (i.e. the value is either a string or a long), then I think the structure could feasibly be described as an Either.
Is there support for an Either structure (i.e. a map described as Map[String, Either[String, Long]])?
Of course, there is :)
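For illustration, partitioning a heterogeneous map into that Either shape is plain Scala (the raw input map here is made up):

    val raw: Map[String, Any] = Map("a" -> "x", "b" -> 42L)

    // Keep only the two expected runtime types, tagging each side of the Either.
    val typed: Map[String, Either[String, Long]] = raw.collect {
      case (k, v: String) => k -> Left(v)
      case (k, v: Long)   => k -> Right(v)
    }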
As I said before - do not insist on saving heterogeneous values of a map to a single collection. Partition your map into two: one for strings and the second for decimals. E.g. you can encode Ref directly as a RowParquetRecord if creating an intermediary case class is such a problem:
implicit def myEncoder[T]: OptionalValueEncoder[Ref[T]] =
  new OptionalValueEncoder[Ref[T]] {
    override def encodeNonNull(ref: Ref[T], configuration: ValueCodecConfiguration): Value =
      RowParquetRecord(
        "type"       -> ???, // the type rendered as a string value
        "stringIds"  -> MapParquetRecord( /* stringIds entries */ ),
        "decimalIds" -> MapParquetRecord( /* decimalIds entries */ )
      )
  }
And define a corresponding groupSchema.
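A sketch of such a group schema, following the SchemaDef.group / SchemaDef.map style used elsewhere in this thread (the implicit parameter list is an assumption):

    implicit def refSchema[T](
        implicit
        stringSchema: TypedSchemaDef[String],
        decimalSchema: TypedSchemaDef[BigDecimal]
    ): TypedSchemaDef[Ref[T]] =
      SchemaDef
        .group(
          stringSchema("type"),
          SchemaDef.map(stringSchema, stringSchema)("stringIds"),
          SchemaDef.map(stringSchema, decimalSchema)("decimalIds")
        )
        .typed[Ref[T]]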
There's another low-level option - you can implement a custom version of MapParquetRecord that writes several types of map entries (not strictly one type, as it is done now): https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L814
However, I do not recommend it, because it would be a non-standard approach to a map, and reading such a map would be a challenge in any existing application/framework.
My map seems to write okay; however, when I attempt to read it with parquet tools, I get an ArrowInvalid: Map keys must be provided error. Is there something I need to do explicitly to add the annotation here?
implicit def refSchema[A <: MyObject[_]](
    implicit stringSchema: TypedSchemaDef[String]
): TypedSchemaDef[Ref[A]] =
  SchemaDef
    .group(
      stringSchema("type"),
      SchemaDef.map(stringSchema, stringSchema)("ids")
    )
    .typed[Ref[A]]
Hi there. I have an existing case class:
abstract case class Ref[+T](`type`: T, ids: DataMap)
where the DataMap type is: type DataMap = collection.immutable.Map[String, Any]. Now, as things currently stand, the Any type only ever takes on the type String or BigDecimal (I'm using Scala 2.12). I can define a TypedSchemaDef[Ref[A]] that works when the DataMap value being encoded is String -> BigDecimal; a reconstruction of it follows below. The problem is that the value can also be String -> String, and when that is the case, I haven't been able to create a proper TypedSchemaDef. I would like to just encode the values of the map as String in all cases, and then have the OptionalValueCodec do the correct thing.
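For reference, a reconstruction of that schema, pieced together from the snippets earlier in this thread (the decimalSchema parameter is an assumption based on the comments above):

    implicit def refSchema[A <: MyObject[_]](
        implicit
        stringSchema: TypedSchemaDef[String],
        decimalSchema: TypedSchemaDef[BigDecimal]
    ): TypedSchemaDef[Ref[A]] =
      SchemaDef
        .group(
          stringSchema("type"),
          SchemaDef.map(stringSchema, decimalSchema)("ids")
        )
        .typed[Ref[A]]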
Is there a way I can solve this? Thanks for your help.