mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

Dynamic mapping of names from source to target case classes #222

Closed butleradamj closed 2 years ago

butleradamj commented 3 years ago

I'm curious if there is a way to do this mapping.

Given my source case class:

case class ForStoring(name: String, valueA: Double, valueB: Double, valueC: Double, valueD: Double etc ... )

I want to read only certain columns into a more generic case class:

case class ForReading(name: String, primaryValue: Double, secondaryValue: Double)

Sometimes I'll want to map valueA -> primaryValue and valueB -> secondaryValue; other times I might need to map valueA -> primaryValue and valueD -> secondaryValue, with a large number of potential scenarios.

I know I can do this by reading as GenericRecords, but if I could do this using withProjection I get the significant performance boost of only reading the desired columns.

In SQL terms I might think of it like: Select valueA as primaryValue, valueB as secondaryValue from...
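To make the request concrete, here is a minimal sketch in plain Scala of the "select ... as ..." mapping described above. It is not parquet4s code: a `Map[String, Any]` stands in for a generic record, and `ColumnMapping` and `toForReading` are hypothetical names introduced only for illustration. It shows the renaming step alone and says nothing about projection performance.

```scala
// Target case class from the question.
final case class ForReading(name: String, primaryValue: Double, secondaryValue: Double)

// Hypothetical description of one scenario:
// "select <primary> as primaryValue, <secondary> as secondaryValue".
final case class ColumnMapping(primary: String, secondary: String)

// A Map[String, Any] is used here as a stand-in for a generic record
// (column name -> value). Returns None if a column is missing or has
// an unexpected type.
def toForReading(row: Map[String, Any], mapping: ColumnMapping): Option[ForReading] =
  for {
    name      <- row.get("name").collect { case s: String => s }
    primary   <- row.get(mapping.primary).collect { case d: Double => d }
    secondary <- row.get(mapping.secondary).collect { case d: Double => d }
  } yield ForReading(name, primary, secondary)

// Example row with made-up values.
val row: Map[String, Any] =
  Map("name" -> "pump-1", "valueA" -> 1.0, "valueB" -> 2.0, "valueD" -> 4.0)

val ab = toForReading(row, ColumnMapping("valueA", "valueB"))
val ad = toForReading(row, ColumnMapping("valueA", "valueD"))
```

The same `ForReading` class serves both scenarios; only the `ColumnMapping` value changes.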

mjakubowski84 commented 3 years ago

Hi @butleradamj,

Under what condition do you want to switch source columns? Is it based on the value of another column? Can you give a more detailed example?

butleradamj commented 3 years ago

@mjakubowski84 Thanks for the response and the great library. I was trying to be generic but let me provide a little more detail.

We have source data from sensors that measure mechanical systems. There are a very large number of sensor readings. So our parquet files could have hundreds or thousands of columns.

Typically though when doing actual analysis, we are very focused on only one or two measurements at a time. The measurements we need to read change very frequently.

In order to get the best read performance from parquet/parquet4s we would like to leverage withProjection. However, in order for that to work, I would need to create case classes for every possible combo of measurements we want to analyze. This could result in creating hundreds of case classes with only a couple attributes in each.

This is where my request comes in: I was hoping to see if there is a way to create one case class and then map from specific measurement names to a generic name, while still getting the performance benefit of withProjection.

Hope this helps.

marcinaylien commented 3 years ago

Unfortunately, projection will not help much here. Parquet (as a file format) stores data in row groups, and files are read row group by row group as well. A row group is the smallest unit at which any optimisation can be applied. In practice, this means you have to know which columns you want to read before you read the file.

You can leverage filtering, though. If the files are partitioned, you can choose which measurement to read and then process it any way you like.

In general, I recommend changing how the files are stored. Partition the data by measurement type (I guess the data is already partitioned by time series, so just add another level), and consider storing measurements using a generic schema.
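As a rough illustration of the "generic schema partitioned by measurement type" suggestion, here is a sketch in plain Scala. The `Measurement` class, the helper names and the `data` base directory are assumptions made for this example, not part of parquet4s; the `measurement=<type>` directory name follows the common Hive-style partitioning convention.

```scala
// Hypothetical generic schema: one row per (sensor, measurement type, value)
// instead of one wide row with hundreds of value columns.
final case class Measurement(name: String, measurement: String, value: Double)

// Reshape one wide reading (column name -> value) into generic rows.
// Sorting by column name only keeps the output order deterministic.
def toMeasurements(name: String, readings: Map[String, Double]): List[Measurement] =
  readings.toList.sortBy(_._1).map { case (column, value) =>
    Measurement(name, column, value)
  }

// Hive-style partition directory for a row, e.g. "data/measurement=valueA".
def partitionDir(base: String, m: Measurement): String =
  s"$base/measurement=${m.measurement}"

// Example with made-up values.
val rows = toMeasurements("pump-1", Map("valueB" -> 2.0, "valueA" -> 1.0))
val dirs = rows.map(partitionDir("data", _))
```

With such a layout, a reader interested in a single measurement only has to open the files under that measurement's directory, and one case class covers every measurement.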

butleradamj commented 3 years ago

@marcinaylien Thank you for the follow-up. I was ultimately able to get things to work, but I had to work directly with the underlying parquet-mr library. If sharing code would help better demonstrate what I was thinking, I can clean it up and post it here, in case it gives you ideas on a potential expansion of parquet4s.

mjakubowski84 commented 3 years ago

Sure, any hint is welcome!

mjakubowski84 commented 2 years ago

Parquet4s now supports SQL-like column projection and aliases in generic records. Conversion from generic records to case classes is also simplified.
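For readers landing here, a sketch of how the original request might look with that feature. It assumes parquet4s 2.x on the classpath and uses `Col(...).as[T].alias(...)`, `ParquetReader.projectedGeneric` and `.as[ForReading]` as I recall them from the project documentation; check the exact signatures against the version you use. `readProjected` is a hypothetical helper name.

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetIterable, ParquetReader, Path}

// Target case class from the original question.
final case class ForReading(name: String, primaryValue: Double, secondaryValue: Double)

// Hypothetical helper: reads only three columns and renames two of them,
// roughly "select name, <primary> as primaryValue, <secondary> as secondaryValue".
// API names are assumed from the parquet4s 2.x docs.
def readProjected(path: Path, primary: String, secondary: String): ParquetIterable[ForReading] =
  ParquetReader
    .projectedGeneric(
      Col("name").as[String],
      Col(primary).as[Double].alias("primaryValue"),
      Col(secondary).as[Double].alias("secondaryValue")
    )
    .read(path)
    .as[ForReading] // convert generic records to the case class
```

The source column names are plain strings, so they can be chosen at runtime. The returned iterable holds file resources and should be closed after use.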