rcongiu / Hive-JSON-Serde

Read - Write JSON SerDe for Apache Hive.
Other
734 stars 391 forks source link

postgresql JSON_ARRAY_ELEMENTS equivalent #47

Closed beatboxa closed 7 years ago

beatboxa commented 10 years ago

This SerDe is great Roberto!

I've got a quick question: Is there an equivalent to the json_array_elements function in postgresql? If you're not familiar with the function, it explodes an array into its elements. There's a few examples here (as well as additional postgresql json functions): http://www.postgresql.org/docs/9.3/static/functions-json.html

Example: if you had 1 record, stored in an array field as follows: [[a, b], [c,d], e] then json_array_elements would result in 3 records: [a,b] [c,d] e which could then be split into multiple columns, using a this SerDe. a | b c | d e |

This function is particularly useful for arrays that contain multiple columns since it allows you to 'explode' first and parse columns later (as opposed to parsing columns and then exploding, which results in many unwanted combinations of values if you only combinations where position is equal).

Any thoughts on this?

rcongiu commented 10 years ago

I see, but I think you'd better have that using a UDF/UDTF rather than in the SerDe. I think the SerDe should just translate whatever data structure JSON has into the hive equivalent, but if you want to 'massage' it to something else, you should do it using something you can easily turn on and off. Whatever you do in the serde is executed for every query and can't be easily be turned off. In this case, I think your [a,b] , [c,d] should be instead declared as structs/objects, like [{ f1 => a, f2 =>b}, {f1 =>c, f2 => c } , {f1 => e}] this way you could refer to them easily using explode/lateral view.

beatboxa commented 10 years ago

Interesting...I never thought of this but I guess there are 2 perspectives on the hive equivalent :) And I totally understand your perspective as well.

Suppose you have the following:

[Transaction Number: 1 [Item Array: [Item Number: 1 UPC code: 11111111 Quantity: 2] [Item Number: 2 UPC code: 33333333 Quantity: 9]]] [Transaction Number: 2 [Item Array: [Item Number: 1 UPC code: 55555555 Quantity: 1]]]

When using this SerDe, this will translate to:

Transaction Number: 1 2

Item Number: [1,2] [1]

UPC Code: [11111111,33333333] [55555555]

Quantity: [2,9] [1]

And unfortunately, we've lost the fact that these last 3 columns (originally represented by a single array of values) are all related--that position #1 of Item Number corresponds to position #1 of Quantity. These 3 columns all represent the same array in the original, and we've lost that fact.

My take is that even if a physical row is defined by a distinct "Transaction Number", I think the data actually represents a record as being an Item within each Transaction. So I see this as actually being 3 records, not 2.

One way to think about it is that the ITEM_NUMBER, UPC_CODE, and QUANTITY fields are nested within the array in the original JSON--not the other way around, so JSON_ARRAY_ELEMENTS results in a closer representation of the original JSON because it first splits the array into multiple records; and then splits these records into their fields (just like the original JSON).

By splitting the fields first and the array second, we technically can reconstruct them, but it seems to be adding a redundant step and relies on external knowledge (knowing these arrays are related)--particularly if you have multiple arrays. I believe this is why postgreSQL implemented the JSON_array_elements specifically for JSON arrays in addition to having the standard array functions (~Hive UDF's).

If we had, for example, another array in the JSON as follows (that was at a different granularity):

[Transaction Number: 3 [Item Array: [Item Number: 1 UPC code: zzz Quantity: 4]] [Coupon Array: [Coupon Number: 1 Coupon Code: 123] [Coupon Number: 2 Coupon Code: 456]]]

And we used the SerDe, we'd end up with 2 new fields (in addition to the original Item fields):

Coupon Number: [1,2]

Coupon Code: [123,456]

But, in this example, the position of Coupon Number and Coupon Code only correspond to one another, and they don't correspond to the Item Array columns above. Unfortunately, this relationship between fields has been lost within the scope of the Hive table we've constructed.

This is all just my feedback. Your SerDe is great--just wanted to give you feedback for a potential improvement or additional functionality that could be very useful to maintain the data as close to the original state as possible.

rcongiu commented 10 years ago

Yeah, we had a similar issue at work, with multiple array fields that are related like in your example:

coupon id [1,2] coupon code [ aaa,bb]

albeit it's two different columns, it should in fact be a struct (let's call it coupon info) coupon info array [  { coupon_id => 1, coupon code=> aaa} , { coupon_id => 2, coupon code => bb }]

but we've done it with a generic UDF called structify , that takes as input a bunch of arrays, and transforms them into an array of structs. You'd find it more usefult were it instead implemented in the serde ?

R.

 

"Good judgment comes from experience.

Experience comes from bad judgment"

Data Engineer - OpenX.org Pasadena, CA Skype: sardodazione Y! IM: rcongiu

On Saturday, January 18, 2014 9:29 AM, beatboxa notifications@github.com wrote:

Interesting...I never thought of this but I guess there are 2 perspectives on the hive equivalent :) And I totally understand your perspective as well.

Suppose you have the following: Transaction Number: 1 Item Array: Item Number: 1 UPC code: 11111111 Quantity: 2 Item Number: 2 UPC code: 33333333 Quantity: 9 Transaction Number: 2 Item Array: Item Number: 1 UPC code: 55555555 Quantity: 1 When using this SerDe, this will translate to: Transaction Number: 1 2 Item Number: [1,2] [1] UPC Code: [11111111,33333333] [55555555] Quantity: [2,9] [1] And unfortunately, we've lost the fact that these last 3 columns (originally represented by an array of values) are all related--that position #1 of Item Number corresponds to position #1 of Quantity. These 3 columns all represent the same array in the original, and we've lost that fact. My take is that even if a physical row is defined by a distinct "Transaction Number", I think the data actually represents a record as being an Item within each Transaction. So I see this as actually being 3 records, not 2. One way to think about it is that the ITEM_NUMBER, UPC_CODE, and QUANTITY fields are nested within the array in the original JSON--not the other way around, so JSON_ARRAY_ELEMENTS results in a closer representation of the original JSON because it first splits the array into multiple records; and then splits these records into their fields (just like the original JSON). By splitting the fields first and the array second, we technically can reconstruct them, but it seems to be adding a redundant step and relies on external knowledge (knowing these arrays are related)--particularly if you have multiple arrays. I believe this is why postgreSQL implemented the JSON_array_elements specifically for JSON arrays in addition to having the standard array functions (~Hive UDF's). If we had, for example, another array in the JSON as follows (that was at a different granularity): Transaction Number: 3 Item Array: Item Number: 1 UPC code: zzz Quantity: 4 Coupon Array: Coupon Number: 1 Coupon Code: 123 Coupon Number: 2 Coupon Code: 456 And we used the SerDe, we'd end up with 2 new fields (in addition to the original Item fields): Coupon Number: [1,2] Coupon Code: [123,456] But, in this example, the position of Coupon Number and Coupon Code only correspond to one another, and they don't correspond to the Item Array columns above. Unfortunately, this relationship between fields has been lost within the scope of the Hive table we've constructed. This is all just my feedback. Your SerDe is great--just wanted to give you feedback for a potential improvement or additional functionality that could be very useful to maintain the data as close to the original state as possible. — Reply to this email directly or view it on GitHub.

beatboxa commented 10 years ago

Oh cool! Yes, personally, I'd find it much more useful in the SerDe...It can certainly be done outside with array UDF's (such as the one you mentioned), but doing this in the SerDe would save processing steps & required space, which can make a difference at scale.

I read some blogs on how to manually extract a certain position of the array in your SerDe, but as you can imagine, this may not scale well when the array has many values (50+) and varying ranges (ie. each array can have anywhere from 1 - 100+ positions). I've also seen a bunch of posts from various people attempting to do this (with the answer always being "this functionality doesn't exist in Hive today. You need to use custom SerDe's and UDF's."

The JSON_ARRAY_ELEMENTS was a lifesaver in PostgreSQL, and it would definitely be much appreciated in this SerDe so that this effectively becomes the de facto, flexible JSON-to-Hive parser.

Two different methods I can think of would be: 1) Don't even split the array into fields. Simply allow the user to store the entire array (including all fields) in a single Hive column like the struct in your example, but without having to explicitly define the number of records & array position (for another example: http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive). This allows the user to use a native Hive function subsequently to explode the array. This was similar to the method I'd use in PostgreSQL--I'd store each entire array JSON text in a single column, then use JSON_ARRAY_ELEMENTS to explode, and then the '=>' function to extract fields. For multiple arrays, (which each define a different level of granularity), I'd fill in nulls for columns that were part of a different array (effectively exploding each array and then unioning/appending the datasets together).

2) Allow the user to do both the array 'EXPLODE' and subsequent column parsing in one step. This would be really cool if it was part of this SerDe!

Thanks for taking the time and effort to at least assess the idea, and have a great weekend!

rcongiu commented 10 years ago

So, I added the feature of handling arrays in the serde, code is in the development branch and some docs on the readme.

vnnw commented 10 years ago

I have the same ETL problem as @beatboxa encountered. Hope to use this feature in the near future! BTW. Where can I get the UDF of STRUCTIFY as mentioned earlier? Thanks!

rcongiu commented 10 years ago

I just added the capability to have records mapped to Json arrays, so you can map a table to an array [ [ a1, a2,], [b1,b2], [c1,c2]] suing a table create table mytable ( a array, b array, c attay)...

but reading the thread, and the postgres doc, it looks like you're asking basically for what LATERAL VIEW EXPLODE already does, except you'd like to do it on the whole array.

I see the examples above, and it just looks to me that json arrays were used where json object should have. Can you send me a sample of the json you'd like parsed, and how you'd like the table to look like ?

vnnw commented 10 years ago

Thanks for the rapid response!

My data looks as below which is an order containing multiple items

[["SKU_001","Product Name A","Category1",$10.5,1],["SKU002","Product Name B","Category1",$20,3]]

The size of nested array is not fixed, which means there may be more than 2 item in the example.

And I wish to transform it into such

SKU | Product Name | Category | Unit Price | Product Quantity

SKU001 | Product Name A | Category1 | $10.5 | 1

SKU002 | Product Name B | Category2 | $20 | 3

As you mentioned, it may be best using a UDF/ UDTF to to the job. I tried LATERAL VIEW EXPLODE but failed.

On Sat, Feb 22, 2014 at 1:15 PM, Roberto Congiu notifications@github.comwrote:

I just added the capability to have records mapped to Json arrays, so you can map a table to an array [ [ a1, a2,], [b1,b2], [c1,c2]] suing a table create table mytable ( a array, b array, c attay)...

but reading the thread, and the postgres doc, it looks like you're asking basically for what LATERAL VIEW EXPLODE already does, except you'd like to do it on the whole array.

I see the examples above, and it just looks to me that json arrays were used where json object should have. Can you send me a sample of the json you'd like parsed, and how you'd like the table to look like ?

— Reply to this email directly or view it on GitHubhttps://github.com/rcongiu/Hive-JSON-Serde/issues/47#issuecomment-35795069 .

Best wishes,

Vinny

.. Life is just what happens to you While you're busy making other plans

vnnw commented 10 years ago

My data looks as below which is an order containing multiple items

[["SKU_001","Product Name A","Category1",10.5,1],["SKU002","Product Name B","Category1",20,3]]

The size of nested array is not fixed, which means there may be more than 2 item in the example.

And I wish to transform it into such

SKU | Product Name | Category | Unit Price | Product Quantity SKU001 | Product Name A | Category1 | 10.5 | 1 SKU002 | Product Name B | Category2 | 20 | 3

As you mentioned, it may be best using a UDF/ UDTF to to the job. I tried LATERAL VIEW EXPLODE but failed.

vnnw commented 10 years ago

I treat the data as string type (which as it is) and do some regex_replace and split operations, turn it into arrays type, and finally I can LATERAL VIEW EXPLODE it.

But my method is more like a workaround, maybe Hive needs an elegant UDF like json_array_element in postgressql.

On Sat, Feb 22, 2014 at 1:55 PM, Yuanlin Wu wyl2000@gmail.com wrote:

Thanks for the rapid response!

My data looks as below which is an order containing multiple items

[["SKU_001","Product Name A","Category1",$10.5,1],["SKU002","Product Name B","Category1",$20,3]]

The size of nested array is not fixed, which means there may be more than 2 item in the example.

And I wish to transform it into such

SKU | Product Name | Category | Unit Price | Product Quantity

SKU001 | Product Name A | Category1 | $10.5 | 1

SKU002 | Product Name B | Category2 | $20 | 3

As you mentioned, it may be best using a UDF/ UDTF to to the job. I tried LATERAL VIEW EXPLODE but failed.

On Sat, Feb 22, 2014 at 1:15 PM, Roberto Congiu notifications@github.comwrote:

I just added the capability to have records mapped to Json arrays, so you can map a table to an array [ [ a1, a2,], [b1,b2], [c1,c2]] suing a table create table mytable ( a array, b array, c attay)...

but reading the thread, and the postgres doc, it looks like you're asking basically for what LATERAL VIEW EXPLODE already does, except you'd like to do it on the whole array.

I see the examples above, and it just looks to me that json arrays were used where json object should have. Can you send me a sample of the json you'd like parsed, and how you'd like the table to look like ?

— Reply to this email directly or view it on GitHubhttps://github.com/rcongiu/Hive-JSON-Serde/issues/47#issuecomment-35795069 .

Best wishes,

Vinny

.. Life is just what happens to you While you're busy making other plans

Best wishes,

Vinny

.. Life is just what happens to you While you're busy making other plans