prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.92k stars 5.33k forks source link

When query a column of type map<string, string>, the keys of the map are converted to lowercase #20701

Closed yhwang closed 3 months ago

yhwang commented 1 year ago

When running the following query:

SELECT MAP(ARRAY['myFirstRow', 'mySecondRow'], ARRAY[cast(row('row1FieldValue1', 'row1FieldValue2') as row("firstField" varchar, "secondField" varchar)), cast(row('row2FieldValue1', 'row2FieldValue2') as row("firstField" varchar, "secondField" varchar))]) as mapField;

I got the following results:

                                                                   mapField                                                                    
-----------------------------------------------------------------------------------------------------------------------------------------------
 {mySecondRow={firstfield=row2FieldValue1, secondfield=row2FieldValue2}, myFirstRow={firstfield=row1FieldValue1, secondfield=row1FieldValue2}} 
(1 row)

The keys of the map are firstField and secondField in the query. However, they become firstfield and secondfield in the query results.

Your Environment

Expected Behavior

The result should preserve the original keys without lowercase them

Current Behavior

The keys of a map are converted to lowercase

Possible Solution

Not sure this is a CLI or server side issue

Steps to Reproduce

  1. start a presto CLI
  2. run the following query: SELECT MAP(ARRAY['myFirstRow', 'mySecondRow'], ARRAY[cast(row('row1FieldValue1', 'row1FieldValue2') as row("firstField" varchar, "secondField" varchar)), cast(row('row2FieldValue1', 'row2FieldValue2') as row("firstField" varchar, "secondField" varchar))]) as mapField;

Screenshots (if appropriate)

Context

milescrawford commented 1 year ago

This also seems to affect the contents of the column, which I think is a bigger issue.

Presto is reading from a json table like this:

{"corpusid":186849207,"externalids":{"ACL":null,"DBLP":null,"ArXiv":null,"MAG":"2914916907","CorpusId":"186849207", ...

Output of presto query: select * from table limit 1

#   corpusid    corpusid    externalids ...
1   186777859   186777859   {mag=2783694422, corpusid=186777859, pubmed=null, pubmedcentral=null, arxiv=null, acl=null, dblp=null, doi=10.29171/azu_acku_risalah_jq1765_a55_alif449a_1394} ...

The actual contents of map<string, string> have been lowercased. If just the column name externalid or even the struct keys as in the original report were lowercased, maybe that's okay. But the strings themselves stored in the column? That seems like data corruption.

yhwang commented 1 year ago

@milescrawford can you share more info about what data source and connector you are using? In the meantime, can you help to verify one thing:

If the data is wrong, the problem would be even bigger. But I hope this is not the case.

tdcmeehan commented 1 year ago

In particular, when you ETL it into a separate table, does reading the table with another tool reproduce the issue? e.g. Spark or manually inspecting the file

milescrawford commented 1 year ago

Yes, this is a presto query via AWS athena, using the UNLOAD ... TO ... format, and the output json is also changed:

{"corpusid":208256695,"externalids":{"acl":null,"dblp":null,"arxiv":null,"mag":null,"corpusid":"208256695","pubmed":null, ...
tdcmeehan commented 1 year ago

@milescrawford how are you inspecting the output JSON--is that through an Athena query?

milescrawford commented 1 year ago

no, the output json is consumed by another application. I am inspecting it and copying it to here manually by running aws s3 cp <url> - |gzcat |head

hantangwangd commented 8 months ago

I think the two problems are both involving CAST. When we try to build a CAST expression, we should retain case for the quoted field names in target type string. And when we try to CAST json string to other type like RowType, we should not roughly translate the json string to lower case, that would corrupt the json data.

I have create a PR #21602 which appears to address this issue.

tdcmeehan commented 8 months ago

The fix in #21602 will be released in 0.286 and the next edge release.

mbasmanova commented 7 months ago

Reopening since the original fix was reverted.