Closed yhwang closed 3 months ago
This also seems to affect the contents of the column, which I think is a bigger issue.
Presto is reading from a json table like this:
{"corpusid":186849207,"externalids":{"ACL":null,"DBLP":null,"ArXiv":null,"MAG":"2914916907","CorpusId":"186849207", ...
Output of presto query: select * from table limit 1
# corpusid corpusid externalids ...
1 186777859 186777859 {mag=2783694422, corpusid=186777859, pubmed=null, pubmedcentral=null, arxiv=null, acl=null, dblp=null, doi=10.29171/azu_acku_risalah_jq1765_a55_alif449a_1394} ...
The actual contents of map<string, string> have been lowercased. If just the column name externalids or even the struct keys, as in the original report, were lowercased, maybe that's okay. But the strings themselves stored in the column? That seems like data corruption.
@milescrawford can you share more info about what data source and connector you are using? In the meantime, can you help to verify one thing:
If the data is wrong, the problem would be even bigger. But I hope this is not the case.
In particular, when you ETL it into a separate table, does reading the table with another tool reproduce the issue? e.g. Spark or manually inspecting the file
Yes, this is a Presto query via AWS Athena, using the UNLOAD ... TO ... format, and the output JSON is also changed:
{"corpusid":208256695,"externalids":{"acl":null,"dblp":null,"arxiv":null,"mag":null,"corpusid":"208256695","pubmed":null, ...
@milescrawford how are you inspecting the output JSON--is that through an Athena query?
No, the output JSON is consumed by another application. I am inspecting it and copying it here manually by running aws s3 cp <url> - | gzcat | head
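For anyone who wants to repeat this check outside of Athena, here is a minimal Python sketch that scans gzipped JSON-lines output and reports which externalids keys show up only in lowercased form. The expected key spellings are taken from the sample source row above; the function names and the local file name in the usage note are hypothetical.

```python
import gzip
import io
import json

# Key spellings as they appear in the sample source row (assumed to be the full set of interest).
EXPECTED_KEYS = {"ACL", "DBLP", "ArXiv", "MAG", "CorpusId"}


def lowercased_keys(record, expected=EXPECTED_KEYS):
    """Return the expected keys that are present only in lowercased form."""
    actual = set(record.get("externalids") or {})
    return sorted(k for k in expected if k not in actual and k.lower() in actual)


def scan(stream, limit=10):
    """Yield (corpusid, lowercased keys) for the first `limit` JSON lines."""
    for i, line in enumerate(stream):
        if i >= limit:
            break
        record = json.loads(line)
        yield record.get("corpusid"), lowercased_keys(record)
```

Usage, assuming the object has already been copied to a local file (hypothetical name): `with gzip.open("unload-part.gz", "rt") as f: print(list(scan(f)))`. An empty list for a record means its keys kept their original case.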
I think both problems involve CAST. When we build a CAST expression, we should retain the case of the quoted field names in the target type string. And when we CAST a JSON string to another type such as RowType, we should not blindly convert the whole JSON string to lowercase, since that would corrupt the JSON data.
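To illustrate the difference, here is a rough Python sketch (not Presto's actual implementation) contrasting the two approaches. It assumes that field lookup against the target type is case-insensitive and that the declared, quoted spelling should be what appears in the result; both function names are made up for this example.

```python
import json


def naive_cast(json_text):
    # Lowercases the entire JSON text, so keys AND string values are altered.
    return json.loads(json_text.lower())


def case_preserving_cast(json_text, field_names):
    # Match JSON keys to the declared field names case-insensitively,
    # emit the declared spelling, and leave the values untouched.
    declared = {name.lower(): name for name in field_names}
    parsed = json.loads(json_text)
    return {
        declared[key.lower()]: value
        for key, value in parsed.items()
        if key.lower() in declared
    }


doc = '{"firstField": "row1FieldValue1", "secondField": "row1FieldValue2"}'
print(naive_cast(doc))
print(case_preserving_cast(doc, ["firstField", "secondField"]))
```

The first call loses the case of both the field names and the stored strings; the second keeps firstField and secondField as declared and returns the values unchanged.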
I have created PR #21602, which appears to address this issue.
The fix in #21602 will be released in 0.286 and the next edge release.
Reopening since the original fix was reverted.
When running the following query:
I got the following results:
The keys of the map are firstField and secondField in the query. However, they become firstfield and secondfield in the query results.
Your Environment
Expected Behavior
The result should preserve the original keys without lowercasing them.
Current Behavior
The keys of a map are converted to lowercase
Possible Solution
Not sure whether this is a CLI or server-side issue.
Steps to Reproduce
SELECT MAP(ARRAY['myFirstRow', 'mySecondRow'], ARRAY[cast(row('row1FieldValue1', 'row1FieldValue2') as row("firstField" varchar, "secondField" varchar)), cast(row('row2FieldValue1', 'row2FieldValue2') as row("firstField" varchar, "secondField" varchar))]) as mapField;
Screenshots (if appropriate)
Context