prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Incorrect comparison of Parquet binary statistics for accented characters #12338

Open igorcalabria opened 5 years ago

igorcalabria commented 5 years ago

On version 0.216, Presto incorrectly assumes that a binary column's statistics are corrupt due to the wrong ordering of accented values. The root cause is probably the naive comparison made by the Slice library here: https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java#L201

I added a simple test case to TupleDomainParquetPredicateTest that should not fail:

    @Test
    public void testAccentedString() throws ParquetCorruptionException {
        String column = "StringColumn";

        assertEquals(getDomain(createUnboundedVarcharType(), 10, stringColumnStats("Áncash", "china"), ID, column,
                true), create(ValueSet.ofRanges(range(createUnboundedVarcharType(), utf8Slice("Áncash"), true, utf8Slice("china"), true)), false));
    }

but it fails with

com.facebook.presto.parquet.ParquetCorruptionException: Corrupted statistics for column "StringColumn" in Parquet file "testFile": [min: Áncash, max: china, num_nulls: 0]

Áncash comes before China, but Presto flags the statistics as corrupt since it does not use natural-language ordering to compare binary statistics. As additional information, the files that led me to this error were generated by Spark.

nezihyigitbasi commented 5 years ago

/cc @zhenxiao

nezihyigitbasi commented 5 years ago

@igorcalabria I took a quick look, and if I do "Áncash".compareTo("China") or SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0 I still get a positive number (so the test still fails), which means "Áncash" lexicographically follows "China". Can you please double check whether your test case is valid?

I think we should replace minSlice.compareTo(maxSlice) > 0 with SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0 anyway. @zhenxiao what do you think?

igorcalabria commented 5 years ago

@nezihyigitbasi I'm pretty sure that C does not precede À, at least for a decent number of languages. If I'm not mistaken, the "naive" sorting will always put special characters after the "normal" ones due to the order on the UTF-8 table. To have proper ordering of text, you need to use a collator.
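To make this concrete, here is a small standalone Java sketch (not Presto code; the Spanish locale is just an arbitrary choice) contrasting naive code-unit ordering with a locale-aware Collator:

    import java.text.Collator;
    import java.util.Locale;

    public class CollationExample
    {
        public static void main(String[] args)
        {
            String min = "Áncash";
            String max = "china";

            // Naive comparison of UTF-16 code units: 'Á' (U+00C1) is above every
            // ASCII letter, so "Áncash" sorts after "china".
            System.out.println(min.compareTo(max) > 0);            // true

            // Locale-aware comparison: a collator treats 'Á' as a variant of 'A',
            // so "Áncash" sorts before "china".
            Collator collator = Collator.getInstance(Locale.forLanguageTag("es"));
            System.out.println(collator.compare(min, max) < 0);    // true
        }
    }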

I don't necessarily agree that the min/max statistics should have correct alphabetic ordering, but if more writers have the same default behaviour as the Spark one, Presto should at least have an option to deal with it. I'll check whether Hive also generates the same statistics for my data and I'll report back. Thanks for the quick answers.

nezihyigitbasi commented 5 years ago

Just to make sure you and others have a workaround, these stats checks can be turned off with the hive.parquet.fail-on-corrupted-statistics config property and the parquet_fail_with_corrupted_statistics session property.
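For example (the catalog name hive and the properties file path below are assumptions; adjust them to your deployment):

    # etc/catalog/hive.properties: disable the check for the whole catalog
    hive.parquet.fail-on-corrupted-statistics=false

    -- or per query, with the Hive catalog registered as "hive"
    SET SESSION hive.parquet_fail_with_corrupted_statistics = false;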

findepi commented 5 years ago

I'm pretty sure that C does not precede À, at least for a decent number of languages. If I'm not mistaken, the "naive" sorting will always put special characters after the "normal" ones due to the order on the UTF-8 table. To have proper ordering of text, you need to use a collator.

@igorcalabria you're right. But I recall there is no "universal collator"; this seems locale-dependent. Is the writer's locale information stored in the Parquet file footer?

zhenxiao commented 5 years ago

Good catch @igorcalabria. As @nezihyigitbasi said, you could disable the statistics check by setting hive.parquet.fail-on-corrupted-statistics=false.

Thank you @nezihyigitbasi, I will propose a fix.

zhenxiao commented 5 years ago

@nezihyigitbasi @findepi @igorcalabria it seems the Parquet footer does not have dedicated locale data (correct me if I am wrong); maybe it could go into the optional key/value pairs in the file footer? @igorcalabria did Spark write it into extra key/value pairs in the Parquet footer?

igorcalabria commented 5 years ago

@zhenxiao This is getting weird: both parquet-tools.jar and Arrow's parquet-reader report no statistics for that column. I'll try to fetch the footer using a lower-level library.

zhenxiao commented 5 years ago

I get a similar result to @nezihyigitbasi: both "Áncash".compareTo("china") and SliceUtf8.compareUtf16BE(utf8Slice("Áncash"), utf8Slice("china")) return positive numbers.

Not sure, is it locale-dependent?

hwb1992 commented 5 years ago

Just to make sure you and others have a workaround, these stats checks can be turned off with the hive.parquet.fail-on-corrupted-statistics config property and the parquet_fail_with_corrupted_statistics session property.

In fact, we have encountered a problem with a Parquet file. We use Hive 2.3.3 and parquet-mr 1.8.1. When we query data from a map column, we get the error Corrupted statistics for column "[xxx_property_name, map, value] BINARY" in Parquet file "xxxxx": [min: ٠٫٠℃, max: 99%, num_nulls: 0]. I read the Parquet source code and the Presto master code, and they both seem to compare unsigned bytes, so I don't know how this error happened. I also changed the comparison to SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0, and the error still occurs. If I turn off hive.parquet.fail-on-corrupted-statistics, everything is OK, but I don't know what the disadvantage of turning it off is. Can you tell me how Presto uses the Parquet statistics to read data quickly? Thanks. @nezihyigitbasi @zhenxiao

zhenxiao commented 5 years ago

Hi @hwb1992, Presto leverages Parquet statistics to improve performance. For example, if there is a predicate a > 10 and the max value of column a is 5, then Presto will skip reading that Parquet row group.

However, when we find weird Parquet statistics, e.g. a Parquet column whose min value is bigger than its max value, we can either: a. error out loudly, or b. ignore the statistics and scan the whole row group.

hive.parquet.fail-on-corrupted-statistics controls the behavior of options a and b.
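Roughly, the skip decision looks like this (a toy sketch, not Presto's actual pruning code):

    public class RowGroupPruningSketch
    {
        // Predicate: a > lowerBoundExclusive. A row group whose max(a) does not exceed
        // the bound cannot contain any matching row, so it can be skipped entirely.
        static boolean canSkipRowGroup(long lowerBoundExclusive, long columnMax)
        {
            return columnMax <= lowerBoundExclusive;
        }

        public static void main(String[] args)
        {
            System.out.println(canSkipRowGroup(10, 5));    // true: max(a) = 5, skip the row group
            System.out.println(canSkipRowGroup(10, 50));   // false: the row group must be read
        }
    }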

hwb1992 commented 5 years ago

Hi @hwb1992, Presto leverages Parquet statistics to improve performance. For example, if there is a predicate a > 10 and the max value of column a is 5, then Presto will skip reading that Parquet row group.

However, when we find weird Parquet statistics, e.g. a Parquet column whose min value is bigger than its max value, we can either: a. error out loudly, or b. ignore the statistics and scan the whole row group.

hive.parquet.fail-on-corrupted-statistics controls the behavior of options a and b.

Thank you very much. Can you tell me which Parquet version Presto uses? Presto integrates the Parquet source code, so I don't know which version it is. I think the reason this happened to me is that our Parquet version is different. We ran an experiment in which we created a simple table A with a map column. The structure of the map is {"name":"xxx","address":"xx"}, and there is only one name in each map. For example, we insert {"name":"tom","address":"china"} and {"name":"tom","address":"asina"}. When we run select min(map['name']) from A and select max(map['name']) from A in Hive, they both return tom. But when we run select min(map['name']) from A in Presto, it may fail with Corrupted statistics for column "[map, map, value] BINARY" in Parquet file "xxxxx": [min: tom, max: china, num_nulls: 0]. As you can see, the max in the Parquet statistics is the max value of address, and select max(map['address']) from A in Hive returns china. So I think it may be because of the different Parquet versions: Presto reads the wrong statistics for the map. @zhenxiao

hwb1992 commented 5 years ago

It's this:

    <dependency>
        <groupId>com.twitter</groupId>
        <artifactId>parquet-hadoop-bundle</artifactId>
        <version>1.6.0</version>
    </dependency>

Does Presto work well with different Parquet versions? The Parquet version used by our Hive is parquet-mr 1.8.1. We find that parquet.thrift has some differences between parquet-mr 1.8.1 and 1.6.0.

hwb1992 commented 5 years ago

message test {
  optional int32 int32_field;
  optional binary string_field (UTF8);
  optional int32 int32_field2;
  optional group map_info (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
}

The map data I insert is:

("first_key1", "١٠٠٪");
("first_key2", "99%");

When I read the statistics of the map, the result is:

min: first_key1, max: first_key2, num_nulls: 0
min: ١٠٠٪, max: 99%, num_nulls: 0

However, the min is bigger than the max in Presto if I run this code:

BinaryStatistics binaryStatistics = (BinaryStatistics) statistics;
Slice minSlice = Slices.wrappedBuffer(binaryStatistics.getMin().getBytes());
Slice maxSlice = Slices.wrappedBuffer(binaryStatistics.getMax().getBytes());
if (minSlice.compareTo(maxSlice) > 0) {
  failWithCorruptionException(failOnCorruptedParquetStatistics, column, id, binaryStatistics);
  return Domain.create(ValueSet.all(type), hasNullValue);
}

It seems Hive 2.3 will ignore the binary min and max statistics from parquet-mr 1.8.0, so I wonder whether you have encountered this problem. @zhenxiao @nezihyigitbasi
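One possible explanation (just a guess; the sketch below is standalone code, not Parquet or Presto internals): the Arabic-Indic digit '١' (U+0661) encodes to UTF-8 bytes starting with 0xD9, which sorts after the ASCII digit '9' (0x39) when bytes are compared as unsigned values, but before it when they are compared as signed Java bytes. If the writer and the reader disagree on signedness, min and max can appear inverted:

    import java.nio.charset.StandardCharsets;

    public class ByteOrderSketch
    {
        // Lexicographic byte comparison, either unsigned (0x00..0xFF) or signed (-128..127).
        static int compareBytes(byte[] a, byte[] b, boolean unsigned)
        {
            int length = Math.min(a.length, b.length);
            for (int i = 0; i < length; i++) {
                int left = unsigned ? (a[i] & 0xFF) : a[i];
                int right = unsigned ? (b[i] & 0xFF) : b[i];
                if (left != right) {
                    return Integer.compare(left, right);
                }
            }
            return Integer.compare(a.length, b.length);
        }

        public static void main(String[] args)
        {
            byte[] arabic = "١٠٠٪".getBytes(StandardCharsets.UTF_8);
            byte[] ascii = "99%".getBytes(StandardCharsets.UTF_8);

            System.out.println(compareBytes(arabic, ascii, true) > 0);    // true: "١٠٠٪" > "99%" unsigned
            System.out.println(compareBytes(arabic, ascii, false) < 0);   // true: "١٠٠٪" < "99%" signed
        }
    }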

hwb1992 commented 5 years ago

The comparison strategies of Parquet and Presto seem to be different.

zhenxiao commented 5 years ago

Hi @hwb1992, we did not see this problem. To ignore the sensitivity to min/max correctness, you could turn off the hive.parquet.fail-on-corrupted-statistics config property and the parquet_fail_with_corrupted_statistics session property.

hwb1992 commented 5 years ago

All right, I have uploaded my data; you can check it if you want: test.parquet.zip. In the Parquet file, the min and max statistics of map.value are min: ١٠٠٪, max: 99%, num_nulls: 0, but Presto thinks the min value (١٠٠٪) is bigger than the max value (99%). @zhenxiao

thoralf-gutierrez commented 5 years ago

We hit this problem as well, also with an irregular string as the minimum in the Parquet statistics.

If the reason for this is that the comparison strategies for Parquet and Presto are different, that sounds like pretty bad news to me.

Say we have strings A, B, C, and suppose the order for Parquet is A, B, C but for Presto it is A, C, B. (I don't mean to imply that Parquet's order is correct while Presto's isn't; this is just an example.)

In the best case, say we have a Parquet page of Bs and Cs, the stats will show min=B and max=C and Presto will raise the corrupt stats exception.

In the worst case, say we have a Parquet page of As, Bs, and Cs, the stats will show min=A, max=C. Presto will not raise an exception as it also finds that A < C. But if Presto is looking for Bs, it will completely skip the page because it believes that B > C. No exception will be raised and Presto will return the wrong result.

Am I missing something?
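To make the worst case concrete, here is a toy sketch with a made-up reader-side ordering (not the real comparators used by Parquet or Presto):

    import java.util.Comparator;
    import java.util.List;

    public class PruningMismatchSketch
    {
        public static void main(String[] args)
        {
            // Reader-side ordering in this example: A < C < B.
            // The stats below were written for a page containing A, B, and C
            // under the writer's ordering A < B < C, giving min=A, max=C.
            List<String> readerOrder = List.of("A", "C", "B");
            Comparator<String> reader = Comparator.comparingInt(readerOrder::indexOf);

            String min = "A";
            String max = "C";
            String lookup = "B";

            // Corruption check: under the reader's ordering min <= max, so nothing is flagged.
            System.out.println(reader.compare(min, max) <= 0);    // true

            // Pruning: B compares greater than max=C under the reader's ordering, so the
            // page is skipped even though it actually contains B, silently dropping rows.
            boolean skip = reader.compare(lookup, min) < 0 || reader.compare(lookup, max) > 0;
            System.out.println(skip);                             // true
        }
    }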

tooptoop4 commented 5 years ago

bump @nezihyigitbasi

thoralf-gutierrez commented 5 years ago

ping @findepi -- can someone from the Presto team take a look at this?

findepi commented 5 years ago

@thoralf-gutierrez I think I know how this should be addressed. Let's talk on Presto Community Slack (https://trino.io/community.html).

thoralf-gutierrez commented 5 years ago

FYI the discussion is happening in the #parquet-stats channel and we've created a new issue in the new repo here: https://github.com/prestosql/presto/issues/1462

aidan-melen commented 4 years ago

What is the parquet_fail_with_corrupted_statistics session property?