oap-project / gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Apache License 2.0

Inconsistent results when decoding GBK-encoded data in the URL decoder #1116

Open PHILO-HE opened 1 year ago

PHILO-HE commented 1 year ago

The root cause is inconsistent decoding of GBK-encoded data when converting it to UTF-8. The Hive UDF calls the JDK library and decodes with a UTF-8 decoder. Our columnar UDF implementation, by contrast, behaves like a GBK decoder. How non-UTF-8 data gets decoded is implementation-dependent, so the two paths can return different results for the same input.
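To illustrate the mismatch on the JDK side, here is a minimal sketch. It assumes the Hive UDF ends up in something like `java.net.URLDecoder.decode` with a UTF-8 charset (an assumption, not confirmed from the Hive source), and that the running JDK ships the `GBK` charset. The class name `GbkUrlDecodeDemo`, the helper `decodeWith`, and the sample input are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of gazelle_plugin or Hive.
public class GbkUrlDecodeDemo {

    // Percent-decode `encoded` and interpret the resulting bytes in the given charset.
    static String decodeWith(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Thrown when the JDK does not provide the requested charset (e.g. GBK).
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+6587, percent-encoded from their GBK bytes D6 D0 CE C4.
        String encoded = "%D6%D0%CE%C4";

        // Assumed Hive-like path: the bytes are not valid UTF-8, so the JDK
        // substitutes the replacement character U+FFFD.
        String asUtf8 = decodeWith(encoded, "UTF-8");

        // GBK-like path: the same bytes decode to the original characters.
        String asGbk = decodeWith(encoded, "GBK");

        System.out.println("UTF-8 decoder: " + asUtf8);
        System.out.println("GBK decoder:   " + asGbk);
        System.out.println("Same result?   " + asUtf8.equals(asGbk));
    }
}
```

Under these assumptions the two decoders disagree on the same percent-encoded input, which matches the kind of inconsistency described above.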