prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

CAST(varchar as <integral type>) doesn't trim leading and trailing whitespaces, but fails #21315

Closed gggrace14 closed 4 days ago

gggrace14 commented 1 year ago

Current behavior of casting varchar to integral types is to throw exception if the varchar is valid but contains leading or trailing whitespaces. For example,

SELECT CAST('100 ' AS BIGINT)

throws the following exception

com.facebook.presto.spi.PrestoException: Cannot cast '100 ' to BIGINT
    at com.facebook.presto.type.VarcharOperators.castToBigint(VarcharOperators.java:185)

This is different from the behavior described in ANSI SQL

... The declared type of the result of the \ is TD... let SD be the declared type of the \... let SV be its value... If SD is character string, then SV is replaced by SV with any leading or trailing \s removed. Case: i) If SV does not comprise a \, then an exception condition is raised. ii) Otherwise, let LT be that \. The \ is equivalent to CAST ( LT AS TD )

Accordingly, TRY(CAST(varchar AS ) and TRY_CAST(varchar AS ) return NULL rather than the actual valid integral number.

Should Presto change the above CAST behavior to accept and trim leading and trailing whitespaces and be compliant with ANSI SQL?

The exception is thrown because the implementation of casting varchar to integral types directly delegates to Java, ex. https://github.com/prestodb/presto/blob/04138bc26a4fc801630fe936463d89d071d00c0a/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L176-L187 And Java does not accept whitespaces or any character other than decimal digits and sign character. https://docs.oracle.com/javase/8/docs/api/java/lang/Long.html#parseLong-java.lang.String-

Your Environment

Expected Behavior

Casting of varchar trims leading and trailing whitespaces.

Current Behavior

Casting of varchar to integral types throws exception if the varchar contains leading or trailing whitespaces

Possible Solution

Casting of varchar trims leading and trailing whitespaces before delegating to Java Long.parseLong().

Steps to Reproduce

Screenshots (if appropriate)

Context

Throwing exception makes Presto have a different semantics of above CAST from ANSI SQL and other mainstream engines or libraries that compliant with ANSI SQL, ex., MySQL, folly, etc.

gggrace14 commented 1 year ago

cc @mbasmanova @kaikalur

mbasmanova commented 1 year ago

CC: @aditi-pandit @tdcmeehan @majetideepak

mbasmanova commented 1 year ago

CC: @amitkdutta

mbasmanova commented 1 year ago

Velox currently allows leading and trailing whitespace in cast(varchar as ). We are going to keep this behavior. It would be nice to change Presto to match as well.

agrawaldevesh commented 1 year ago

Interesting! I am just curious what do other systems like MySQL, duckdb etc do here ? This should be easily fixable in presto Java without breaking any client assumptions.

spershin commented 1 year ago

I believe this case should be easily fixable by making Presto's behavior compiling with ANSI SQL.

I doubt any client/user would expect and depend on a throw for varchar -> number conversion because of some spaces (although this is not impossible, plenty of weird cases out there).

ANSI SQL says it very clear, thanks @gggrace14 for digging this out!

DubkDB should ignore whitespaces too - Velox uses it in the unit tests.

tdcmeehan commented 1 year ago

Related to #21001 (another blank padding bug)

This works fine in Postgres. My guess is this is because in Postgres, it's not a conversion from varchar to bigint, but char(3) to bigint. The extra space is treated as blank padding, Postgres is consistent in ignoring the blank padding for comparisons, and also apparently when converting between types.

Even this query succeeds, and I suspect it's because there's an implicit coercion from varchar to char for the sake of the cast: SELECT CAST (CAST('100 ' AS VARCHAR(4)) AS INTEGER);

In Presto, it's actually a conversion from varchar to bigint. This causes an issue here, but also in the issue above. For varchar, I don't think blank padding semantics apply.

mbasmanova commented 1 year ago

FYI, there will be result differences between Presto and Velox for try_cast, e.g. NULL in Presto vs. 12 in Velox:

presto:di> select try_cast(' 12' as integer);
 _col0
-------
 NULL
(1 row)
gggrace14 commented 1 year ago

FYI, there will be result differences between Presto and Velox for try_cast, e.g. NULL in Presto vs. 12 in Velox:

presto:di> select try_cast(' 12' as integer);
 _col0
-------
 NULL
(1 row)

Yes, this is true. I should've also included in the description.

So a common case we can predict is the result will have more records if the user has filter like try_cast() IS NOT NULL.

If we want to proceed with the alignment, we should make it into a major Presto release.

spershin commented 8 months ago

Note, that Presto's behavior is also inconsistent. It ignores whitespaces for floating point numbers conversion:

SELECT CAST('  6.0 ' AS REAL)
 _col0
-------
   6.0

But throws an error for integers. Only because Java itself has this behavior, rather than the conscious choice of a query engine. Presto Java uses Long.parseLong, which behaves like this:

Parses the string argument as a signed decimal long. The characters in the string must all be decimal digits, except that the first character may be an ASCII minus sign '-' (\u002D') to indicate a negative value or an ASCII plus sign '+' ('\u002B') to indicate a positive value. The resulting long value is returned, exactly as if the argument and the radix 10 were given as arguments to the parseLong(java.lang.String, int) method. Note that neither the character L ('\u004C') nor l ('\u006C') is permitted to appear at the end of the string as a type indicator, as would be permitted in Java programming language source code.

spershin commented 8 months ago

Submitting https://github.com/prestodb/presto/pull/22284 to change the behavior.