memsql / singlestore-spark-connector

A connector for SingleStore and Spark
Apache License 2.0

Truncate Column Rename Step #93

Open yannistze opened 1 week ago

yannistze commented 1 week ago

Hello,

If I understand the order of operations in SQLPushdownRule.scala correctly, the first step after attaching the shared context to all the relations is to rename every column in each relation to a unique name, following the normalizedExprIdMap logic: https://github.com/memsql/singlestore-spark-connector/blob/8e70dffe5a9dfd5b2bf6c8fb856e035b5a8c9825/src/main/scala/com/singlestore/spark/SQLPushdownRule.scala#L41-L45
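To make sure I am reading this right, here is a minimal sketch of what I understand the renaming step to do (illustrative only; the alias scheme and the expression IDs below are made up, not the connector's actual code):

    // Sketch: every column of every relation gets a globally unique alias
    // keyed by its expression ID.
    import scala.collection.mutable

    val exprIdToAlias = mutable.Map.empty[Long, String]

    def normalizedAlias(exprId: Long): String =
      exprIdToAlias.getOrElseUpdate(exprId, s"c_${exprIdToAlias.size}")

    // A relation exposing (user_id -> exprId 7, name -> exprId 9) would then
    // be rendered as:  SELECT user_id AS c_0, name AS c_1 FROM users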

This makes sense as a way to avoid duplicate-name issues later on, for example in joins, since those are wrapped in a selectAll (SELECT *) statement rather than a select statement with explicit aliases:
https://github.com/memsql/singlestore-spark-connector/blob/8e70dffe5a9dfd5b2bf6c8fb856e035b5a8c9825/src/main/scala/com/singlestore/spark/SQLGen.scala#L470
https://github.com/memsql/singlestore-spark-connector/blob/8e70dffe5a9dfd5b2bf6c8fb856e035b5a8c9825/src/main/scala/com/singlestore/spark/SQLGen.scala#L488
https://github.com/memsql/singlestore-spark-connector/blob/8e70dffe5a9dfd5b2bf6c8fb856e035b5a8c9825/src/main/scala/com/singlestore/spark/SQLGen.scala#L502
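For concreteness, the generated join SQL I have in mind looks roughly like this (an assumed shape, not copied from SQLGen); without the earlier renaming, both subqueries could expose a column literally named id, and the outer SELECT * would be ambiguous:

    // Assumed shape of the generated join SQL, for illustration only.
    val left  = "SELECT id AS c_0, name AS c_1 FROM users"
    val right = "SELECT id AS c_2, total AS c_3 FROM orders"
    val join  = s"SELECT * FROM ($left) AS q_0 JOIN ($right) AS q_1 ON q_0.c_0 = q_1.c_2"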

The issue I am facing with this approach is that the query string becomes too long, which causes the schema fetch and the subsequent PreparedStatement code to fail.

Tried:

I was wondering if there is any guidance for scenarios like the one I am facing?
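(For reference, the only blunt mitigation I can think of is turning pushdown off for the affected read, sketched below assuming an existing SparkSession named spark; the disablePushdown option is documented by the connector, but this of course gives up pushdown entirely.)

    // Blunt mitigation sketch: read without SQL pushdown so the connector
    // never generates the long rewritten query. Database/table names are
    // illustrative.
    val df = spark.read
      .format("singlestore")
      .option("disablePushdown", "true")
      .load("mydb.mytable")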

Thanks

AdalbertMemSQL commented 3 days ago

Hello, Could you clarify what error you are encountering?

yannistze commented 3 days ago

> Hello, Could you clarify what error you are encountering?

Hello, sure. The error I get is the following generic one:

java.sql.SQLTransientConnectionException: Driver has reconnect connection after a communications link failure with address=(host=10.133.121.176)(port=3306)(type=primary)
  at com.singlestore.jdbc.client.impl.MultiPrimaryClient.replayIfPossible(MultiPrimaryClient.java:212)
  at com.singlestore.jdbc.client.impl.MultiPrimaryClient.execute(MultiPrimaryClient.java:345)
  at com.singlestore.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:69)
  at com.singlestore.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java:251)
  at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:122)
  at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:122)
  at com.singlestore.spark.JdbcHelpers$.loadSchema(JdbcHelpers.scala:137)
  at com.singlestore.spark.SinglestoreReader.schema$lzycompute(SinglestoreReader.scala:84)
  at com.singlestore.spark.SinglestoreReader.schema(SinglestoreReader.scala:84)
...
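For context, my understanding is that the schema fetch already has to send the entire generated query text just to read metadata, so an over-long query fails at this step before any rows are fetched. A hedged sketch of that shape (assumed, not the connector's actual loadSchema code):

    import java.sql.{Connection, ResultSetMetaData}

    // Assumed shape of a schema probe: prepare the generated query wrapped in
    // a LIMIT 0 subquery and read only the ResultSetMetaData.
    // (Statement cleanup omitted for brevity.)
    def probeSchema(conn: Connection, generatedSql: String): ResultSetMetaData = {
      val stmt = conn.prepareStatement(s"SELECT * FROM ($generatedSql) AS q LIMIT 0")
      stmt.executeQuery().getMetaData
    }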

If I am not mistaken, the exception itself comes from this code path in the JDBC driver:

        // no transaction, but connection is now up again.
        // changing exception to SQLTransientConnectionException
        throw new SQLTransientConnectionException(
            String.format(
                "Driver has reconnect connection after a communications link failure with %s",
                oldClient.getHostAddress()),
            "25S03");

and "masks" the root cause 😞