housepower / ClickHouse-Native-JDBC

ClickHouse Native Protocol JDBC implementation
https://housepower.github.io/ClickHouse-Native-JDBC/
Apache License 2.0

Could we support creating a ClickHouse table from Spark with columns wrapped in Nullable types, so that null data can be inserted into ClickHouse via Spark? #323

Closed mullerhai closed 12 months ago

mullerhai commented 3 years ago

Environment

Hi, I want to insert data into ClickHouse with Spark, and I am using libraryDependencies += "com.github.housepower" %% "clickhouse-integration-spark" % "2.5.2" — it is a great tool. But I hit an error when my data contains null columns. I think it would work if the column schema carried nullable data types. Could we have an option to infer nullable types, so that Spark can create the ClickHouse table and insert into ClickHouse with this tool?


pan3793 commented 3 years ago

Unfortunately, Spark does not expose a field's nullability to the JdbcDialect: https://github.com/housepower/ClickHouse-Native-JDBC/blob/0d5ee97e2dc1ead0d86f23928f71ef43c4834fc3/clickhouse-integration/clickhouse-integration-spark/src/main/scala/org/apache/spark/sql/jdbc/ClickHouseDialect.scala#L100-L104
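
For context, a minimal sketch of the Spark API in question (the dialect object below is illustrative, not the shipped one): getJDBCType receives only the Catalyst DataType, never the enclosing StructField, so the column's nullable flag is simply not visible at this point.

  import java.sql.Types
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
  import org.apache.spark.sql.types._

  // Illustration only: Spark hands the dialect just the DataType.
  // StructField.nullable never reaches this method, so the mapping
  // cannot choose between Int32 and Nullable(Int32).
  object IllustrationDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:clickhouse")

    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case IntegerType => Some(JdbcType("Int32", Types.INTEGER)) // nullable? unknown here
      case _ => None
    }
  }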

mullerhai commented 3 years ago

Unfortunately, Spark does not expose a field's nullability to the JdbcDialect: https://github.com/housepower/ClickHouse-Native-JDBC/blob/0d5ee97e2dc1ead0d86f23928f71ef43c4834fc3/clickhouse-integration/clickhouse-integration-spark/src/main/scala/org/apache/spark/sql/jdbc/ClickHouseDialect.scala#L100-L104

What do you think about overriding the mapping so every type is wrapped in Nullable, like this?

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("Nullable(String)", Types.VARCHAR))
    // ClickHouse doesn't have the concept of encodings. Strings can contain an arbitrary set of bytes,
    // which are stored and output as-is.
    // See detail at https://clickhouse.tech/docs/en/sql-reference/data-types/string/
    case BinaryType => Some(JdbcType("Nullable(String)", Types.BINARY))
    case BooleanType => Some(JdbcType("Nullable(UInt8)", Types.BOOLEAN))
    case ByteType => Some(JdbcType("Nullable(Int8)", Types.TINYINT))
    case ShortType => Some(JdbcType("Nullable(Int16)", Types.SMALLINT))
    case IntegerType => Some(JdbcType("Nullable(Int32)", Types.INTEGER))
    case LongType => Some(JdbcType("Nullable(Int64)", Types.BIGINT))
    case FloatType => Some(JdbcType("Nullable(Float32)", Types.FLOAT))
    case DoubleType => Some(JdbcType("Nullable(Float64)", Types.DOUBLE))
    case t: DecimalType => Some(JdbcType(s"Nullable(Decimal(${t.precision},${t.scale}))", Types.DECIMAL))
    case DateType => Some(JdbcType("Nullable(Date)", Types.DATE))
    case TimestampType => Some(JdbcType("Nullable(DateTime)", Types.TIMESTAMP))
    case _ => None
  }

pan3793 commented 3 years ago

What do you think about overriding the mapping so every type is wrapped in Nullable, like this?

The point here is that we can't get the nullability of the Catalyst type dt, so for IntegerType we don't know how to map it to a ClickHouse type: Int32 or Nullable(Int32)?

Currently, we map IntegerType to Int32, because Nullable is not suitable for sorting key columns. And, as you said, we can't get the expected result for nullable columns.
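
To make that concrete, here is a hypothetical schema where two columns differ only in nullability:

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),   // wants Int32
    StructField("score", IntegerType, nullable = true)  // wants Nullable(Int32)
  ))
  // getJDBCType is called with IntegerType for both fields, so the
  // dialect cannot tell them apart and maps both to Int32.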

Due to the limitation described above, we don't recommend depending heavily on this auto-create-table feature. A workaround is to let it auto-create the table, fetch the DDL with SHOW CREATE TABLE xxx, then tune the DDL manually.
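
A minimal sketch of that workaround over plain JDBC (the table name default.example and the server address are hypothetical; adjust to your cluster):

  import java.sql.DriverManager

  val conn = DriverManager.getConnection("jdbc:clickhouse://127.0.0.1:9000")
  val stmt = conn.createStatement()

  // 1. After Spark has auto-created the table, fetch the generated DDL.
  val rs = stmt.executeQuery("SHOW CREATE TABLE default.example")
  while (rs.next()) println(rs.getString(1))

  // 2. Drop and recreate the table by hand, wrapping columns that may
  //    hold nulls, e.g.
  //      CREATE TABLE default.example (id Int32, score Nullable(Int32))
  //      ENGINE = MergeTree() ORDER BY id
  // 3. Point subsequent Spark writes at the tuned table with SaveMode.Append.
  stmt.close()
  conn.close()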

pan3793 commented 3 years ago

Considering the limitations of the JDBC DataSource API, we are planning to build a native ClickHouse connector based on the DataSourceV2 API. It's a long-term solution and we don't have an ETA for this feature.

mullerhai commented 3 years ago

Do we support array fields when creating a ClickHouse table from Spark? I hit this error:

  Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC type for array
      at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getJdbcType$2(JdbcUtils.scala:188)
      at scala.Option.getOrElse(Option.scala:189)
      at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getJdbcType(JdbcUtils.scala:188)
      at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$4(JdbcUtils.scala:759)
      at scala.collection.immutable.Map$EmptyMap$.getOrElse(Map.scala:110)
      at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$3(JdbcUtils.scala:759)
      at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
      at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)

pan3793 commented 3 years ago

Not yet

mullerhai commented 3 years ago

Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC type for array

guoruyi-aa commented 3 years ago

Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC type for array

If I want to get a JDBC type for array, what should I do? Thank you~

pan3793 commented 3 years ago

Please provide a specific case
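
For later readers, a hedged sketch of the direction such a case could take: Spark lets you register a custom JdbcDialect, so one could in principle add an Array mapping on top of the scalar ones. Note the maintainer's "Not yet" above — this sketch only covers the DDL/type-mapping side, the write path may still fail, and the object name and URL prefix here are illustrative assumptions.

  import java.sql.Types
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
  import org.apache.spark.sql.types._

  // Sketch: map Spark ArrayType(e) to ClickHouse Array(E) by reusing the
  // element type's mapping. Unsupported element types fall through to None.
  object ClickHouseArrayDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:clickhouse")

    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case ArrayType(et, _) =>
        getJDBCType(et).map(e => JdbcType(s"Array(${e.databaseTypeDefinition})", Types.ARRAY))
      case StringType  => Some(JdbcType("String", Types.VARCHAR))
      case IntegerType => Some(JdbcType("Int32", Types.INTEGER))
      // ... remaining scalar cases as in the shipped ClickHouseDialect
      case _ => None
    }
  }

  JdbcDialects.registerDialect(ClickHouseArrayDialect)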