scylladb / scylla-migrator

Migrate data extract using Spark to Scylla, normally from Cassandra
Apache License 2.0

Migrator fails to create DynamoDB table in Scylla if Dynamo table is larger than 100 attributes #86

Closed: apesaDS closed this issue 3 weeks ago

apesaDS commented 1 year ago

When migrating from DynamoDB to Scylla Alternator, the Migrator fails to create the table in Scylla if the DynamoDB table has more than 100 attributes. In DynamoDataSourceReader.scala, line 97:

    if (typeSeq.size > 100) throw new RuntimeException("Schema inference not possible, too many attributes in table.")

I changed it to typeSeq.size > 150. That gets past the exception, but the Migrator still fails to create the table I need in Scylla. It seems to fail in NameTools.scala, in this case statement:

    case Some(KeyspaceSuggestions(keyspaces)) =>
      s"Couldn't find table $table in $keyspace - Found similar keyspaces with that table:\n${keyspaces.map(k => s"$k.$table").mkString("\n")}"

tarzanek commented 1 year ago

This will be partially fixed by https://github.com/scylladb/scylla-migrator/tree/spark-3.1.1. It adds https://github.com/scylladb/spark-dynamodb/commit/978c0cd7e49f99b976d0192db0a027d573f83991, which supports inferSchema=false, but that option would also need to be exposed in the Migrator for the above to work.
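For illustration, here is a minimal sketch of what skipping inference could look like from the Spark side, assuming the connector accepts a user-supplied schema (as the linked commit suggests). The format short name, option names, and the table/attribute names are assumptions for the example, not the Migrator's actual configuration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object ExplicitSchemaRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dynamodb-explicit-schema").getOrCreate()

        // Declare the attributes up front instead of letting the connector
        // scan the table and infer (and cap) the attribute list.
        val schema = StructType(Seq(
          StructField("id", StringType, nullable = false),
          StructField("price", DecimalType(38, 10)),
          StructField("updated_at", LongType)
        ))

        val df = spark.read
          .format("dynamodb")                  // assumed data source short name
          .schema(schema)                      // user-supplied schema, no inference
          .option("tableName", "source_table") // hypothetical table name
          .load()

        df.printSchema()
      }
    }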

apesaDS commented 1 year ago

Thanks, what would it take to expose this in the Migrator?

arkgal commented 1 year ago

@apesaDS @tarzanek What is the current status here? It seems to be taking quite some time. Do we have any ETA on it? Thanks, guys!

arkgal commented 1 year ago

@apesaDS @tarzanek Do we have any progress/updates here? Thanks!

cc: @DoronArazii

tarzanek commented 1 year ago

The plan is to fix it with the next Spark 3 version, which lifts this limit. For now, increasing the limit is the workaround.

tarzanek commented 1 year ago

By the way, the quick (and dirty) patch in the spark-dynamodb module is:

diff --git a/src/main/scala/com/audienceproject/spark/dynamodb/datasource/DynamoDataSourceReader.scala b/src/main/scala/com/audienceproject/spark/dynamodb/datasource/DynamoDataSourceReader.scala
index fdf6e27..6052595 100644
--- a/src/main/scala/com/audienceproject/spark/dynamodb/datasource/DynamoDataSourceReader.scala
+++ b/src/main/scala/com/audienceproject/spark/dynamodb/datasource/DynamoDataSourceReader.scala
@@ -94,7 +94,7 @@ class DynamoDataSourceReader(parallelism: Int,
         })
         val typeSeq = typeMapping.map({ case (name, sparkType) => StructField(name, sparkType) }).toSeq

-        if (typeSeq.size > 100) throw new RuntimeException("Schema inference not possible, too many attributes in table.")
+        if (typeSeq.size > 150) throw new RuntimeException("Schema inference not possible, too many attributes in table.")

         StructType(typeSeq)
     }

tarzanek commented 1 year ago

FWIW, after the above I was able to reproduce an issue with the conversion of BigDecimal to Decimal, but I don't have a fix for that yet.
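Not a connector fix, but one possible way to sidestep that conversion from the application side is to read the problematic numeric attribute as a string and cast it explicitly. This assumes the failure comes from DynamoDB numbers that do not fit Spark's default decimal precision/scale, and it also assumes the connector will surface the attribute as a string when the schema declares StringType; attribute and table names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    object DecimalWorkaround {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("decimal-workaround").getOrCreate()

        // Read the troublesome numeric attribute as a plain string...
        val rawSchema = StructType(Seq(
          StructField("id", StringType, nullable = false),
          StructField("amount", StringType)
        ))

        val raw = spark.read
          .format("dynamodb")                  // assumed data source short name
          .schema(rawSchema)
          .option("tableName", "source_table") // hypothetical table name
          .load()

        // ...then cast it to a decimal type wide enough for the data.
        val typed = raw.withColumn("amount", col("amount").cast(DecimalType(38, 10)))
        typed.printSchema()
      }
    }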

julienrf commented 3 weeks ago

I believe this issue has been fixed since we don’t infer the table schema anymore. Please let me know if you are still blocked.