Closed — rajkve closed this issue 3 years ago
The old Spark connector and the mssql spark connector should have the same perf, as both use the bulk APIs. I have not benchmarked them against each other, though.
Can you please share configuration details, e.g. data set used, Spark configuration, time in seconds to write by each, methodology, etc.
Hi -
Spark cluster - 2.4.5, Scala 2.11; Driver - 32 GB memory, 16 cores; Worker - 23 GB, 4 cores (min nodes 5, max nodes 20)
Source - ADLS GEN1 Parquet file size - 500 MB (5 Million records)
Target - Azure SQL DB (Premium P4)
Table size - 12 GB (Most of the fields are text fields)
Data upload performance:
- using this connector: 55 mins (BEST_EFFORT + TAB_LOCK = true); source code is in the first post above
- using the old Spark connector: 10 mins
Hi, maybe you should try increasing the batchSize, as this is set to 1,000 by default.
Hi, I tried with batch sizes of 10,000, 50,000 and 500,000; each takes approximately the same time.
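For anyone else sweeping batch sizes: per the sql-spark-connector README, the relevant write options are `batchsize`, `tableLock` and `reliabilityLevel`. A minimal helper (plain Python, illustrative only; the URL and credential values below are placeholders) that builds the option map so different batch sizes can be tried in a loop:

```python
def bulk_write_options(url, table, user, password,
                       batchsize=10000, table_lock=True):
    """Build the option map for a sql-spark-connector bulk write.

    Option names follow the connector's documented write options;
    vary `batchsize` to reproduce the sweep discussed above.
    """
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "reliabilityLevel": "BEST_EFFORT",    # fastest path, no duplicate checks
        "batchsize": str(batchsize),          # rows per bulk-copy batch
        "tableLock": str(table_lock).lower(), # TABLOCK helps minimal logging on heaps
    }

# Usage with an existing Spark DataFrame `df` (not run here):
# opts = bulk_write_options(url, "dbo.target", user, pwd, batchsize=50000)
# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#     .mode("append").options(**opts).save()
```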
@rajkve is your base table a heap, or does it have a columnstore index?
We have the same problem: parquet in ADLS Gen2 to a table with 30 columns of types char, varchar, int, bit and decimal(18,6). The table does not have any constraints, primary keys, or indexes.
| Method | Nº Rows | BatchSize | TableLock | Timeout | ReliabilityLevel | Time (min) | Time estimate (min) |
|---|---|---|---|---|---|---|---|
| bulkCopyToSqlDB | 1.000.000 | 1.000 | true | 1000 | | 1,65 | 36,3 |
| bulkCopyToSqlDB | 1.000.000 | 500.000 | true | 1000 | | 1,59 | 34,98 |
| bulkCopyToSqlDB | 5.000.000 | 1.000 | true | 1000 | | 8,11 | 35,684 |
| bulkCopyToSqlDB | 5.000.000 | 500.000 | true | 1000 | | 7,6 | 33,44 |
| bulkCopyToSqlDB | 10.000.000 | 1.000 | true | 1000 | | 15,54 | 34,188 |
| bulkCopyToSqlDB | 10.000.000 | 100.000 | true | 1000 | | 15,49 | 34,078 |
| JDBC | 1.000.000 | 1.000 | | | | 2,18 | 47,96 |
| JDBC | 1.000.000 | 500.000 | | | | 2,77 | 60,94 |
| JDBC | 5.000.000 | 1.000 | | | | 10,66 | 46,904 |
| JDBC | 5.000.000 | 500.000 | | | | 13,68 | 60,192 |
| JDBC | 10.000.000 | 1.000 | | | | 21,1 | 46,42 |
| JDBC | 10.000.000 | 10.000 | | | | 21,1 | 46,42 |
| jdbc.spark | 1.000.000 | 1.000 | true | | BEST_EFFORT | 2,21 | 48,62 |
| jdbc.spark | 1.000.000 | 10.000 | true | | BEST_EFFORT | 2,12 | 46,64 |
| jdbc.spark | 1.000.000 | 100.000 | true | | BEST_EFFORT | 2,12 | 46,64 |
| jdbc.spark | 1.000.000 | 500.000 | true | | BEST_EFFORT | 2,12 | 46,64 |
| jdbc.spark | 5.000.000 | 1.000 | true | | BEST_EFFORT | 10,63 | 46,772 |
@sirurza I am not able to interpret these results.
Hi, our table has 22.000.000 rows. "Time estimate" is the projected total time to load the full table; the "Time" column is the time it took to load the given Nº Rows.
The values are decimal minutes, i.e. the fractional part is a fraction of a minute: 36,3 means 36 minutes plus 0,3 × 60 = 18 seconds.
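To make the decimal-minutes convention in the table concrete, a tiny helper (plain Python, illustrative only):

```python
def decimal_minutes_to_min_sec(t):
    """Convert a decimal-minutes value like 36.3 into (minutes, seconds)."""
    minutes = int(t)
    seconds = round((t - minutes) * 60)
    return minutes, seconds

# decimal_minutes_to_min_sec(36.3) -> (36, 18), i.e. 36 min 18 s
```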
bulkCopyToSQLDB = https://github.com/Azure/azure-sqldb-spark
Hi Team,
I'm facing a similar issue: the default connector is much faster than the sql-spark-connector, even though I tinkered with the table lock and the batch size... :( I was hoping this connector could help me write 70M rows with 8 nodes (28 GB of RAM each).
@sirurza / @DataBanshee can you please post the snippet of the DataFrame code loading the data? During the run, what are the predominant waits?
a. Current waits - https://www.sqlskills.com/blogs/paul/updated-sys-dm_os_waiting_tasks-script-2/
b. Take 2 snapshots, say 30 seconds apart - https://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/
Hey @denzilribeiro, thank you for your help :) The job seems to get stuck on the last task and hang; 1M rows is the maximum I could achieve without a frozen task.
I also tried sizing up the cluster on the Spark side, with no luck... I have been at this for days and still can't understand why millions of rows are such a challenge. Grateful for your help!
@DataBanshee - a couple of things I am seeing:
a. If using a heap, use tableLock "true".
b. For a single task - are you reading from ADLS/HDFS that is partitioned, or laid out such that 1 task gets more rows, or are the rows evenly distributed?
Will be easier to take this on email -- email me at denzilr at microsoft .
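One quick way to quantify the skew being asked about in (b.) is to count rows per partition; in PySpark, `df.rdd.glom().map(len).collect()` returns those counts. A small, framework-free sketch of the check itself (the threshold is an arbitrary rule of thumb, not from this thread):

```python
def partition_skew(counts):
    """Given rows-per-partition counts, return (max/mean ratio, max, mean).

    A ratio well above 1 means one task does most of the work, which
    matches the 'stuck on the last task' symptom described above.
    """
    mean = sum(counts) / len(counts)
    return max(counts) / mean, max(counts), mean

# In Spark (not run here):
# counts = df.rdd.glom().map(len).collect()
# ratio, biggest, mean = partition_skew(counts)
# if ratio > 2:  # heavily skewed; df.repartition(n) evens it out before the write
#     df = df.repartition(len(counts))
```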
@denzilribeiro Can confirm this connector's bulk insert is slower than the older Spark connector. I have the TPC-DS dataset in parquet and a notebook containing code to insert into Azure SQL Hyperscale using both the old and new connector; the old connector is at least 2x faster. I was writing a blog about the new connector when I stumbled across this issue, and can include the perf comparison there, which will take a few days to complete. Or I can show you right away if you can ping me at anksinha
I ran a benchmark comparing timings by inserting 50 million records into databases with different indexing strategies. You can find it here. Scroll down to the bottom to see the comparison between the old and new connector.
Ack on this issue. I have a repro now. Thanks @ankitbko. I'll be investigating further. For now, what I see is that the bulk API is slow with the new connector; I have to figure out why.
Created a test app to compare performance of the old vs. new connector with indexed tables. Refer here: https://github.com/microsoft/sql-spark-connector/pull/56. Used the TPC-DS store_sales table generated with 1G data.
@rajkve can you try repartitioning down to the same number of cores your SQL DB has? (I just noticed it is the DTU model; try a value of 10-15 and use sp_whoisactive to see how many threads can execute in parallel.)
A. This is always best practice when using bulk load.
B. It looks like your data is badly skewed, something that maybe Scala is handling better than PySpark; repartitioning may help this.
I have used this to load billions of rows into Hyperscale and have never had any perf problems. Also, for the batch size when loading into a heap, you want to set it to something that equates to 8 MB or less, as this is the page size.
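As a back-of-the-envelope for the 8 MB batch-size suggestion above: divide the target bytes by an estimated average row size (e.g. table size / row count from your own data). A sketch, illustrative only; the 12 GB / 5M figures in the example come from the setup reported earlier in this thread:

```python
def batch_rows_for_target(avg_row_bytes, target_bytes=8 * 1024 * 1024):
    """Rows per bulk-copy batch so one batch stays at or under the target
    size (8 MB here, per the suggestion above). avg_row_bytes is an
    estimate, e.g. table size in bytes divided by row count.
    """
    return max(1, target_bytes // avg_row_bytes)

# Example from this thread: ~12 GB table / 5 million rows ≈ 2576 bytes/row
avg_row = (12 * 1024**3) // 5_000_000   # -> 2576
batch = batch_rows_for_target(avg_row)  # -> 3256 rows per batch
```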
All, adding this here again. I don't see any perf issues in the old vs. new connector comparison. I have a test app coded up; refer here: #56. Please try it out and comment back if you still see any issues.
After disabling the schema check, both connectors are comparable in performance. Thanks @shivsood for helping out. I will update my blog this weekend.
Hi All,
Config - Databricks 6.6 (Spark 2.4.5); Target - Azure SQL DB Premium P4
This connector, code:

```python
FLOC_VW.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", tableName) \
    .option("user", Username) \
    .option("password", Password) \
    .option("BEST_EFFORT", "true") \
    .option("tabLock", "true") \
    .save()
```
Minutes taken - 53 on average.
Same configuration with the old Spark connector - 4x faster, <10 mins on average.
Any suggestions where it could be going wrong?