microsoft / sql-spark-connector

Apache Spark Connector for SQL Server and Azure SQL
Apache License 2.0

Very slow performance #13

Closed rajkve closed 3 years ago

rajkve commented 4 years ago

Hi All,

Config - Databricks 6.6 (Spark 2.4.5). Target - Azure SQL DB Premium P4.

Using this connector, with the following code:

    FLOC_VW.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", tableName) \
        .option("user", Username) \
        .option("password", Password) \
        .option("BEST_EFFORT", "true") \
        .option("tabLock", "true") \
        .save()

Minutes taken: 53 on average.

With the same configuration but using the old Spark connector, performance is roughly 4x faster: under 10 minutes on average.

Any suggestions where it could be going wrong?

shivsood commented 4 years ago

The old Spark connector and the mssql Spark connector should have the same perf, as both use the bulk APIs. I have not benchmarked them against each other, though.

Can you please share configuration details, e.g. the data set used, Spark configuration, time in seconds to write with each connector, methodology, etc.?

rajkve commented 4 years ago

Hi -

Spark cluster - 2.4.5, Scala 2.11. Driver - 32 GB memory, 16 cores. Workers - 23 GB, 4 cores each (min nodes 5, max nodes 20).

Source - ADLS Gen1; Parquet file size - 500 MB (5 million records)

Target - Azure SQL DB (Premium P4)

Table size - 12 GB (Most of the fields are text fields)

Data upload performance -

Using this connector - 55 mins (BEST_EFFORT + tabLock = true; source code in the first post above). Using the old Spark connector - 10 mins.

sebbm commented 4 years ago

Hi, maybe you should try increasing the batchSize, as it is set to 1,000 by default.
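For what it's worth, a minimal sketch of that suggestion applied to the write from the first post; `batchsize` is the standard Spark JDBC option name, and 100000 is only an example value, not a recommendation.

```python
# Sketch: same write as the first post, with an explicit (larger) batch size.
FLOC_VW.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", tableName) \
    .option("user", Username) \
    .option("password", Password) \
    .option("tableLock", "true") \
    .option("batchsize", "100000") \
    .save()
```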

rajkve commented 4 years ago

Hi, I tried with batch sizes of 10,000, 50,000, and 500,000 - it takes approximately the same time.

denzilribeiro commented 4 years ago

@rajkve is your base table a heap, or does it have a columnstore index?
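(As a side note for readers following along: one way to check this from the notebook is to query `sys.indexes` over plain JDBC. This is only a sketch, reusing the `url`, `Username`, `Password`, and `tableName` variables from the earlier posts.)

```python
# Sketch: inspect the target table's indexes. type_desc 'HEAP' means a heap;
# 'CLUSTERED COLUMNSTORE' / 'NONCLUSTERED COLUMNSTORE' indicate a columnstore index.
index_info = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("user", Username)
    .option("password", Password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("query",
            "SELECT name, type_desc FROM sys.indexes "
            f"WHERE object_id = OBJECT_ID('{tableName}')")
    .load()
)
index_info.show()
```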

sirurza commented 4 years ago

We have the same problem: Parquet in ADLS Gen2 written to a table with 30 columns of types char, varchar, int, bit, and decimal(18,6). The table does not have any constraints, primary keys, or indexes.

| Method | Nº Rows | BatchSize | TableLock | Timeout | ReliabilityLevel | Time (min) | Est. total time (min) |
|---|---|---|---|---|---|---|---|
| bulkCopyToSqlDB | 1,000,000 | 1,000 | true | 1000 | | 1.65 | 36.3 |
| bulkCopyToSqlDB | 1,000,000 | 500,000 | true | 1000 | | 1.59 | 34.98 |
| bulkCopyToSqlDB | 5,000,000 | 1,000 | true | 1000 | | 8.11 | 35.684 |
| bulkCopyToSqlDB | 5,000,000 | 500,000 | true | 1000 | | 7.6 | 33.44 |
| bulkCopyToSqlDB | 10,000,000 | 1,000 | true | 1000 | | 15.54 | 34.188 |
| bulkCopyToSqlDB | 10,000,000 | 100,000 | true | 1000 | | 15.49 | 34.078 |
| JDBC | 1,000,000 | 1,000 | | | | 2.18 | 47.96 |
| JDBC | 1,000,000 | 500,000 | | | | 2.77 | 60.94 |
| JDBC | 5,000,000 | 1,000 | | | | 10.66 | 46.904 |
| JDBC | 5,000,000 | 500,000 | | | | 13.68 | 60.192 |
| JDBC | 10,000,000 | 1,000 | | | | 21.1 | 46.42 |
| JDBC | 10,000,000 | 10,000 | | | | 21.1 | 46.42 |
| jdbc.spark | 1,000,000 | 1,000 | true | | BEST_EFFORT | 2.21 | 48.62 |
| jdbc.spark | 1,000,000 | 10,000 | true | | BEST_EFFORT | 2.12 | 46.64 |
| jdbc.spark | 1,000,000 | 100,000 | true | | BEST_EFFORT | 2.12 | 46.64 |
| jdbc.spark | 1,000,000 | 500,000 | true | | BEST_EFFORT | 2.12 | 46.64 |
| jdbc.spark | 5,000,000 | 1,000 | true | | BEST_EFFORT | 10.63 | 46.772 |
shivsood commented 4 years ago

@sirurza I am not able to interpret these results.

sirurza commented 4 years ago

Hi. Our table has 22,000,000 rows. The estimate column is the projected total time for the complete load; the Time column is the time it took to write the number of rows shown.

The time values are decimal minutes, i.e. the fractional part is a fraction of a minute: 36.3 means 36 minutes plus 0.3 × 60 ≈ 18 seconds.

bulkCopyToSQLDB = https://github.com/Azure/azure-sqldb-spark

DataBanshee commented 4 years ago

Hi Team,

I'm facing a similar issue: the default connector is much faster than the sql-spark-connector, even though I tinkered with the table lock and the batch size... :( I was hoping that this connector could help me write 70M rows with 8 nodes (28 GB of RAM each).

denzilribeiro commented 4 years ago

@sirurza / @DataBanshee can you please post the snippet of the DataFrame write that loads the data? During the run, what are the predominant waits?
a. Current waits - https://www.sqlskills.com/blogs/paul/updated-sys-dm_os_waiting_tasks-script-2/
b. Take two snapshots, say 30 seconds apart - https://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/

DataBanshee commented 4 years ago

Hey @denzilribeiro, thank you for your help :) The job seems to get stuck in the last task and hang. 1M rows is the maximum I could achieve without a frozen task (screenshots attached).

I also tried to scale up the cluster on the Spark side, with no luck... I have been on this for days and still can't understand why millions of rows are such a challenge. Grateful for your help!

denzilribeiro commented 4 years ago

@DataBanshee - A couple of things I am seeing:
a. If using a heap, set tableLock to "true".
b. For the single stuck task - are you reading from ADLS/HDFS with a partitioning scheme where one task gets far more rows, or are the rows evenly distributed? (A quick way to check this is sketched below.)
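A minimal sketch of that per-partition check, assuming `df` is the DataFrame being written (this is a generic Spark check, not part of the connector):

```python
from pyspark.sql.functions import spark_partition_id

# Sketch: count rows per input partition to see whether one task gets far
# more rows than the others (i.e. the data is skewed).
df.withColumn("pid", spark_partition_id()) \
  .groupBy("pid").count() \
  .orderBy("count", ascending=False) \
  .show()
```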

It will be easier to take this over email -- email me at denzilr at microsoft .

ankitbko commented 4 years ago

@denzilribeiro Can confirm this connector's bulk insert is slower than the older Spark connector. I have the TPC-DS dataset in Parquet and a notebook containing code to insert into Azure SQL Hyperscale using both the old and new connectors. The old connector is at least 2x faster. I was writing a blog post about the new connector when I stumbled across this issue and can include the perf comparison there, which will take a few days to complete. Or I can show you right away if you ping me at anksinha

ankitbko commented 4 years ago

I ran a benchmark comparing timings for inserting 50 million records into databases with different indexing strategies. You can find it here. Scroll down to the bottom to see the comparison between the old and new connectors.

shivsood commented 4 years ago

Ack on this issue. I have a repro now. Thanks @ankitbko. I'll be investigating further. For now, what I see is that the bulk API is slow with the new connector. I have to figure out why.

shivsood commented 4 years ago

Created a test app to compare performance of the old vs. new connector with indexed tables. Refer here: https://github.com/microsoft/sql-spark-connector/pull/56. It uses the TPC-DS store_sales table generated with 1 GB of data.

B4PJS commented 3 years ago

@rajkve: can you try repartitioning down to the same number of cores as your SQL DB has? (Ugh, just noticed the DTU model - try a value of 10-15 and use sp_whoisactive to see how many threads can execute in parallel; see the sketch below.)
A: This is always best practice when bulk loading.
B: It looks like your data is badly skewed, something that Scala may be handling better than PySpark; repartitioning may help with this.
I have used this connector to load billions of rows into Hyperscale and have never had any perf problems. Also, for the batch size when loading into a heap, you want to set it to something that equates to 8 MB or less, as this is the page size.
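For illustration only, a sketch of that repartition suggestion applied to the write from the first post; the partition count of 12 is an example value, and `FLOC_VW`, `url`, `tableName`, `Username`, and `Password` are assumed from earlier in the thread.

```python
# Sketch: cap the number of concurrent bulk-insert tasks so the DTU-based DB
# is not oversubscribed. 12 is illustrative; tune it with sp_whoisactive.
FLOC_VW.repartition(12).write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", tableName) \
    .option("user", Username) \
    .option("password", Password) \
    .option("tableLock", "true") \
    .save()
```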

shivsood commented 3 years ago

All, adding this here again: I don't see any issues in the perf comparison between the old and new connectors. I have a test app coded up; refer to #56. Please try it out and comment back if you still see any issues.

ankitbko commented 3 years ago

After disabling the schema check, both connectors are comparable in performance. Thanks @shivsood for helping out. I will update my blog this weekend.
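For completeness, a sketch of what that resolution looks like in the write options; `schemaCheckEnabled` is assumed to be the relevant connector option (verify the exact name against the connector README), and the other variables are taken from the first post.

```python
# Sketch: relax the strict dataframe-vs-table schema check that this thread
# identified as the cause of the slowdown.
FLOC_VW.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", tableName) \
    .option("user", Username) \
    .option("password", Password) \
    .option("schemaCheckEnabled", "false") \
    .save()
```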