microsoft / sql-spark-connector

Apache Spark Connector for SQL Server and Azure SQL
Apache License 2.0
274 stars 116 forks source link

Remove limitation for 4000 max char #202

Closed osamamb closed 1 year ago

osamamb commented 1 year ago

When reading data from Synapse to Databricks, if the field length is more than 4000 char, the string is trimmed, it would be great to remove this limitation.

luxu1-ms commented 1 year ago

Spark MS SQL connector does not add limitation to the field length or trim the string.

moredatapls commented 1 year ago

@osamamb maybe you could provide a reproducible? otherwise it's really hard to find out what the problem is

osamamb commented 1 year ago

@moredatapls , I'm in contact with the customer to provide the code to reproduce, they have provided the below error message which causes the whole job to fail.

String or binary data would be truncated while reading column of type

I'll update the thread once I get the code.

osamamb commented 1 year ago

@moredatapls , below is the notebook code been used to reproduce the issue:


# Databricks notebook source
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

schema = StructType([ \
    StructField("id", IntegerType(), True), \
    StructField("name",StringType(),True), \
    StructField("description",StringType(),True)
  ])

data = [(1,"Ram","Alluri Sitarama Raju was born to a middle-class family in 1897 in a remote village located near the coastal city of Visakhapatnam. He was a person with extreme passion to fight against the oppression of the British prevailing during that time. Some of the major incidents show the patriotism that Raju had developed with time. In 1992, Alluri Sitarama Raju led the Rampa Rebellion also called Manyam Rebellion, a tribal uprising opposing the Madras Forest Act 1882.Sitarama Raju was regarded as the 'hero of the forest' for his fearless act of standing against British police, revenue, and forest officials and motivating his people for the same. During his teens, Sitarama Raju had lost his father and then he went on a pilgrimage visiting North, Western, North Eastern, and North-Western India. In May 1924, Alluri Sitarama Raju was captured by the British government and was shot dead."),
    (2,"Bheem","Komaram Bheem is another freedom fighter of the time born in 1901 in the Adilabad district of Telangana. He was from Gond tribes, who protested for the rights of workers in the 1990s.Bheem's father was killed by the forest authorities for asserting Adivasi rights after which he fled from Sankepalli. One of the Nizam officials died at the hands of Bheem after he harassed the people from the Gonds community for taxes. He then moved to Chandrapur and got employed at a printing press from where he published a magazine against the Nizam and British. Bheem was captured by British officials but from the jail, he fled away to the northeastern state of Assam and started working on a tea plantation and continued for more than 4 years.During the early 1990s, Bheem began his rebellion against the oppression of local landowners and the Nizam of Hyderabad but he was later killed in October 1940.Komaram Bheem (1900/1901–1940), alternatively Kumram Bheem was a revolutionary leader in Hyderabad State of British India from the Gond tribes. Bheem, in association with other Gond leaders, led a protracted low intensity rebellion against the feudal Nizams of Hyderabad in the eastern part of the princely state during the 1930s, which contributed in the culmination of the Telangana Rebellion of 1946. He was killed by armed policemen in 1940, subsequently lionised as a symbol of rebellion, and eulogised in Adivasi and Telugu folklore. Bheem is deified as a pen in Gond culture and is credited for coining the slogan Jal, Jangal, Zameen (Water, Forest, Land) which, symbolising a sentiment against encroachment and exploitation, has been adopted by Adivasi movements as a call to action. He is also closely associated with the movement for Telangana statehood. Komaram Bheem was born in Sankepalli, near Asifabad in Hyderabad State, British India to a family in the Gondi tribal community. He is generally considered to have been born on 22 October 1901, although some consider it to be in 1900. Bheem grew up in the tribal populated forests within the traditional kingdoms of Chanda and Ballalpur, isolated from the rest of the world and received no formal education. He kept moving from place to place throughout his life as the Gondi people were becoming increasingly victimised by exploitation from zamindars (landlords) and businessmen, and through extortion by the jangaalat police (forest police).During the 1900s, there was expansion of mining activities and strengthening of state authority in the Gondi region, regulations were introduced and enforced which hampered the subsistence activities of the Gondis. Zamindars were granted lands in their regions and they imposed taxes on Gondi podu farming activities, non-compliance often resulting in severe arbitration including forced amputations. Gondis began migrating out of their traditional villages, the situation led to occasional retaliations and protestations; Bheem's father was killed by forest officials in one such incident.Following the death of his father, Bheem and his family moved out to Sankepalli to Sardapur near Karimnagar. The Gonds who had migrated to Sardapur settled in barren land owned by the zamindar Laxman Rao; they began subsistence farming on the land and subsequently became a target for tax extraction. In a confrontation in October 1920, Bheem killed a senior official of the Nizamate, Siddiquesaab who was sent by Rao to enforce the confiscation of crops during the time of harvest. To escape capture, he ran away on foot to the city of Chanda with his friend named Kondal. The two were granted refuge by a local publisher Vitoba who operated a printing press and distribution network across the regional railways for an anti–British anti–Nizamate magazine. Bheem learned to speak and read English, Hindi and Urdu during time working with Vitoba.Bheem was forced to run away again after Vitoba was arrested, on this occasion, to a tea plantation in Assam with an acquaintance at the Manchiryal railway station. He worked in the plantations for four and a half years. While doing so he became involved in labour union activities and was eventually arrested for such. Bheem escaped jail within four days, boarded a goods train and returned to Balharshah in the Nizamate. Bheem had heard of Ramji Gond in his childhood, so he decided to initiate his own struggle for the rights of the Adivasis on his return to the Nizamate. Bheem moved to Kakanghat with family and started working for Lacchu Patel who was the head of a village called Devadam. Leveraging his experience in Assam, he helped Patel in a land litigation against the Asifabad estate which made him well known in the nearby villages and in return he was granted permission by Patel to marry.Bheem married a woman named Som Bai, moved to Bhabejhari in the interior of the Gond lands and settled down to cultivate a piece of land. During the time of harvest, he was again approached by forest officials who tried to force him to leave arguing that the land belonged to the state. Bheem then tried to lobby the Nizam directly and sought to present the grievances of the Adivasis before him but he received no response. Bheem then decided to engage in armed revolution. He formed clandestine associations with the banned Communist Party of India, and started mobilising the Adivasi population at Jodeghat, eventually calling a council of tribal leaders from the twelve traditional districts of Ankusapur, Bhabejhari, Bhimangundi, Chalbaridi, Jodeghat, Kallegaon, Koshaguda, Linepatter, Narsapur, Patnapur, Shivaguda and Tokennavada. The council decided to form a guerilla army to protect their lands. Bheem also proposed they declare themselves an independent Gond kingdom. Some see this as a predecessor to more recent attempts to form an autonomous Gondwana.The council was followed by an uprising in the Gondi region which began in 1928. The forces mobilised to attacked the zamindars in Babejhari and Jodeghat. In response, the Nizam recognised Bheem as leader of the Gond rebels and sent the collector at Asifabad to negotiate with him, offering assurances of land grants to the Gonds. Bheem rejected the initial offer stating that they sought justice and instead demanded regional autonomy for the Gonds, eviction of the forest officials and zamindars, and the release of all Gond prisoners in the penal system of Hyderabad state. The demands were rejected and the conflict continued as a low intensity guerilla campaign over the following decade. Bheem directly commanded 300 men under him and operated out of Jodeghat. He is said to have coined the slogan Jal, Jangal, Zameen (Water, Forest, Land) during this period.Bheem's whereabouts were eventually discovered by Kurdu Patel and he was killed in an encounter with armed policemen led by the talukdar of Asifabad, Abdul Sattar. Fifteen others were killed in the encounter. The date of his death is disputed, it's officially recognised to have occurred in October 1940 but Gondi people commemorate it on 8 April 1940.")
  ]

df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
display(df)

# COMMAND ----------

df.write.format("delta").mode("overwrite").save("/mnt/tmp/synapse_length_issue")

# COMMAND ----------

# MAGIC %sql
# MAGIC create table if not exists default.synapse_length_issue
# MAGIC (
# MAGIC     id int,
# MAGIC     name string,
# MAGIC     description string
# MAGIC )
# MAGIC USING DELTA 
# MAGIC LOCATION '/mnt/tmp/synapse_length_issue'

# COMMAND ----------

# MAGIC %sql
# MAGIC select count(*) from default.synapse_length_issue;

# COMMAND ----------

# MAGIC %sql
# MAGIC select * from default.synapse_length_issue;

# COMMAND ----------

from pyspark.sql import functions as f
df = spark.sql("select * from default.synapse_length_issue")
df1 = df.select([f.length(col).alias(col) for col in df.columns])
df1.groupby().max().show()

# COMMAND ----------

'''
Create Synapse Table as Below:

IF NOT EXISTS (
        SELECT name
        FROM sys.schemas
        WHERE name = 'b2b_test'
        )
    EXEC sp_executesql N'CREATE SCHEMA [b2b_test]'
GO
IF NOT EXISTS (
        SELECT *
        FROM sys.objects
        WHERE object_id = OBJECT_ID(N'b2b_test.synapse_length_issue')
            AND type IN (N'U')
        )

BEGIN
CREATE TABLE [b2b_test].[synapse_length_issue]
(
    [id] [int] NULL,
    [name] [varchar](4000) NULL,
    [description] [varchar](8000) NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
END
'''

# COMMAND ----------

from utils.commons import get_synapse_connection_config
from utils.config import config
params = config(filename='/dbfs/mnt/apps/b2b_datalake/resources/connection_properties.ini', section='synapse')    
connection_details = get_synapse_connection_config(spark,params)
tempDir = connection_details['stg_path'] + "data/processed/tmp"

# COMMAND ----------

df.write.format("com.databricks.spark.sqldw") \
            .option("url", connection_details['url']) \
            .option("user", connection_details['user']) \
            .option("password", connection_details['password']) \
            .option("preActions", "select 1") \
            .option("postActions", "select 1") \
            .option("useAzureMSI", "true") \
            .option("tempDir", tempDir) \
            .option("maxStrLength", "4000") \
            .option("dbTable", "b2b_test.synapse_length_issue").mode("overwrite").save()

# COMMAND ----------

# MAGIC %md
# MAGIC Observations:
# MAGIC 
# MAGIC Error-1:
# MAGIC     
# MAGIC     If we give the following property of maxStrLength > 4000, it is giving following error:
# MAGIC         com.databricks.spark.sqldw.SqlDWConnectorException: Exception encountered in Azure Synapse Analytics connector code.
# MAGIC         
# MAGIC Error-2:
# MAGIC 
# MAGIC     If the length of string column in delta table (say description) > 4000 and trying to write data to synapse table, we are facing below error:
# MAGIC         com.databricks.spark.sqldw.SqlDWSideException: Azure Synapse Analytics failed to execute the JDBC query produced by the connector.
# MAGIC         
# MAGIC         When you run with ADF, you will get more elobarated details as below:
# MAGIC         Underlying SQLException(s):
# MAGIC         - com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]
# MAGIC         
# MAGIC    
moredatapls commented 1 year ago

@osamamb you are using the Databricks SqlDw connector in your example:

df.write.format("com.databricks.spark.sqldw") \

Unfortunately, this is proprietary Databricks code, see here: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html. This repo here is about the SQL Spark connector (com.microsoft.sqlserver.jdbc.spark), which means we can't help you out here. So I would advise you to contact the Databricks support instead, they should be able to help