nightscape / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

Cluster Stops Processing Files After Six Files Have Been Read #750

Open brian-custer opened 1 year ago

brian-custer commented 1 year ago

Is there an existing issue for this?

Current Behavior

I am using com.crealytics:spark-excel_2.12:3.3.1_0.18.7 to read over 70 Excel files from my data lake. The cluster stops working after reading approximately six workbooks: it hangs and reads no more files. I am using a cluster that autoscales from a minimum of 2 to a maximum of 8 worker nodes. It behaves as if the cluster runs out of memory, or hits some other limit that prevents it from reading any more workbooks.

Expected Behavior

Expected behavior is that I should be able to read all 70 workbooks in my data lake and append the data to an existing delta table in my unity catalog.

Steps To Reproduce

Use the following code to loop through all the files in a data lake folder, read each spreadsheet, and append it to a Unity Catalog table:

```python
import datetime as dt
from pyspark.sql.types import StringType

counter = 0
for file in dbutils.fs.ls(pathToParcelData):
    print(counter)
    fileDate = dt.datetime.utcfromtimestamp(file.modificationTime / 1000).strftime('%Y-%m-%d')
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("dataAddress", "'Transaction Detail'!A1") \
        .load(file.path)
    # Strip trailing spaces from column names
    for field in df.schema.fieldNames():
        df = df.withColumnRenamed(field, field.removesuffix(" "))
    df = (df.withColumnRenamed("Invoice #", "invoicenum")
            .withColumnRenamed("Tracking #", "trackingnum")
            .withColumnRenamed("Control #", "controlnum")
            .withColumnRenamed("Invoice Date", "invoicedate")
            .withColumnRenamed("Invoice Amount", "invoiceamount")
            .withColumnRenamed("Ship Date", "shipdate")
            .withColumnRenamed("Delivery Date", "deliverydate")
            .withColumnRenamed("Service Level", "servicelevel")
            .withColumnRenamed("Actual Weight", "actualweight")
            .withColumnRenamed("Bill Weight", "billweight")
            .withColumnRenamed("Audited Amount", "totalcharge")
            .withColumnRenamed("Zone", "zone")
            .withColumnRenamed("Manual", "glcode"))
    df = df.select("invoicenum", "invoicedate", "invoiceamount", "trackingnum",
                   "shipdate", "deliverydate", "servicelevel", "zone",
                   "actualweight", "billweight", "glcode", "controlnum", "totalcharge")
    df = df.withColumn("zone", df["zone"].cast(StringType()))
    if counter > 0:
        df.write.mode("append").saveAsTable("sources.shipping.parcel")
    else:
        df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")
    counter += 1
```

Environment

- Spark version: 3.4.0
- Spark-Excel version: 3.3.1
- OS: databricks
- Cluster environment: Standard_DS3_v2 worker and executor nodes. 2-8 worker nodes

Anything else?

No response

nightscape commented 1 year ago

Can you try with format `excel` instead? That is the new v2 implementation. It also supports reading multiple files at once, so you could try pointing it directly at the directory.
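A minimal sketch of what that could look like, assuming the same `pathToParcelData` folder and options as in the original snippet (the `spark` and `pathToParcelData` names come from the Databricks notebook context, and the v2 data source is selected by name with `format("excel")`):

```python
# Hedged sketch: read the whole folder of workbooks in one go with the
# v2 "excel" data source, instead of looping over files one by one.
df = (spark.read.format("excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Transaction Detail'!A1")
      .load(pathToParcelData))  # point at the directory, not a single file

# A single write then replaces the per-file append/overwrite logic.
df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")
```

This reader configuration only runs against a live Spark cluster with the spark-excel jar installed, so it is shown here as a fragment rather than a runnable script.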

brian-custer commented 1 year ago

Thanks for the response. Is it capable of reading up to 70 Excel files at a time? I changed my code as you suggested, but the driver hangs. I suspect it is a memory issue, but I'm not sure.

nightscape commented 1 year ago

Can you post the new code you're using? Btw, the spark-excel version you mentioned in the first issue doesn't look right. It should probably be 0.18.???