marcell-ferencz-databricks / dash-hackathon-0424


NY Flood/01 Load Data To Delta - Soil data #2

Open · tomkdefra opened 5 months ago

tomkdefra commented 5 months ago

Cell 28:

```python
# Read the MUPOLYGON layer from the zipped File Geodatabase using Mosaic's OGR reader
mupolygon_df = (
    mos.read()
    .format("multi_read_ogr")
    .option("vsizip", "true")          # read directly from the zip archive
    .option("layerName", "mupolygon")  # soil map unit polygon layer
    .load(f"{ROOT_PATH_SPARK}/soil/")
    .repartition(200, F.col("Shape"))
    # reproject geometries from the source SRID to WGS84 (EPSG:4326)
    .withColumn("geom", mos.st_updatesrid("Shape", "Shape_srid", F.lit(4326)))
)

mupolygon_df.display()
```

This fails with:

```
java.lang.NullPointerException
```

tomkdefra commented 5 months ago

Cell 31 fails with the same error:

```
java.lang.NullPointerException
```

Also, can we delete the commented-out code to avoid confusion?

marcell-ferencz-databricks commented 5 months ago

Hi @tomkdefra

Can you please try the following:

  1. Check that the downloaded file is valid: `%sh stat -c %s $ROOT_PATH/soil/soil.zip` should return more than 0 (721669877, to be exact).
  2. Rename the file to soil.gdb.zip, i.e. `%sh mkdir -p $ROOT_PATH/soil_backup && cp -d $ROOT_PATH/soil/soil.zip $ROOT_PATH/soil_backup && mv $ROOT_PATH/soil/soil.zip $ROOT_PATH/soil/soil.gdb.zip` (this backs up the original file to a separate directory and then renames it; a Python equivalent is sketched after this list).
  3. You can ignore the SApolygon stuff for now, as I don't think we actually use it.
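
For reference, a minimal Python equivalent of steps 1 and 2, assuming `ROOT_PATH` is exported as an environment variable (adjust if your notebook defines it as a Python variable instead):

```python
import os
import shutil

root = os.environ["ROOT_PATH"]  # assumes ROOT_PATH is set in the environment
src = os.path.join(root, "soil", "soil.zip")

# Step 1: confirm the download is a valid, non-empty file (expected: 721669877 bytes)
size = os.path.getsize(src)
print(f"{src}: {size} bytes")
assert size > 0, "soil.zip is empty; re-download it"

# Step 2: back up the original, then rename it so the OGR reader
# recognises it as a zipped File Geodatabase (.gdb.zip)
backup_dir = os.path.join(root, "soil_backup")
os.makedirs(backup_dir, exist_ok=True)
shutil.copy2(src, backup_dir)
shutil.move(src, os.path.join(root, "soil", "soil.gdb.zip"))
```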
tomkdefra commented 5 months ago

stat on soil.zip does indeed return a valid file: 721669877 bytes.

And renaming works fine; stat on soil.gdb.zip now returns 721669877 too.

marcell-ferencz-databricks commented 5 months ago

Can you try running cell 28 again?

tomkdefra commented 5 months ago

OK, that does appear to have solved the issue. All the Spark jobs in cell 28 have completed, and cell 29 (write to Delta) is running now. I'll let you know here if that completes.
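
(Cell 29 itself isn't shown in this thread; a typical write-to-Delta step for this DataFrame might look like the sketch below, with a hypothetical table name.)

```python
# Hypothetical sketch of cell 29's write-to-Delta step; the actual
# table name isn't given in this thread.
(
    mupolygon_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("soil_mupolygon")  # assumed table name
)
```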

marcell-ferencz-databricks commented 5 months ago

To add some context: this was my mistake. That read expects a .gdb.zip file (see `.option("vsizip", "true")`), so:

  1. Downloading the file as a plain .zip caused the error.
  2. I shouldn't have unzipped the file in the first place... (a corrected download sketch follows this list)
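
A minimal sketch of a download step that avoids the rename entirely by saving straight to the .gdb.zip name; `SOIL_URL` is a hypothetical placeholder for the actual source URL:

```python
import os
import urllib.request

SOIL_URL = "https://example.com/soil.gdb.zip"  # hypothetical; substitute the real source URL
root = os.environ["ROOT_PATH"]  # assumed to be set, as in the earlier steps

dest = os.path.join(root, "soil", "soil.gdb.zip")
os.makedirs(os.path.dirname(dest), exist_ok=True)

# Save directly under the .gdb.zip name so the vsizip OGR reader
# recognises it as a zipped File Geodatabase; no unzip, no rename needed.
urllib.request.urlretrieve(SOIL_URL, dest)
```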
tomkdefra commented 5 months ago

Ahh thank you for that. I certainly wouldn't have caught it!

tomkdefra commented 5 months ago
[image: screenshot of the running job]

It's eating something chunky... For the hackathon, would it be wise to spec more powerful clusters? I'm just testing on a Standard_DS3_v2 (12 GB memory, 4 cores).

marcell-ferencz-databricks commented 5 months ago

Is that a single node cluster? It took me 7 minutes to do that load with the below cluster (2 workers):

[image: cluster configuration screenshot]

I'm not sure we need to ask everyone to do the individual loads -- we can just give access to the Delta tables (a sketch of granting access follows below). However, it's worth testing how the later steps perform for you: we're running a fair amount of Spark here, so it's worth having a multi-node cluster.
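
For the "give access to the Delta tables" route, a minimal sketch using a Spark SQL GRANT statement; the table and group names are hypothetical:

```python
# Hypothetical sketch: grant read access on a Delta table to a workspace group.
# Table and principal names are placeholders, not from this thread.
spark.sql("GRANT SELECT ON TABLE soil_mupolygon TO `hackathon-participants`")
```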

tomkdefra commented 5 months ago

I should have thought about that. I don't think I have sufficient permission to create multi-node clusters. I'll chat to one of our platform admins and see if I can sort that!

Re. just providing access to the Delta tables - yes, absolutely.