Open tomkdefra opened 5 months ago
Cell 31:
java.lang.NullPointerException
Also, can we delete the commented-out code to avoid confusion?
Hi @tomkdefra,
Can you please try the following:
%sh stat -c %s $ROOT_PATH/soil/soil.zip
returns more than 0 (should be 721669877 to be exact)
%sh mkdir -p $ROOT_PATH/soil_backup && cp -d $ROOT_PATH/soil/soil.zip $ROOT_PATH/soil_backup && mv $ROOT_PATH/soil/soil.zip $ROOT_PATH/soil/soil.gdb.zip
(this should back up the original file to a separate directory and do the renaming after)
stat soil.zip does indeed return a valid file: 721669877,
and renaming it works fine. Now stat soil.gdb.zip returns 721669877.
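For anyone who would rather run the same steps from a Python cell, here is a rough sketch of the check, backup, and rename sequence above. The function name `backup_and_rename` is made up for illustration, and it assumes the paths are on a local filesystem visible to the driver. Note that `shutil.copy2` is not an exact match for `cp -d` (it follows symlinks).

```python
import shutil
from pathlib import Path


def backup_and_rename(src: Path, backup_dir: Path, new_name: str) -> Path:
    """Copy src into backup_dir, then rename src to new_name in its own directory.

    Hypothetical helper mirroring the %sh commands in this thread.
    """
    # Equivalent of `stat -c %s`: refuse to continue on an empty file.
    if src.stat().st_size == 0:
        raise ValueError(f"{src} is empty; refusing to continue")

    # Equivalent of `mkdir -p` followed by `cp`.
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, backup_dir / src.name)

    # Equivalent of `mv soil.zip soil.gdb.zip`.
    target = src.with_name(new_name)
    src.rename(target)
    return target
```

With `ROOT_PATH` set as in the notebook, this could be called as `backup_and_rename(Path(ROOT_PATH) / "soil" / "soil.zip", Path(ROOT_PATH) / "soil_backup", "soil.gdb.zip")`.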
Can you try running cell 28 again?
OK, that does appear to have solved this issue. All the Spark jobs run in cell 28 have completed. Cell 29 (write to delta) is running now. I'll let you know here if that completes.
To put context to this...
This is my mistake: that read expects a .gdb.zip file (see .option("vsizip", "true")), so .zip caused the error.
Ahh, thank you for that. I certainly wouldn't have caught it!
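A small pre-flight check might catch this kind of naming mismatch before the read fails with an opaque NullPointerException. The sketch below is only an illustration: `misnamed_archives` is a hypothetical helper, and the `.gdb.zip` suffix is taken from the explanation above rather than from any documented reader requirement.

```python
from pathlib import Path
from typing import List


def misnamed_archives(directory: Path, expected_suffix: str = ".gdb.zip") -> List[Path]:
    """Return .zip files in directory whose names do not end with expected_suffix."""
    return sorted(
        p
        for p in directory.iterdir()
        # Flag plain .zip archives that lack the suffix the read expects.
        if p.is_file()
        and p.name.endswith(".zip")
        and not p.name.endswith(expected_suffix)
    )
```

Running it against the soil directory before cell 28 and stopping if the returned list is non-empty would surface a stray soil.zip early.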
It's eating something chunky. For the hackathon, would it be wise to spec more powerful clusters? I'm just testing on a Standard_DS3_v2 - 12GB and 4 cores.
Is that a single node cluster? It took me 7 minutes to do that load with the below cluster (2 workers):
I'm not sure we need to ask everyone to do the individual loads -- we can just give access to the Delta tables. However, it's worth testing how the steps later perform for you. We're running a fair amount of Spark here so it's worth having a multi-node cluster.
I should have thought about that. I don't think I have sufficient permission to create multi-node clusters. I'll chat to one of our platform admins and see if I can sort that!
Re. just providing access to the Delta tables - yes, absolutely.
Cell 28:
mupolygon_df = mos.read()\
    .format("multi_read_ogr")\
    .option("vsizip", "true")\
    .option("layerName", "mupolygon")\
    .load(f"{ROOT_PATH_SPARK}/soil/")\
    .repartition(200, F.col("Shape"))\
    .withColumn("geom", mos.st_updatesrid("Shape", "Shape_srid", F.lit(4326)))
mupolygon_df.display()
java.lang.NullPointerException