timrobertson100 opened 9 months ago
Changing to use a snapshot input brings this down to roughly production performance. It takes 1.4 minutes to process and prepare the HFiles, but now 27 minutes to read all the many Avro files. If we're to optimize anything, it will need to be the read path (Apache Iceberg sounds appropriate to explore at some point).
```java
Dataset<Row> source =
    spark
        .read()
        .format("com.databricks.spark.avro")
        .load("/data/hdfsview/occurrence/.snapshot/tim-occurrence-map/occurrence")
        .select(
            "datasetkey",
            "publishingorgkey",
            "publishingcountry",
            "networkkey",
            "countrycode",
            "basisofrecord",
            "decimallatitude",
            "decimallongitude",
            "kingdomkey",
            "phylumkey",
            "classkey",
            "orderkey",
            "familykey",
            "genuskey",
            "specieskey",
            "taxonkey",
            "year",
            "occurrencestatus",
            "hasgeospatialissues");
source.createOrReplaceTempView("occurrence_input");
```
The code above is the current version, using the Databricks Spark Avro input.
A CombineInputFormat might help a little, although the Databricks reader already launches fewer tasks than there are files, so it is somehow combining already. Tests suggested combining is only a little better, at the cost of more complex code.
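As an alternative to a custom CombineInputFormat, Spark's file-source packing can be tuned so many small Avro files land in fewer read tasks. A minimal sketch, assuming Spark 2.x+ where these settings apply to file-based sources (the specific byte values here are illustrative, not tuned):

```java
// Pack more small files into each read task: raise the per-partition target
// and the assumed cost of opening a file (both values are examples only).
spark.conf().set("spark.sql.files.maxPartitionBytes", String.valueOf(256L * 1024 * 1024));
spark.conf().set("spark.sql.files.openCostInBytes", String.valueOf(4L * 1024 * 1024));
```

Whether this beats the combining the Databricks reader already does would need measuring against the same snapshot input.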
It helps to prepare the table of stats for tiles (i.e. large views), broadcast it, and then join where the mapKey is not found.
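The broadcast-and-join step above might look like the following sketch. The `tileStats` dataset and the `mapKey` column name are assumptions for illustration; the anti-join keeps only rows whose mapKey has no precomputed stats, so the expensive tiling runs only for those:

```java
import static org.apache.spark.sql.functions.broadcast;

// Hypothetical: tileStats is a small, precomputed table of stats for large
// views. Broadcast it and anti-join so only mapKeys *not* covered by the
// precomputed stats flow into the tiling stage.
Dataset<Row> needingTiles =
    source.join(
        broadcast(tileStats),
        source.col("mapKey").equalTo(tileStats.col("mapKey")),
        "left_anti");
```

Broadcasting avoids a shuffle of the large occurrence dataset, since the small stats table is shipped to every executor instead.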
There is currently a hack in the point tile code to avoid known large views (kingdoms, all data, etc.).
We should consider what can or should be done to read the input quickly. The current code ran as above (600 cores, as per production), which is still ~2x quicker than current production, but it reads from Hive rather than the snapshot, so it is not a true comparison.