timrobertson100 / map-spark-sql

Exploring a better map batch process for GBIF

Consider optimisations for the point tile #5

Open timrobertson100 opened 9 months ago

timrobertson100 commented 9 months ago

The point tile currently contains a hack to avoid known large views (kingdoms, all data, etc.).

We should consider what can or should be done to read the input quickly. The current code ran like this (600 cores, as per production) and is still ~2x quicker than current production. However, it reads from Hive rather than the snapshot, so this is not a true comparison.

image
timrobertson100 commented 9 months ago

Switching to a snapshot input brings the runtime to roughly the same as production. Processing and preparing the HFiles takes 1.4 mins, but reading the many Avro files now takes 27 mins. If we're to optimise anything, it will need to be the read (Apache Iceberg sounds appropriate to explore at some point).

    // Read the Avro files from the HDFS snapshot, keeping only the selected columns.
    Dataset<Row> source =
        spark
            .read()
            .format("com.databricks.spark.avro")
            .load("/data/hdfsview/occurrence/.snapshot/tim-occurrence-map/occurrence")
            .select(
                "datasetkey",
                "publishingorgkey",
                "publishingcountry",
                "networkkey",
                "countrycode",
                "basisofrecord",
                "decimallatitude",
                "decimallongitude",
                "kingdomkey",
                "phylumkey",
                "classkey",
                "orderkey",
                "familykey",
                "genuskey",
                "specieskey",
                "taxonkey",
                "year",
                "occurrencestatus",
                "hasgeospatialissues");
    source.createOrReplaceTempView("occurrence_input");
image
timrobertson100 commented 9 months ago

Current version using the databricks spark input.

A CombineInputFormat might help a little, although the databricks reader launches fewer tasks than there are files, so it is already combining somehow. Tests suggested that combining is only a little better, at the cost of more complex code.
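
To make the "combining" idea concrete, here is a minimal plain-Java sketch of packing many small files into fewer read tasks. It is only a model of the concept, not the Hadoop or databricks implementation; the class name `FileCombiner`, the file sizes and the 100-byte limit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of combining many small files into fewer read tasks (hypothetical). */
final class FileCombiner {

    /**
     * Greedily packs file sizes (in bytes) into groups, in input order, so that
     * each group's total stays within maxSplitBytes. A single file larger than
     * the limit ends up in a group of its own.
     */
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six hypothetical files packed with a limit of 100 bytes per task.
        List<Long> sizes = List.of(40L, 30L, 50L, 10L, 120L, 20L);
        System.out.println(combine(sizes, 100L));
    }
}
```

With these made-up sizes, six files become four tasks. The saving depends entirely on how small the files are relative to the limit, which may be why the gain seen in the tests was modest when the reader was already grouping files.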

It helps to prepare the table of stats for tiles (i.e. large views), broadcast it, and then join where the mapKey is not found.
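
As a rough illustration of that approach, the plain-Java sketch below models the logic without Spark: build a small set of large-view map keys from per-key counts, then keep only the records whose `mapKey` is not in that set (an anti-join against a small lookup). The class `LargeViewFilter`, the `Occurrence` type, the sample keys and the threshold are all hypothetical, and reading "join where the mapKey is not found" as an anti-join is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Plain-Java model of "broadcast the large views, keep what is not found" (hypothetical). */
final class LargeViewFilter {

    /** Hypothetical input record; only the map key matters for this sketch. */
    static final class Occurrence {
        final String mapKey;
        final double lat;
        final double lng;

        Occurrence(String mapKey, double lat, double lng) {
            this.mapKey = mapKey;
            this.lat = lat;
            this.lng = lng;
        }
    }

    /** Builds the small stats table: map keys with more than threshold records. */
    static Set<String> largeViews(List<Occurrence> input, long threshold) {
        Map<String, Long> counts = input.stream()
            .collect(Collectors.groupingBy(o -> o.mapKey, Collectors.counting()));
        return counts.entrySet().stream()
            .filter(e -> e.getValue() > threshold)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
    }

    /** Anti-join: keeps records whose mapKey is NOT in the large-view set. */
    static List<Occurrence> pointTileInput(List<Occurrence> input, Set<String> largeViews) {
        return input.stream()
            .filter(o -> !largeViews.contains(o.mapKey))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Occurrence> input = List.of(
            new Occurrence("ALL", 1.0, 2.0),
            new Occurrence("ALL", 3.0, 4.0),
            new Occurrence("ALL", 5.0, 6.0),
            new Occurrence("TAXON:212", 7.0, 8.0));
        Set<String> large = largeViews(input, 2);
        System.out.println(pointTileInput(input, large).size());
    }
}
```

The in-memory `Set` plays the role of the broadcast table: it is small enough to copy to every worker, so each record can be checked locally without a shuffle. In Spark this would presumably map to `functions.broadcast(...)` on the stats table combined with a `left_anti` join, though that mapping is an assumption rather than what the job actually does.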

Screenshot 2023-10-07 at 15 23 26