nasa-jpl / sstmp

Solar System Treks Mosaic Pipeline
Apache License 2.0

RAM optimizations, additional parameters (discussion) #33

Open PicoJr opened 2 years ago

PicoJr commented 2 years ago


The purpose of this PR is to discuss a few possible optimizations (RAM usage, coverage) and some additional parameters.

This PR is probably not mature enough to be merged. It probably contains bugs and breaks some existing API.

@foobarbecue feel free to cherry-pick what you deem useful ;)

RAM optimizations

Parsing NAC index

The NAC index is now read in chunks rather than all at once, using the chunksize parameter of pandas.read_csv:

        # read nac_index as chunks instead of reading everything at once in memory
        nac_index = pandas.read_csv(indfilepath, header=None, names=col_list, chunksize=chunksize)

Then the join operation is done over each chunk of lines:

        filtered = []
        chunks_metadata = load_nac_metadata.load_nac_index(
            indfilepath=indfilepath, lblfilepath=lblfilepath
        )
        # iterate over chunks instead of loading all CSV into RAM
        for chunk_metadata in chunks_metadata:
            chunk_footprints = footprints.join(
                chunk_metadata, how="inner", lsuffix="_ode", rsuffix=""
            )

            # redacted...

        return pandas.concat(filtered).dropna()
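To make the chunked join concrete, here is a minimal, self-contained sketch of the same pattern. It is not the PR's code: the in-memory CSV, the column names `prod_id`, `incidence` and `area`, and the tiny `footprints` table are all made up for illustration. The only library behaviour it relies on is that `pandas.read_csv` returns an iterator of DataFrames when `chunksize` is set.

```python
import io

import pandas

# Hypothetical stand-in for the NAC index file: product id plus one metadata column.
index_csv = io.StringIO("p1,10.0\np2,20.0\np3,30.0\np4,40.0\np5,50.0\n")

# Hypothetical footprints table, indexed by product id.
footprints = pandas.DataFrame(
    {"area": [1.5, 2.5]}, index=pandas.Index(["p2", "p5"], name="prod_id")
)

# With chunksize set, read_csv yields DataFrames of at most 2 rows each
# instead of loading the whole file into memory.
chunks = pandas.read_csv(
    index_csv,
    header=None,
    names=["prod_id", "incidence"],
    index_col="prod_id",
    chunksize=2,
)

filtered = []
for chunk in chunks:
    # The inner join keeps only the rows of this chunk that match a footprint.
    joined = footprints.join(chunk, how="inner")
    if not joined.empty:
        filtered.append(joined)

# Only the matching rows are ever held in memory at the same time.
result = pandas.concat(filtered).dropna()
print(result)
```

Peak memory is then bounded by the chunk size plus the accumulated matches, rather than by the size of the full index.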

Computing pairs (overlay)

geopandas.overlay, and to some extent geopandas in general, has performance issues.

When there are many candidate images for pairs, geopandas.overlay blows up in both RAM usage and computation time.

It wastes a lot of time computing a huge number of pairs, most of which are then discarded by the filters (sun geometry and area).

This PR adds a generator that yields chunks of pairs, which are filtered as they are generated.

That way RAM usage is kept low and it is possible to abort early if enough pairs were found.

def pairs_iter_from_image_search(imagesearch: "ImageSearch") -> Iterator[GeoDataFrame]:
    gdf: GeoDataFrame = imagesearch.results.dropna()
    # Store index (product id) in column so that it's preserved in spatial join operation
    gdf["prod_id"] = gdf.index

    chunk_row_size = 100
    chunks = [
        gdf[i : i + chunk_row_size] for i in range(0, gdf.shape[0], chunk_row_size)
    ]
    for (chunk_1, chunk_2) in tqdm.tqdm(
        itertools.product(chunks, chunks), total=(len(chunks) ** 2.0)
    ):
        pairs = geopandas.overlay(chunk_1, chunk_2, how="union", keep_geom_type=True)
        # redacted ...
        yield pairs

generator that returns chunks of pairs

And then the generator is used here:

    filtered_pairs = []
    filtered_pairs_count = 0
    for pairs_chunk in pairs_iter_from_image_search(imgs):
        filtered_chunk_pairs = filter_small_overlaps(
            pairs_chunk, incidence_range=(incidence_range_low, incidence_range_high)
        )
        # keep the filtered chunk so that it can be concatenated below
        filtered_pairs.append(filtered_chunk_pairs)
        filtered_pairs_count += len(filtered_chunk_pairs)
        if filtered_pairs_count > max_pairs and verbose:
            print(f"found {filtered_pairs_count} pairs > --max-pairs={max_pairs}")

    pairs = pandas.concat(filtered_pairs)
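The generate, filter, and stop-early pattern can be sketched without geopandas. In the sketch below, plain integers stand in for images, a tuple `(a, b)` stands in for an overlay pair, and `keep` stands in for the sun geometry and area filters; all function names are hypothetical. The `break` is an assumption about how the early abort could be wired in, since the excerpt above only shows the message being printed.

```python
import itertools
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[int, int]


def chunked(items: Sequence[int], size: int) -> List[Sequence[int]]:
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def pairs_iter(items: Sequence[int], chunk_size: int) -> Iterator[List[Pair]]:
    """Lazily yield one block of candidate pairs per (chunk, chunk) combination."""
    chunks = chunked(items, chunk_size)
    for chunk_1, chunk_2 in itertools.product(chunks, chunks):
        # Stand-in for the overlay: every ordered combination with a < b.
        yield [(a, b) for a in chunk_1 for b in chunk_2 if a < b]


def collect_pairs(
    items: Sequence[int],
    chunk_size: int,
    max_pairs: int,
    keep: Callable[[Pair], bool],
) -> Tuple[List[Pair], int]:
    """Filter each block as it arrives and stop once enough pairs were kept."""
    kept: List[Pair] = []
    blocks_seen = 0
    for block in pairs_iter(items, chunk_size):
        blocks_seen += 1
        kept.extend(pair for pair in block if keep(pair))
        if len(kept) >= max_pairs:
            break  # assumed early abort: remaining blocks are never computed
    return kept, blocks_seen


# Ten "images" in chunks of four give 3 x 3 = 9 blocks in total.
kept, blocks_seen = collect_pairs(
    list(range(10)), chunk_size=4, max_pairs=5, keep=lambda p: p[1] - p[0] >= 3
)
print(f"kept {len(kept)} pairs after {blocks_seen} of 9 blocks")
```

Because the generator is lazy, the blocks after the `break` are never built, which is where both the RAM and the time savings come from.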

Improved coverage

When looking for a minimal set of pairs that covers an area, it seems the way the code chooses a new point is flawed.

Instead of:

# always pick the same point even if it could not find a pair to cover it...
search_point = remaining_uncovered_poly.representative_point()

This PR uses:

# pick a random point in the remaining uncovered poly
search_point = random_points_in_polygon(remaining_uncovered_poly, 1)[0]

This change helps with coverage close to the equator where finding pairs is harder.
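The `random_points_in_polygon` helper is not shown in this description. One common way to draw a uniform random point inside a polygon is rejection sampling from its bounding box, sketched below in pure Python. This is an illustration under assumptions, not the PR's helper: the polygon is a simple list of `(x, y)` vertices, and the names `point_in_polygon` and `random_point_in_polygon` are made up.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, vertices: List[Point]) -> bool:
    """Ray casting: count how many edges a rightward horizontal ray crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        # Only edges that straddle the ray's y coordinate can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_in_polygon(vertices: List[Point], rng: random.Random) -> Point:
    """Draw points in the bounding box until one falls inside the polygon.

    Assumes the polygon has a non-zero area, otherwise the loop would not end.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    while True:
        candidate = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(candidate, vertices):
            return candidate


# L-shaped stand-in for the remaining uncovered polygon.
l_shape = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (1.0, 1.0), (1.0, 4.0), (0.0, 4.0)]
rng = random.Random(0)
points = [random_point_in_polygon(l_shape, rng) for _ in range(50)]
print(points[:3])
```

Unlike a deterministic representative point, successive calls return different search points, so a location that no pair can cover does not block the search from trying the rest of the uncovered area.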

New parameters


This PR also modifies the so that it can read pairs from the json written by and download the corresponding images (in parallel).