## Motivation

The purpose of this PR is to discuss a few possible optimizations (RAM usage, coverage) and some additional parameters.
This PR is probably not mature enough to be merged: it likely contains bugs and breaks parts of the existing API.
@foobarbecue feel free to cherry-pick what you deem useful ;)
## RAM optimizations

### Parsing NAC index
This PR adds chunked processing of the NAC index, which keeps memory usage low while parsing it. It relies on the `chunksize` parameter of `pandas.read_csv`:
```python
# read nac_index as chunks instead of reading everything at once in memory
nac_index = pandas.read_csv(indfilepath, header=None, names=col_list, chunksize=chunksize)
```
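With `chunksize` set, `read_csv` returns an iterator of DataFrames, so `load_nac_index` can simply pass the chunks through. A minimal sketch, assuming the column names come from the INDEX.LBL file via a hypothetical `parse_lbl_columns` helper (not part of this PR):

```python
import pandas


def load_nac_index(indfilepath, lblfilepath, chunksize=100_000):
    # Sketch only: parse_lbl_columns is a hypothetical helper that would
    # derive the column names from the INDEX.LBL file
    col_list = parse_lbl_columns(lblfilepath)
    # read_csv with chunksize returns an iterator of DataFrames,
    # so callers can consume the index chunk by chunk
    yield from pandas.read_csv(
        indfilepath, header=None, names=col_list, chunksize=chunksize
    )
```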
The join operation is then performed over each chunk of lines:
```python
filtered = []
chunks_metadata = load_nac_metadata.load_nac_index(
    indfilepath=indfilepath, lblfilepath=lblfilepath
)
# iterate over chunks instead of loading the whole CSV into RAM
for chunk_metadata in chunks_metadata:
    chunk_footprints = footprints.join(
        chunk_metadata, how="inner", lsuffix="_ode", rsuffix=""
    )
    # redacted...
    filtered.append(chunk_footprints)
return pandas.concat(filtered).dropna()
```
### Computing pairs (overlay)

`geopandas.overlay`, and to some extent `geopandas` itself, have performance issues. When we find many image candidates for pairs, `geopandas.overlay` explodes in both RAM usage and computation time: it wastes a lot of time computing a huge number of pairs that are mostly discarded by the filters (sun geometry and area) anyway.
This PR adds a generator that yields chunks of pairs and filters them as they are generated.
That way RAM usage stays low, and it is possible to abort early once enough pairs have been found.
```python
def pairs_iter_from_image_search(imagesearch: "ImageSearch") -> Iterator[GeoDataFrame]:
    gdf: GeoDataFrame = imagesearch.results.dropna()
    # Store the index (product id) in a column so that it is preserved
    # by the spatial join operation
    gdf["prod_id"] = gdf.index
    chunk_row_size = 100
    chunks = [
        gdf[i : i + chunk_row_size] for i in range(0, gdf.shape[0], chunk_row_size)
    ]
    for chunk_1, chunk_2 in tqdm.tqdm(
        itertools.product(chunks, chunks), total=len(chunks) ** 2
    ):
        pairs = geopandas.overlay(chunk_1, chunk_2, how="union", keep_geom_type=True)
        # redacted ...
        yield pairs
```
And then the generator is used here:
```python
filtered_pairs = []
filtered_pairs_count = 0
for pairs_chunk in pairs_iter_from_image_search(imgs):
    filtered_chunk_pairs = filter_small_overlaps(
        filter_sun_geometry(
            pairs_chunk, incidence_range=(incidence_range_low, incidence_range_high)
        )
    )
    filtered_pairs.append(filtered_chunk_pairs)
    filtered_pairs_count += len(filtered_chunk_pairs)
    # abort early once enough pairs were found (regardless of verbosity)
    if filtered_pairs_count > max_pairs:
        if verbose:
            print(f"found {filtered_pairs_count} pairs > --max-pairs={max_pairs}")
        break
pairs = pandas.concat(filtered_pairs)
```
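`filter_sun_geometry` and `filter_small_overlaps` are not shown in this diff; a rough sketch of what the incidence filter could look like (the `INCIDENCE_ANGLE_1`/`INCIDENCE_ANGLE_2` column names are assumptions, not the actual index columns):

```python
from geopandas import GeoDataFrame


def filter_sun_geometry(pairs: GeoDataFrame, incidence_range) -> GeoDataFrame:
    # Keep only pairs where both images' sun incidence angles fall within
    # the requested range; the column names are assumptions for this sketch
    low, high = incidence_range
    mask = (
        pairs["INCIDENCE_ANGLE_1"].between(low, high)
        & pairs["INCIDENCE_ANGLE_2"].between(low, high)
    )
    return pairs[mask]
```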
## Improved coverage
When looking for a minimal set of pairs that covers an area, the way the code chooses a new search point seems flawed.
Instead of:
```python
# always picks the same point, even if no pair could be found to cover it...
search_point = remaining_uncovered_poly.representative_point()
```
This PR uses:
```python
# pick a random point in the remaining uncovered polygon
search_point = random_points_in_polygon(remaining_uncovered_poly, 1)[0]
```
This change helps with coverage close to the equator, where finding pairs is harder.
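`random_points_in_polygon` is a small helper; a possible rejection-sampling implementation with shapely (a sketch, not necessarily the exact code in this PR):

```python
import random

from shapely.geometry import Point


def random_points_in_polygon(polygon, count):
    # Draw `count` uniformly distributed points inside `polygon` by
    # rejection sampling over its bounding box
    minx, miny, maxx, maxy = polygon.bounds
    points = []
    while len(points) < count:
        candidate = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
        if polygon.contains(candidate):
            points.append(candidate)
    return points
```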
## New parameters
- `indfilepath` and `lblfilepath`: paths to INDEX.TAB and INDEX.LBL
- `max_pairs`: stop looking for pairs as soon as at least `max_pairs` have been found
- `miss_limit`: how many times we may fail to cover a point when `--find-covering=True` is given
- `incidence_range_low`, `incidence_range_high`: filter out pairs for which an image's sun incidence falls outside this range. This helps find better pairs close to the north/south poles...
- `json_output`: path for dumping the JSON containing the pairs
## Misc
This PR also modifies `download_NAC.py` so that it can read pairs from the JSON written by `find_stereo_pairs.py` and download the corresponding images (in parallel).
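A minimal sketch of what that parallel download could look like (the JSON layout and the `"url_1"`/`"url_2"` field names are assumptions; the actual `download_NAC.py` logic may differ):

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def download_pairs(json_output_path, dest_dir=".", workers=8):
    # Read the pairs written by find_stereo_pairs.py; the "url_1"/"url_2"
    # field names are assumptions for this sketch
    with open(json_output_path) as f:
        pairs = json.load(f)
    # deduplicate: the same image may appear in several pairs
    urls = {url for pair in pairs for url in (pair["url_1"], pair["url_2"])}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url in urls:
            pool.submit(urlretrieve, url, f"{dest_dir}/{url.rsplit('/', 1)[-1]}")
```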