opera-adt / COMPASS

COregistered Multi-temPorAl Sar Slc
Apache License 2.0

allow multiple bursts to be processed #39

Closed LiangJYu closed 1 year ago

LiangJYu commented 2 years ago

This PR allows multiple bursts to be processed by all COMPASS workflow modules.

If no burst ID(s) are given in the runconfig, then all bursts in the provided SAFE will be processed.

To-do: testing

hfattahi commented 2 years ago

Now that we allow processing multiple burst IDs within a single SAFE file, do we have a plan to let the user set more than one geogrid? Theoretically, there could be as many geogrids as the number of bursts the user decides to process. I have not seen this addressed in this PR.

@vbrancat That's a very good point. What are our choices? I can think of two: 1) add the geogrids of all requested bursts to the run config; 2) introduce an ancillary dataset (a database) that, for each burst ID, has predefined geogrid parameters.

I think option 2 is cleaner. The SAS would then query the database for each burst ID and extract the geogrid that needs to be used.

What do you think? Can you think of another choice?
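
As a rough illustration of option 2, each burst ID in such a database would map to predefined geogrid parameters. A minimal sketch with hypothetical field names (the values are borrowed from a burst that appears later in this thread):

# Hypothetical shape of one record in the option-2 database; the field names
# are illustrative only and do not reflect the actual COMPASS schema.
burst_geogrid_record = {
    "burst_id": "t078_165495_iw1",
    "epsg": 32614,
    # xmin, ymin, xmax, ymax of the output geogrid
    "bbox": (280200.0, 1990550.0, 377250.0, 2036550.0),
}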

vbrancat commented 2 years ago

@hfattahi Note that internally our geogrid is a dictionary of geogrids with keys being the different burst IDs:

https://github.com/opera-adt/COMPASS/blob/619c7a1507faca68cdb4ac5bd74c6b598c3c78c8/src/compass/utils/geo_grid.py#L340

So, internally we do handle the problem correctly, i.e., we sort of have a database of geogrids and we only extract the one corresponding to the requested burst ID for processing.

However, we have not addressed the following:

  • Modify the schema to support multiple user-defined geogrids (if we decide to allocate the bbox in the burst map, this feature is highly desirable)
  • Introduce proper checking, i.e., the number of allocated geogrids must be equal to the number of user-provided burst IDs
  • Correctly modify the internally generated default geogrid based on the user-defined parameters
  • Loop over the available geogrids when generating the GSLC

hfattahi commented 2 years ago

@hfattahi Note that internally our geogrid is a dictionary of geogrids with keys being the different burst IDs

https://github.com/opera-adt/COMPASS/blob/619c7a1507faca68cdb4ac5bd74c6b598c3c78c8/src/compass/utils/geo_grid.py#L340

So, internally we do handle the problem correctly, i.e., we sort of have a database of geogrids and we only extract the one corresponding to the requested burst ID for processing. However, we have not addressed the following:

  • Modify the schema to support multiple user-defined geogrids (if we decide to allocate the bbox in the burst map, this feature is highly desirable)
  • Introduce proper checking, i.e., the number of allocated geogrids must be equal to the number of user-provided burst IDs
  • Correctly modify the internally generated default geogrid based on the user-defined parameters
  • Loop over the available geogrids when generating the GSLC

Yes, I'm aware of what we have. To be clear, the two options I was talking about were two options for the interface. I think everything you have listed above falls under my suggested option 1. If I understand correctly, you suggest that the user (e.g., PCM) populates the run config with multiple geogrids (up to 27). However, I think the other option would be to change the interface and allow for an extra ancillary input file that contains the database of burst IDs and the geogrid of each ID. This would be option 2. There are probably pros and cons to both options. Option 1 adds some work on the PCM side. Option 2 adds more on the SAS side but keeps the interface a bit cleaner.
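
To make the two interface options concrete, here is a minimal sketch with hypothetical key names, written as Python dictionaries rather than the actual runconfig YAML: option 1 carries one geogrid per requested burst directly in the run config, while option 2 only points the SAS at an ancillary burst database.

# Option 1 (hypothetical): the run config lists a geogrid for every requested burst.
runconfig_option_1 = {
    "burst_id": ["t078_165495_iw1"],
    "geogrids": {
        "t078_165495_iw1": {
            "epsg": 32614,
            "bbox": (280200.0, 1990550.0, 377250.0, 2036550.0),
        },
    },
}

# Option 2 (hypothetical): the run config only names an ancillary database;
# the SAS queries it for the geogrid of each requested burst ID.
runconfig_option_2 = {
    "burst_id": ["t078_165495_iw1"],
    "burst_database_file": "/path/to/burst_map_bbox_only.sqlite3",
}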

vbrancat commented 2 years ago

I understand what you mean now. So the preparation of the database would fall on us instead of PCM. I am all for option 2 for production. If I understand correctly, the database would only cover the US and its territories; how do we handle users who want to use the workflow in other areas of the world?

hfattahi commented 2 years ago

If we go this way, yes, we should create that database and deliver it to SDS. It will be static; PCM can stage it for the PGE/SAS to use. Similarly, I assume outside users could either use the same database (assuming we make it available), or we could provide a pre-processor to generate that database locally for the user's area of interest (I think we already have a couple of prototypes). Perhaps the input database could be optional for the SAS: if the user has not provided that input, then the SAS determines the output geogrid for the bursts in the input zip file (obviously in this case the shape of the geocoded grids won't be the same through time).

LiangJYu commented 2 years ago

Would looking up the database replace generate_geogrids in geo_runconfig.load_from_yaml?

The replacement would look something like:

def geogrids_from_db(bursts):
    # Build a dict of geogrids keyed by burst ID from a burst-geogrid database.
    geogrids = {}
    for burst in bursts:
        geogrids[burst.burst_id] = burst_geogrid_db.get(burst.burst_id)

    return geogrids

LiangJYu commented 2 years ago

3c3ce09 (untested) addresses loading geogrids from the burst database. The database file is nearly 400 MB and I have concerns about including it in the repo. @vbrancat @hfattahi what are your thoughts? Maybe the database is for the entire world, and we can make an OPERA-specific one that is hopefully small enough to include with the repo?

vbrancat commented 2 years ago

3c3ce09 (untested) addresses loading geogrids from the burst database. The database file is nearly 400 MB and I have concerns about including it in the repo. @vbrancat @hfattahi what are your thoughts? Maybe the database is for the entire world, and we can make an OPERA-specific one that is hopefully small enough to include with the repo?

Our plan is to expose the database at the schema/runconfig level, not to include it in the repository. The idea is that PCM will have a copy of the database stored locally and will provide it to the SAS together with the burst ID, SAFE file, orbits, and DEM; the SAS then gets from the database the geogrid bounding-box coordinates of the bursts to process.

scottstanie commented 1 year ago

The database file is nearly 400 MB and I have concerns about including it in the repo.

Not saying we should include it in the repo, but just noting that when the database only contains burst_id, epsg, and the x/y min/max values, it's ~46 MB for the whole world.
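
For reference, a table holding only those fields could be created with something like the following sketch (hypothetical table and column names; not the actual schema of the burst database mentioned above):

import sqlite3

# Hypothetical minimal schema with just burst_id, EPSG, and x/y min/max,
# in the spirit of the ~46 MB worldwide database described above.
con = sqlite3.connect("burst_map_bbox_only.sqlite3")
con.execute(
    """
    CREATE TABLE IF NOT EXISTS burst_id_map (
        burst_id TEXT PRIMARY KEY,
        epsg     INTEGER,
        xmin     REAL,
        ymin     REAL,
        xmax     REAL,
        ymax     REAL
    )
    """
)
con.commit()
con.close()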

LiangJYu commented 1 year ago

Everything works on my end now. @vbrancat Thanks for the preliminary testing of the half-baked code. @scottstanie Thanks for helping me debug my SQL mess.

vbrancat commented 1 year ago

I have tested the default behavior for this PR, i.e., when no burst_id is assigned, the SAS should process all the bursts inside the SAFE file.

In this case, the SAS runs to completion, but the bursts are not geocoded onto the correct geogrids. The issue seems to be an erroneous mapping between each burst_id to process and its assigned geogrid.

For burst_id t078_165495_iw1, the geogrid bbox from the burst DB available at /u/aurora-r0/staniewi/dev/burst_map_bbox_only.sqlite3 is (280200.0, 1990550.0, 377250.0, 2036550.0), while the SAS assigns (361550.0, 2012950.0, 462550.0, 2058450.0).

Printing the EPSGs, burst_ids, and bboxes at this line https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/geo_grid.py#L317 gives:

EPSGs [32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614]
Bboxes [(365000.0, 1994450.0, 465950.0, 2040000.0), `(280200.0, 1990550.0, 377250.0, 2036550.0)`, (361550.0, 2012950.0, 462550.0, 2058450.0), (446900.0, 2034900.0, 537750.0, 2079750.0), (276750.0, 2009050.0, 373800.0, 2055000.0), (358100.0, 2031400.0, 459100.0, 2076850.0), (443450.0, 2053350.0, 534300.0, 2098150.0), (273300.0, 2027550.0, 370350.0, 2073450.0), (354700.0, 2049850.0, 455650.0, 2095300.0), (440000.0, 2071750.0, 530900.0, 2116550.0), (269850.0, 2046050.0, 366900.0, 2091950.0), (351250.0, 2068300.0, 452250.0, 2113750.0), (436600.0, 2090200.0, 527450.0, 2134950.0), (266400.0, 2064550.0, 363500.0, 2110400.0), (347800.0, 2086800.0, 448800.0, 2132150.0), (433200.0, 2108600.0, 524050.0, 2153350.0), (263000.0, 2083050.0, 360050.0, 2128850.0), (344400.0, 2105250.0, 445400.0, 2150600.0), (429750.0, 2127050.0, 520650.0, 2171750.0), (259550.0, 2101550.0, 356650.0, 2147350.0), (341000.0, 2123700.0, 442000.0, 2169050.0), (426350.0, 2145500.0, 517250.0, 2190150.0), (256150.0, 2120050.0, 353200.0, 2165800.0), (337600.0, 2142200.0, 438600.0, 2187450.0), (423000.0, 2163900.0, 513850.0, 2208550.0), (252750.0, 2138550.0, 349800.0, 2184250.0), (419700.0, 2182350.0, 510550.0, 2227000.0)]
Burst IDs {'t078_165503_iw1', 't078_165496_iw1', 't078_165495_iw1', 't078_165502_iw1', 't078_165502_iw2', 't078_165503_iw3', 't078_165498_iw1', 't078_165496_iw2', 't078_165497_iw2', 't078_165495_iw2', 't078_165498_iw3', 't078_165499_iw1', 't078_165497_iw3', 't078_165499_iw2', 't078_165499_iw3', 't078_165501_iw3', 't078_165494_iw2', 't078_165498_iw2', 't078_165501_iw2', 't078_165496_iw3', 't078_165497_iw1', 't078_165500_iw3', 't078_165500_iw2', 't078_165501_iw1', 't078_165500_iw1', 't078_165502_iw3', 't078_165495_iw3'}

It seems that the bbox of t078_165495_iw1 (highlighted above) gets assigned to t078_165496_iw1, which in the list of bursts to process is the burst immediately before t078_165495_iw1.

Most likely, there is something strange in the way the database is queried in this function https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/helpers.py#L222

vbrancat commented 1 year ago

The other tested behavior is when the user assigns one burst_id or a list of burst_ids in the run configuration file. In this case, the SAS does not run to completion; it throws the following error:

Traceback (most recent call last):
  File "/mnt/aurora-r0/vbrancat/codes/COMPASS/src/compass/s1_cslc.py", line 54, in <module>
    main()
  File "/mnt/aurora-r0/vbrancat/codes/COMPASS/src/compass/s1_cslc.py", line 49, in main
    run(parser.run_config_path, parser.args.grid_type)
  File "/mnt/aurora-r0/vbrancat/codes/COMPASS/src/compass/s1_cslc.py", line 40, in run
    cfg = GeoRunConfig.load_from_yaml(run_config_path, 's1_cslc_geo')
  File "/mnt/aurora-r0/vbrancat/codes/miniconda3/envs/compass/lib/python3.10/site-packages/compass/utils/geo_runconfig.py", line 86, in load_from_yaml
    geogrids = generate_geogrids_from_db(bursts, geocoding_dict,
  File "/mnt/aurora-r0/vbrancat/codes/miniconda3/envs/compass/lib/python3.10/site-packages/compass/utils/geo_grid.py", line 320, in generate_geogrids_from_db
    epsg_bbox_dict = dict(zip(burst_ids, zip(epsgs, bboxes)))
TypeError: 'int' object is not iterable

This is because the EPSG here https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/geo_grid.py#L317 is not a list but an integer.
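
One way the single-burst case could be made to behave like the multi-burst case is to promote the scalar return values to one-element lists before building the dictionary. A hedged sketch (not necessarily the actual fix adopted in the PR):

def normalize_db_lookup(epsgs, bboxes):
    # The DB lookup may return a bare int EPSG and a single bbox tuple when
    # only one burst is requested; promote both to one-element lists so the
    # downstream dict(zip(burst_ids, zip(epsgs, bboxes))) still works.
    if isinstance(epsgs, int):
        epsgs, bboxes = [epsgs], [bboxes]
    return epsgs, bboxes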

scottstanie commented 1 year ago

I think I see why it's getting mixed up:

We're getting the ordering of the burst_ids here:

    # get all burst IDs and their EPSGs + bounding boxes
    burst_ids = set([b.burst_id for b in bursts])

https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/geo_grid.py#L314-L315

then ignoring one of the return values (the _) from helpers.burst_bbox_from_db in the next line:

    epsgs, bboxes, _ = helpers.burst_bbox_from_db(burst_ids, burst_db_file)

But the SQL query will generally produce rows in a different order than the one you supply when you use WHERE ... IN, so you need to use that return value (if you want to keep the SQL as-is).
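
A minimal sketch of a lookup that sidesteps the ordering problem by keying each row on the burst_id it was returned with, rather than zipping against the input order (hypothetical table and column names; the real burst_bbox_from_db signature and schema may differ):

import sqlite3

def bbox_per_burst(burst_ids, burst_db_file):
    # Return {burst_id: (epsg, (xmin, ymin, xmax, ymax))}; the row order from
    # WHERE ... IN no longer matters because each row carries its burst_id.
    placeholders = ",".join("?" * len(burst_ids))
    query = (
        "SELECT burst_id, epsg, xmin, ymin, xmax, ymax "
        f"FROM burst_id_map WHERE burst_id IN ({placeholders})"
    )
    with sqlite3.connect(burst_db_file) as con:
        rows = con.execute(query, list(burst_ids)).fetchall()
    return {
        bid: (epsg, (xmin, ymin, xmax, ymax))
        for bid, epsg, xmin, ymin, xmax, ymax in rows
    }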

vbrancat commented 1 year ago

As @scottstanie noticed, generating the DEM raster is a quick fix that solves the artifact issue. This PR seems good to go if the following items are addressed:

  1. Move the DEM raster creation inside the for loop and delete the DEM raster afterwards (see Scott's comment and the sketch after this list)
  2. Remove the output_epsg, top_left, and bottom_right entries from the geocoded CSLC schema. Their behavior is undefined if we process more than one burst. If we remove them, we might break the stack processor.
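
A hedged sketch of item 1, with make_dem_raster and geocode_burst passed in as stand-ins for the actual COMPASS/ISCE3 calls (both are hypothetical parameters, not real COMPASS functions):

import os
import tempfile

def geocode_all_bursts(bursts, geogrids, dem_source, make_dem_raster, geocode_burst):
    # Create the DEM raster for each burst inside the loop and delete it
    # once the burst has been geocoded, instead of keeping one global raster.
    for burst in bursts:
        geogrid = geogrids[burst.burst_id]
        fd, dem_path = tempfile.mkstemp(suffix=".dem.tif")
        os.close(fd)
        try:
            make_dem_raster(dem_source, geogrid, dem_path)
            geocode_burst(burst, geogrid, dem_path)
        finally:
            if os.path.exists(dem_path):
                os.remove(dem_path)
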
scottstanie commented 1 year ago

It seems to be working once I set up the YAML correctly. There will be some adjustments to the s1_geocode_stack script now that we've changed things.

That should be a separate PR, correct?

vbrancat commented 1 year ago

It seems to be working once I set up the YAML correctly. There will be some adjustments to the s1_geocode_stack script now that we've changed things.

That should be a separate PR, correct?

Correct. It would be better as a separate PR :)