Closed LiangJYu closed 1 year ago
Now that we allow processing multiple burst IDs within a single SAFE file, do we have a plan to allow the user to set more than one geogrid? Theoretically, as many geogrids as the number of bursts the user decides to process. I have not seen this addressed in this PR.
@vbrancat That's a very good point. What are our choices? I can think of two: 1) add the geogrid of all requested bursts to the run config, or 2) introduce an ancillary dataset (database) which has predefined geogrid parameters for each ID.
I think option 2 is cleaner. The SAS would then query the database for each ID and extract the geogrid that needs to be used.
What do you think? Can you think of another choice?
@hfattahi Note that internally our geogrid is a dictionary of geogrids with keys being the different burst IDs: https://github.com/opera-adt/COMPASS/blob/619c7a1507faca68cdb4ac5bd74c6b598c3c78c8/src/compass/utils/geo_grid.py#L340
So, internally we do handle the problem correctly, i.e., we sort of have a database of geogrids and we only extract the one corresponding to the requested burst ID for processing. However, we have not addressed the following:
- modify the schema to support multiple user-defined geogrids (if we decide to allocate the `bbox` in the burst map, this feature is highly desirable)
- introduce proper checking, i.e., the number of allocated geogrids must be equal to the number of user-provided burst IDs
- correctly modify the internal default generated geogrid based on the user-defined parameters
- loop over available geogrids when generating the GSLC
Yes, I'm aware of what we have. To be clear, the two options I was talking about were the two options for the interface. I think everything you have listed above falls under my suggested option 1. If I understand correctly, you suggest that the user (e.g., PCM) populates the run config with multiple geogrids (up to 27). However, the other option would be to change the interface to allow for an extra input ancillary file containing the database of burst IDs and the geogrid of each ID. This would be option 2. There are probably pros and cons to both options. Option 1 adds some work on the PCM side. Option 2 adds more on the SAS side but keeps the interface a bit cleaner.
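For concreteness, the two interfaces might look something like this in a runconfig. This is a hypothetical sketch: the key names are illustrative, not COMPASS's actual runconfig schema, and the EPSG/bbox values and database path are examples.

```yaml
# Hypothetical sketch -- key names are illustrative, not the actual
# COMPASS runconfig schema.

# Option 1: PCM populates the runconfig with one geogrid per requested burst
geocoding:
  geogrids:
    t078_165495_iw1:
      epsg: 32614
      bbox: [280200.0, 1990550.0, 377250.0, 2036550.0]
    t078_165496_iw1:
      epsg: 32614
      bbox: [365000.0, 1994450.0, 465950.0, 2040000.0]

# Option 2: a single static ancillary database; the SAS looks up each burst ID
dynamic_ancillary_file_group:
  burst_database_file: /path/to/burst_geogrid_db.sqlite3
```

Option 1 keeps the SAS simple but requires PCM to compute and serialize up to 27 geogrids per job; option 2 moves that work into a one-time static deliverable.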
I understand what you mean now. So the preparation of the database would fall on us instead of on PCM. I am all for option 2 for production. If I understand correctly, the database would only cover the US and its territories; how do we handle users who want to use the workflow in other areas of the world?
If we go this way, yes, we should create that database and deliver it to SDS. It will be static; PCM can stage it for the PGE/SAS to use. I assume outside users could similarly either use the same database (assuming we make it available), or we could provide a pre-processor to generate that database locally for a user's area of interest (I think we already have a couple of prototypes). Perhaps the input database could be optional for the SAS: if the user has not provided that input, then the SAS determines the output geogrid for the bursts in the input zip file (obviously in this case the shape of the geocoded grids won't be the same through time).
Would looking up the database replace `generate_geogrids` in `geo_runconfig.load_from_yaml`?
The replacement would look something like:

```python
def geogrids_from_db(bursts):
    geogrids = {}
    for burst in bursts:
        geogrids[burst.burst_id] = burst_geogrid_db.get(burst.burst_id)
    return geogrids
```
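For context, here is a self-contained sketch of what a SQLite-backed version of this lookup could look like. The table name `burst_id_map` and its columns are assumptions for illustration; the real schema of the burst database may differ.

```python
import sqlite3


def geogrids_from_sqlite(burst_ids, db_path):
    """Look up (epsg, bbox) per burst ID from a SQLite burst database.

    Assumes a hypothetical table ``burst_id_map`` with columns
    burst_id, epsg, xmin, ymin, xmax, ymax -- the actual schema of
    the delivered burst database may differ.
    """
    geogrids = {}
    con = sqlite3.connect(db_path)
    try:
        for burst_id in burst_ids:
            # One query per ID keeps the ID -> row mapping unambiguous.
            row = con.execute(
                "SELECT epsg, xmin, ymin, xmax, ymax "
                "FROM burst_id_map WHERE burst_id = ?",
                (burst_id,),
            ).fetchone()
            if row is None:
                raise ValueError(f"burst ID {burst_id} not found in {db_path}")
            epsg, xmin, ymin, xmax, ymax = row
            geogrids[burst_id] = (epsg, (xmin, ymin, xmax, ymax))
    finally:
        con.close()
    return geogrids
```

Failing loudly on a missing ID (rather than returning a partial dict) would make the "number of geogrids must match number of burst IDs" check come for free.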
3c3ce09 (untested) addresses loading geogrids from the burst database. The database file is nearly 400MB and I have concerns about including it in the repo. @vbrancat @hfattahi what are your thoughts? Maybe the database is for the entire world, and we could make an OPERA-specific one that is hopefully small enough to include with the repo?
Our plan is to expose the database at a schema/runconfig level not to include it in the repository. The idea is that PCM will have a copy of the database stored locally, will provide the database to the SAS together with the burst ID, SAFE file, orbits and DEM, and the SAS gets from the database the geogrid bounding box coordinates of the bursts to process.
> The database file is nearly 400MB and I have concerns about including it in the repo.

Not saying we should include it in the repo, but just noting that when the database only contains `burst_id`, `epsg`, and x/y min/maxes, it's ~46MB for the world.
Everything works on my end now. @vbrancat Thanks for preliminary testing of half-baked code. @scottstanie Thanks for helping me debug my SQL mess.
I have tested the default behavior for this PR, i.e., when no `burst_id` is assigned, the SAS should process all the bursts inside the SAFE file. In this case, the SAS runs to completion, but the bursts are not all geocoded onto the correct geogrid. The issue seems to be an erroneous mapping between the `burst_id` to process and the assigned geogrid. For the burst ID `t078_165495_iw1`, the geogrid grabbed from the burst DB available at `/u/aurora-r0/staniewi/dev/burst_map_bbox_only.sqlite3` is `280200.0 1990550.0 377250.0 2036550.0`, while the SAS assigns `(361550.0, 2012950.0, 462550.0, 2058450.0)`.
Printing the EPSGs, burst IDs, and bboxes at this line https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/geo_grid.py#L317 gives:
EPSGs [32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614, 32614]
Bboxes [(365000.0, 1994450.0, 465950.0, 2040000.0), `(280200.0, 1990550.0, 377250.0, 2036550.0)`, (361550.0, 2012950.0, 462550.0, 2058450.0), (446900.0, 2034900.0, 537750.0, 2079750.0), (276750.0, 2009050.0, 373800.0, 2055000.0), (358100.0, 2031400.0, 459100.0, 2076850.0), (443450.0, 2053350.0, 534300.0, 2098150.0), (273300.0, 2027550.0, 370350.0, 2073450.0), (354700.0, 2049850.0, 455650.0, 2095300.0), (440000.0, 2071750.0, 530900.0, 2116550.0), (269850.0, 2046050.0, 366900.0, 2091950.0), (351250.0, 2068300.0, 452250.0, 2113750.0), (436600.0, 2090200.0, 527450.0, 2134950.0), (266400.0, 2064550.0, 363500.0, 2110400.0), (347800.0, 2086800.0, 448800.0, 2132150.0), (433200.0, 2108600.0, 524050.0, 2153350.0), (263000.0, 2083050.0, 360050.0, 2128850.0), (344400.0, 2105250.0, 445400.0, 2150600.0), (429750.0, 2127050.0, 520650.0, 2171750.0), (259550.0, 2101550.0, 356650.0, 2147350.0), (341000.0, 2123700.0, 442000.0, 2169050.0), (426350.0, 2145500.0, 517250.0, 2190150.0), (256150.0, 2120050.0, 353200.0, 2165800.0), (337600.0, 2142200.0, 438600.0, 2187450.0), (423000.0, 2163900.0, 513850.0, 2208550.0), (252750.0, 2138550.0, 349800.0, 2184250.0), (419700.0, 2182350.0, 510550.0, 2227000.0)]
Burst IDs {'t078_165503_iw1', 't078_165496_iw1', 't078_165495_iw1', 't078_165502_iw1', 't078_165502_iw2', 't078_165503_iw3', 't078_165498_iw1', 't078_165496_iw2', 't078_165497_iw2', 't078_165495_iw2', 't078_165498_iw3', 't078_165499_iw1', 't078_165497_iw3', 't078_165499_iw2', 't078_165499_iw3', 't078_165501_iw3', 't078_165494_iw2', 't078_165498_iw2', 't078_165501_iw2', 't078_165496_iw3', 't078_165497_iw1', 't078_165500_iw3', 't078_165500_iw2', 't078_165501_iw1', 't078_165500_iw1', 't078_165502_iw3', 't078_165495_iw3'}
It seems that the bbox of `t078_165495_iw1` (highlighted above) gets assigned to `t078_165496_iw1`, which in the list of bursts to process is the burst immediately before `t078_165495_iw1`.
Most likely, there is something strange in the way the database is queried in this function https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/helpers.py#L222
The other tested behavior is when the user assigns one `burst_id` or a list of `burst_id`s in the run configuration file. In this case, the SAS does not run to completion but throws the following error:
```
Traceback (most recent call last):
  File "/mnt/aurora-r0/vbrancat/codes/COMPASS/src/compass/s1_cslc.py", line 54, in <module>
    main()
  File "/mnt/aurora-r0/vbrancat/codes/COMPASS/src/compass/s1_cslc.py", line 49, in main
    run(parser.run_config_path, parser.args.grid_type)
  File "/mnt/aurora-r0/vbrancat/codes/COMPASS/src/compass/s1_cslc.py", line 40, in run
    cfg = GeoRunConfig.load_from_yaml(run_config_path, 's1_cslc_geo')
  File "/mnt/aurora-r0/vbrancat/codes/miniconda3/envs/compass/lib/python3.10/site-packages/compass/utils/geo_runconfig.py", line 86, in load_from_yaml
    geogrids = generate_geogrids_from_db(bursts, geocoding_dict,
  File "/mnt/aurora-r0/vbrancat/codes/miniconda3/envs/compass/lib/python3.10/site-packages/compass/utils/geo_grid.py", line 320, in generate_geogrids_from_db
    epsg_bbox_dict = dict(zip(burst_ids, zip(epsgs, bboxes)))
TypeError: 'int' object is not iterable
```
This is because the EPSG here https://github.com/LiangJYu/COMPASS/blob/342f523787765a0126963e62e30522323e96dfc2/src/compass/utils/geo_grid.py#L317 is not a list but an integer.
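The symptom can be reproduced in isolation: `zip` raises exactly this `TypeError` when handed a bare int. One possible guard, sketched below, is to broadcast a scalar EPSG before zipping; this is an illustration, not the actual COMPASS fix.

```python
def pair_epsg_bbox(burst_ids, epsgs, bboxes):
    """Build {burst_id: (epsg, bbox)}, tolerating a scalar EPSG.

    Hypothetical helper for illustration: if the DB lookup collapsed a
    single shared EPSG to a bare int, broadcast it back to one entry
    per burst before zipping, so dict(zip(...)) doesn't blow up.
    """
    if isinstance(epsgs, int):
        epsgs = [epsgs] * len(bboxes)
    return dict(zip(burst_ids, zip(epsgs, bboxes)))
```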
I think I see why it's getting mixed up: we're getting the ordering of the burst IDs here

```python
# get all burst IDs and their EPSGs + bounding boxes
burst_ids = set([b.burst_id for b in bursts])
```

then ignoring the return value from `helpers.burst_bbox_from_db` in the next line

```python
epsgs, bboxes, _ = helpers.burst_bbox_from_db(burst_ids, burst_db_file)
```

but the SQL query will generally produce rows in a different order than you supply them when you use `WHERE ... IN`, so you need to use the return value (if you want to keep the SQL as is).
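The remedy described here (key results by the identifier the database returns with each row, rather than by the order the IDs were supplied) can be sketched as follows; the table name `burst_id_map` and its columns are illustrative, not COMPASS's actual schema.

```python
import sqlite3


def bboxes_by_id(burst_ids, con):
    """Query several burst IDs at once, keyed by the returned burst_id.

    SQL gives no guarantee that WHERE ... IN returns rows in the order
    the IDs were supplied, so each row's own burst_id column is used as
    the dict key instead of zipping against the caller's ordering.
    """
    placeholders = ",".join("?" * len(burst_ids))
    rows = con.execute(
        "SELECT burst_id, epsg, xmin, ymin, xmax, ymax "
        f"FROM burst_id_map WHERE burst_id IN ({placeholders})",
        list(burst_ids),
    ).fetchall()
    # Key by the burst_id that came back with the row, not by input order.
    return {r[0]: (r[1], tuple(r[2:])) for r in rows}
```

Because the mapping is keyed by the returned identifier, the result is correct no matter what order the database emits the rows in.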
As @scottstanie noticed, generating the DEM raster is a quick fix that solves the artifact issue. This PR seems good to go if the following gets addressed:
- `for` loop and delete DEM raster (see Scott's comment)
- `output_epsg`, `top_left`, `bottom_right` entries from the geocoded CSLC schema: their behavior is undefined if we process more than one burst, but if we remove them, we might break the stack processor

It seems to be working once I get the YAML correct. There will be some adjustments to the `s1_geocode_stack` script now that we've changed things.
That should be a separate PR, correct?
Correct. It would be better if it is a separate PR :)
This PR allows multiple bursts to be processed by all COMPASS workflow modules.
If no burst ID(s) are given in the runconfig, then all bursts in the provided SAFE will be processed.
To-do: testing