ssec / polar2grid

Tools for reading, remapping, and writing satellite instrument data.
http://www.ssec.wisc.edu/software/polar2grid/
GNU General Public License v3.0

MODIS overpass version 2.3 version 3.0 processing speeds #423

Closed kathys closed 2 years ago

kathys commented 2 years ago

I wrote scripts to time the creation of all P2G MODIS GeoTIFF default bands for a 15-minute pass using the default WGS84 dynamic grid. For P2G Version 2.3, I added together the times it took to create the default images for crefl (true and false color) and for the vis/ir bands. For P2G Version 3.0, I used 4 workers, which I think is the default. Here are the results:
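For reference, the per-command timing described above can be sketched in Python (a sketch only; the `polar2grid.sh` paths in the usage comment are placeholders, and the original scripts used the shell `time` builtin instead):

```python
import subprocess
import time

def time_cmd(args):
    """Run a command and return (elapsed_seconds, returncode)."""
    start = time.perf_counter()
    result = subprocess.run(args, capture_output=True)
    return time.perf_counter() - start, result.returncode

# Hypothetical usage, summing the two v2.3 calls as described above:
#   t1, _ = time_cmd(["./polar2grid.sh", "crefl", "gtiff", "-f", "/path/to/input"])
#   t2, _ = time_cmd(["./polar2grid.sh", "modis", "gtiff", "-f", "/path/to/input"])
#   print(f"v2.3 total: {t1 + t2:.0f}s")
```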

P2G Version 2.3: 8m22s
P2G Version 3.0: 15m33s

It took almost twice as long to create the images with P2G v3.0, even using more CPUs, than it did with v2.3 on the machine bumi. I cannot imagine releasing the software until this is improved. Dynamic grids are the way Liam and I at SSEC use the software most, and I suspect the community does too.

djhoese commented 2 years ago

Could you give me the exact commands you ran and the individual timings for each command, if you have them? (If not, that's fine.)

kathys commented 2 years ago

Version 3.0:

    /data/users/kathys/polar2grid-swbundle-20211219-125805/bin/polar2grid.sh -r modis_l1b -w geotiff --num-workers 4 --fill-value 0 -f /data/users/kathys/version_3_0_timing/input_modis_china
    INFO : Sorting and reading input files...
    INFO : Loading product metadata from files...
    INFO : Using default product list: ['vis01', 'vis02', 'vis03', 'vis04', 'vis05', 'vis06', 'vis07', 'vis26', 'bt20', 'bt21', 'bt22', 'bt23', 'bt24', 'bt25', 'bt27', 'bt28', 'bt29', 'bt30', 'bt31', 'bt32', 'bt33', 'bt34', 'bt35', 'bt36', 'true_color', 'false_color', 'fog']
    INFO : Running day coverage filtering...
    INFO : Running night coverage filtering...
    INFO : Computing dynamic grid parameters...
    INFO : Checking products for sufficient output grid coverage (grid: 'wgs84_fit')...
    INFO : Resampling to 'wgs84_fit' using 'ewa' resampling...
    INFO : Computing products and saving data to writers...
    INFO : SUCCESS

    real    15m33.582s
    user    21m7.667s
    sys     8m54.846s

Version 2.3

    /data/users/kathys/polar2grid_v_2_3/bin/polar2grid.sh crefl gtiff -f /data/users/kathys/version_3_0_timing/input_modis_china
    INFO : Initializing reader...
    INFO : Could not find any existing crefl output will use MODIS SDRs to create some
    .....
    INFO : Saving true color image to filename 'grid_wgs84_fit_true_color.dat'
    INFO : Creating output from data mapped to grid wgs84_fit
    INFO : Creating geotiff 'aqua_modis_modis_crefl04_500m_20210318_060500_wgs84_fit.tif'
    INFO : Creating geotiff 'aqua_modis_modis_crefl03_500m_20210318_060500_wgs84_fit.tif'
    INFO : Creating geotiff 'aqua_modis_modis_crefl01_500m_20210318_060500_wgs84_fit.tif'
    INFO : Creating geotiff 'aqua_modis_modis_crefl01_250m_20210318_060500_wgs84_fit.tif'
    INFO : Creating geotiff 'aqua_modis_true_color_20210318_060500_wgs84_fit.tif'
    INFO : Processing data for grid wgs84_fit complete

    real    3m8.224s
    user    2m25.329s
    sys     0m41.183s

    /data/users/kathys/polar2grid_v_2_3/bin/polar2grid.sh modis gtiff -f /data/users/kathys/version_3_0_timing/input_modis_china
    INFO : Initializing reader...
    ....
    INFO : Processing data for grid wgs84_fit complete

    real    5m14.793s
    user    3m58.525s
    sys     1m16.368s

kathys commented 2 years ago

In fairness, I forgot to include the creation of the false color image when executing my Version 2.3 crefl command.

djhoese commented 2 years ago

@kathys Could you get me the timings for the 2.3 false color generation, and try running 3.0 generating the products in groups (all vis/ir in one group and true/false in another)?

kathys commented 2 years ago

Interesting results. Version 3.0 true/false color and IR/VIS GeoTIFF images done separately.

Running Polar2grid v3.0 true/false color modis image processing at Wed Mar 2 22:58:07 UTC 2022

    /data/users/kathys/polar2grid-swbundle-20211219-125805/bin/polar2grid.sh -r modis_l1b -w geotiff --num-workers 4 --fill-value 0 -p true_color false_color -f /data/users/kathys/version_3_0_timing/input_modis_china
    INFO : Sorting and reading input files...
    INFO : Loading product metadata from files...
    INFO : Running day coverage filtering...
    INFO : Running night coverage filtering...
    INFO : Computing dynamic grid parameters...
    INFO : Checking products for sufficient output grid coverage (grid: 'wgs84_fit')...
    INFO : Resampling to 'wgs84_fit' using 'ewa' resampling...
    INFO : Computing products and saving data to writers...
    INFO : SUCCESS

    real    10m20.211s
    user    15m10.958s
    sys     5m27.454s

    /data/users/kathys/polar2grid-swbundle-20211219-125805/bin/polar2grid.sh -r modis_l1b -w geotiff --num-workers 4 --fill-value 0 -p bt20 bt21 bt22 bt23 bt24 bt25 bt27 bt28 bt29 bt30 bt31 bt32 bt33 bt34 bt35 bt36 vis01 vis02 vis03 vis04 vis05 vis06 vis07 vis26 -f /data/users/kathys/version_3_0_timing/input_modis_china
    INFO : Sorting and reading input files...
    INFO : Loading product metadata from files...
    INFO : Running day coverage filtering...
    INFO : Running night coverage filtering...
    INFO : Computing dynamic grid parameters...
    INFO : Checking products for sufficient output grid coverage (grid: 'wgs84_fit')...
    INFO : Resampling to 'wgs84_fit' using 'ewa' resampling...
    INFO : Computing products and saving data to writers...
    INFO : SUCCESS

    real    7m48.336s
    user    10m12.059s
    sys     3m27.965s

Finished Polar2Grid v3.0 processing at Wed Mar 2 23:16:16 UTC 2022

kathys commented 2 years ago

Version 2.3 true/false color and IR/VIS GeoTIFF images done separately.

Running Polar2grid v2.3 true/false color modis image processing at Wed Mar 2 22:48:27 UTC 2022

    /data/users/kathys/polar2grid_v_2_3/bin/polar2grid.sh crefl gtiff --true-color --false-color -f /data/users/kathys/version_3_0_timing/input_modis_china
    INFO : Initializing reader...
    INFO : Could not find any existing crefl output will use MODIS SDRs to create some
    ...
    INFO : Processing data for grid wgs84_fit complete

    real    4m16.376s
    user    3m19.179s
    sys     0m57.272s

Running Polar2grid v2.3 bt/vis modis image processing at Wed Mar 2 22:52:44 UTC 2022

    /data/users/kathys/polar2grid_v_2_3/bin/polar2grid.sh modis gtiff -f /data/users/kathys/version_3_0_timing/input_modis_china

INFO : Processing data for grid wgs84_fit complete

    real    5m23.714s
    user    4m1.390s
    sys     1m22.397s

djhoese commented 2 years ago

So doing them separately for 3.0 is...way slower. Great. And I think you forgot fog, so it should be even slower. Something else is going on here. By that I mean there must be something dumb in the Satpy reader that is dragging the entire processing down.

djhoese commented 2 years ago

How does generating all the products in 4m15s sound?

aqua_modis_true_color_20210318_060500_wgs84_fit.tif:


real    4m15.378s
user    8m32.510s
sys     1m35.956s

There are more optimizations that can be made, but this is the first batch. This execution also included fixes for the other issues you brought up (saturated pixels, band 21 possibly still having the wrong enhancement, etc.).

kathys commented 2 years ago

How does it sound? Beauteous!!!!!!

djhoese commented 2 years ago

Sadly I can't seem to reproduce that 4m15s after I removed all the hacks and tricks I was doing. I'm currently averaging about 5m30s for all the products. More optimizations to come I hope.

djhoese commented 2 years ago

So a couple findings from our meeting yesterday and some testing I did afterward. It seems my laptop is much faster than bumi. I get similar results to Kathy when I run on bumi with the tarball (~8m for true/false). When processing all products with the 3.0 tarball it was ~14m on bumi. When processing with 8 workers instead of 4 and with the input data in /dev/shm it was still 13m9s.

On my laptop, processing all products took about 6m30s. With some of the newer optimizations that aren't included in the tarball I get 5m40s. If I run the 3.0 tarball on my laptop I get 6m47s for all products. If I run the 2.3 tarball on my laptop I get 2m56s for true/false color. For 2.3 tarball on my laptop with all the regular band products I get 3m30s. I will have to use these new times as my benchmark. I'll note though that on my laptop at least these timings are pretty close, not faster, but not 2x+ slower either.

One thing that came to mind is that the new processing is producing a true/false color with the 500m MODIS input bands instead of the 1km input bands that P2G 2.3 was using (by accident). I should try generating a 1km version and see how it performs.

I should also check TC's monitoring of bumi and see if the disk is getting overloaded and can't handle all the parallel reads and writes.

djhoese commented 2 years ago

Ran all products with true/false starting from 500m: 6m21s on my machine. Ran all products with true/false starting from 1km: 5m59s on my machine.

So a difference, but not much. This is not the main source of performance issues.

djhoese commented 2 years ago

$ PYTROLL_CHUNK_SIZE=6000 time polar2grid.sh -r modis_l1b -w geotiff --num-workers 4 --fill-value 0 -p true_color false_color bt20 bt21 bt22 bt23 bt24 bt25 bt27 bt28 bt29 bt30 bt31 bt32 bt33 bt34 bt35 bt36 vis01 vis02 vis03 vis04 vis05 vis06 vis07 vis26 -f /data/satellite/modis/input_modis_china/MYD0*

This essentially tries to run the processing as close to one single chunk as possible. It ran in 5m35s (versus 6m30s). This still results in multiple chunks for the output grid (I'll try again to avoid that). This is very interesting: almost a minute of execution time saved just by using more memory.

Edit: This is very confusing. This processing should have been using some of the interpolation optimizations, so it should have been closer to 5m40s (what I quoted in the comment above for running all products on my system).

Edit 2: Yeah, running with the default chunk size got me 5m36s. So...I guess chunk size doesn't matter much.
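For context, what a chunk-size setting like this mainly changes is how many dask tasks get scheduled. An illustrative sketch (not Polar2Grid internals; the array shape here is arbitrary):

```python
import dask.array as da

# Larger chunks mean fewer tasks (less scheduler overhead) but more
# memory held per task; smaller chunks mean more tasks and parallelism.
small_chunks = da.zeros((8120, 5416), chunks=1354)
big_chunks = da.zeros((8120, 5416), chunks=6000)

print(small_chunks.npartitions)  # 24 chunks/tasks
print(big_chunks.npartitions)    # 2 chunks/tasks
```

When the whole swath fits in memory, pushing toward one chunk trades scheduler overhead for memory, which is why the big-chunk run above could save time without any algorithm changes.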

djhoese commented 2 years ago

So I did some basic tests; here is some more timing information.

Default modis true/false color as of the latest tarball. This uses the poor chunking strategy of the geotiepoints interpolation and nothing else special; it does not include any other optimizations.

real    5m48.814s
user    13m41.834s
sys     1m36.803s

With geotiepoints doing smarter chunking of the CVIIRS/angle-based interpolation (on optimize-modis-interp geotiepoints branch):

real    4m46.628s
user    10m41.394s
sys     1m59.493s

Using only P2G 2.3-style map_coordinates interpolation (no other optimizations):

real    3m47.945s
user    9m8.382s
sys     1m15.434s

Added the ability in Polar2Grid to "persist" the interpolated lon/lats (compute them once and then hold them in memory), along with the map_coordinates interpolation from above. This essentially removes the overhead of interpolating multiple times: previously the lon/lats were computed once for day/night filtering, once for dynamic grid freezing, and once for the actual resampling. I'm debating whether or not this should be done for all swath-based processing or only for more complex cases like modis.

real    2m59.513s
user    6m55.871s
sys     1m1.015s
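The "persist" trick described above can be sketched with dask. This is an illustration only: the arrays below stand in for the interpolated MODIS longitudes/latitudes that the real code gets from python-geotiepoints.

```python
import numpy as np
import dask
import dask.array as da

# Stand-ins for the interpolated lon/lat swath arrays (assumption: in
# Polar2Grid these come from an expensive interpolation graph).
n = 2030 * 1354
lons = da.from_array(np.linspace(-180.0, 180.0, n).reshape(2030, 1354), chunks=512)
lats = da.from_array(np.linspace(-90.0, 90.0, n).reshape(2030, 1354), chunks=512)

# Compute the graphs once and keep the resulting chunks in memory; later
# .compute() calls on these collections reuse the in-memory results
# instead of re-running the interpolation.
lons, lats = dask.persist(lons, lats)

# The three consumers (day/night filtering, dynamic grid freezing,
# resampling) now all read already-computed chunks.
lon_min = float(lons.min().compute())
lat_max = float(lats.max().compute())
```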

djhoese commented 2 years ago

Just for fun, I did all the products on my laptop like the above with the persisted lon/lats and got:

real    4m28.289s
user    9m12.405s
sys     1m24.544s

djhoese commented 2 years ago

So I just started using /data/dist/polar2grid-swbundle-20220611-202644.tar.gz which uses the persisting trick of #472 and I ran it on bumi. As a reminder v2.3 on bumi gets:

True and false color:

real 4m16.376s
user 3m19.179s
sys 0m57.272s

All band products:

real 5m23.714s
user 4m1.390s
sys 1m22.397s

The previous beta took ~14m to do all products (true/false/bands) on bumi. The new version gets:

real    11m47.786s
user    18m46.450s
sys     17m35.809s

So better, but still terrible compared to v2.3. Just bands:

real    6m45.294s
user    9m7.736s
sys     8m31.066s

Just true/false color:

real    8m42.232s
user    14m16.692s
sys     13m44.493s

I got fed up with this and tried forcing the use of "legacy EWA", which is still relatively dask friendly but requires loading all the data into memory at once. This does all normal bands in 5m21s and true/false in 6m7s. So even that is not fast enough, but it is a pretty easy guess that the Python crefl algorithm is slower than the C crefl algorithm; yet another thing that could be sped up.

Edit: Just reran all band products with 2.3 on bumi and it was 5m40s.

djhoese commented 2 years ago

Running this data to grid 204 with v2.3 is 4m11s, and 3.0 beta is 4m31s.

djhoese commented 2 years ago

I have a hacked version of main running on bumi (where I copied the changed code from #483 into the previous tarball) and I now get these timings:

Regular band products:

real    4m40.545s
user    6m7.102s
sys     4m42.599s

Only true/false color:

real    6m15.633s
user    10m49.273s
sys     9m6.608s

Running all band products and true/false:

real    9m14.641s
user    15m4.571s
sys     12m8.886s

Running v2.3 on bumi again just to be sure shows all band products are produced with:

real    5m27.238s
user    4m8.161s
sys     1m18.116s

Since that is almost the same as the last time I ran it, I don't really feel like running true/false color with v2.3 right now. We'll assume the same timing of 4m16s.

In summary:

  1. All band products is now ~50s faster.
  2. True/false product generation is still slower by almost 2 minutes. The best way to get a major improvement here is to rewrite the current python-based crefl algorithm back into the original C code.
  3. As mentioned in #483, running only a few bands still takes longer in this 3.0 beta than it did in v2.3. I think this can be minimized if/when the cython (think C) version of the MODIS interpolation, which is still in progress in the upstream python-geotiepoints package, is finished. Few products performing comparatively worse than processing a lot of products makes sense if you think about keeping all of the dask workers busy: in the few-products case some of the dask workers end up doing very little or nothing, so we aren't getting the full benefit of our dask usage. This does not explain why something multi-threaded is still performing worse than the single-threaded v2.3 implementation. I should also note that the cython optimizations in the MODIS interpolation would be a new optimization that was never done in v2.3, and that the biggest improvement of that cython code is in memory usage more than in speed.
  4. Running all the products (bands + true + false) is about 30 seconds faster :sweat:

So we're not done, but this is much better.
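The worker-utilization point in item 3 can be illustrated with a small dask sketch (illustrative only, not Polar2Grid code):

```python
import dask.array as da

# Dask parallelism comes from independent tasks (chunks). A run producing
# only a few products yields few tasks, so with 4 workers most sit idle;
# many products yield many tasks that keep all workers busy.
few_tasks = da.zeros((100, 100), chunks=(100, 100))  # 1 chunk -> 1 task
many_tasks = da.zeros((100, 100), chunks=(25, 25))   # 16 chunks

print(few_tasks.npartitions)   # 1
print(many_tasks.npartitions)  # 16
```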

djhoese commented 2 years ago

With the newest 20220707-161841 tarball which includes GIL release fixes for EWA, the above modis china processing case for all bands + true + false finishes in 6 minutes with 8 workers. This is 3m45s faster than v2.3 doing two calls (one for bands + one for true/false).

@kathys still reports slower times than v2.3 when run separately on bumi. This is currently only with VIIRS testing. The VIIRS case has not been checked for basic optimizations yet (ex. optimal chunking in the reader).

Edit: Doing just bands took 2m26s. Note this is still using data files on /dev/shm on bumi.

Edit 2: True and false color by themselves take 4m26s. This is just a little slower than v2.3. But this is also the 500m-based composite recipe. Version 2.3 had a bug that was always using the 1km low resolution bands as input.

Edit 3: Running all products but pointing to /data instead of /dev/shm ran in 5m45s. That's 15 seconds faster than using /dev/shm :man_shrugging:

djhoese commented 2 years ago

As pointed out on Slack, specifying PYTROLL_CHUNK_SIZE=6400 before the P2G command reduces the execution time by about 50%. In @kathys's tests with 10 granules of VIIRS data, this chunk size showed:

v2.3:

true/false only: 4m5s
all regular bands: 10m4s

v3.0:

true/false only: 2m17s
all regular bands: 2m47s
all products together: 4m37s

This means smarter chunking and possibly per-reader chunk sizes are the next important step to ensure good performance.
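A per-invocation way to set the chunk size, as described above, can be sketched in Python (the flags and paths in the usage comment are placeholders mirroring earlier commands in this thread):

```python
import os
import subprocess

def run_with_chunk_size(chunk_size, args):
    """Run a command with PYTROLL_CHUNK_SIZE set only for that invocation,
    without exporting it into the surrounding environment."""
    env = dict(os.environ, PYTROLL_CHUNK_SIZE=str(chunk_size))
    return subprocess.run(args, env=env)

# Hypothetical usage:
#   run_with_chunk_size(6400, ["./polar2grid.sh", "-r", "viirs_sdr",
#                              "-w", "geotiff", "-f", "/path/to/sdrs"])
```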

djhoese commented 2 years ago

I don't remember what we said in our last meeting, but I remember saying that overall processing speeds are faster. It does still seem that true_color/false_color may be slower than it used to be, but there are many factors that could lead to that. The biggest one is likely (I think) that higher resolution bands are being used in the true_color/false_color composites than in v2.3.

Closing. We can reopen if needed or open a new one if it seems like a different issue that is causing the speed changes.