opendatacube / odc-stats

Statistician is a framework of tools for generating statistical summaries of large collections of EO data managed in an ODC instance.
Apache License 2.0
9 stars 4 forks source link

Updated gm plugin to make it faster to run for S2 #90

Closed fangfy closed 1 year ago

fangfy commented 1 year ago

Changes made to remove hard-coded contiguity check and allow multiple cloud classes to be masked in the same data read.

This is still slower (~50% longer) and requires more memory to run than the version from before Jun 2022. Suggestions appreciated to improve the efficiency.

Tried to allow the case when cloud_filters is provided in the previous format but not sure if this will pass all tests.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 91.66% and project coverage change: +0.01 :tada:

Comparison is base (0826115) 75.57% compared to head (8a68a43) 75.59%.

:exclamation: Current head 8a68a43 differs from pull request most recent head f915e73. Consider uploading reports for the commit f915e73 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #90 +/- ## =========================================== + Coverage 75.57% 75.59% +0.01% =========================================== Files 33 33 Lines 3169 3175 +6 =========================================== + Hits 2395 2400 +5 - Misses 774 775 +1 ``` | [Impacted Files](https://codecov.io/gh/opendatacube/odc-stats/pull/90?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opendatacube) | Coverage Δ | | |---|---|---| | [odc/stats/plugins/gm.py](https://codecov.io/gh/opendatacube/odc-stats/pull/90?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opendatacube#diff-b2RjL3N0YXRzL3BsdWdpbnMvZ20ucHk=) | `97.87% <91.66%> (-1.00%)` | :arrow_down: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opendatacube). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opendatacube)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

emmaai commented 1 year ago

The acceptable format of cloud_filter is here https://github.com/GeoscienceAustralia/dea-config/blob/a0736288da43f8e1849a5ba09280e2c17a7068ab/dev/services/odc-stats/geomedian/ga_ls8c_nbart_gm_cyear_3.yaml#L3-L7

emmaai commented 1 year ago

Had a look, these lines will drag slow the processing https://github.com/opendatacube/odc-stats/blob/fe5f390013b0c034ec90d80d961a47c56492433c/odc/stats/plugins/gm.py#L85-L97

It's to accommodate the need to dilate cloud and shadow with different pixels. For LS, the expense is necessary. For sentinel 2 I can see 4 of them got the same pixels. So I guess the change made in this pr was meant to achieve the previous behaviour. It might make dilation faster by reducing the graph earlier, but it wouldn't make memory usage less, neither would it speed up the loading.

Re: memory In the code prior 06-2022, the mask band was loaded twice, once for masking, once for dilation. So the memory was released earlier than gathering and computing everything in one big graph (which is what happens with the current code).

Re: processing time I got some guesses, several things could be changed see if it helps:

fangfy commented 1 year ago

Thanks @emmaai @Kirill888. To clarify, without this PR, the code is several times slower to run and uses even more momery. Main culprit for excessive memory usage is the unnecessary contiguity check. Doing the cloud masking in one iteration instead of many definitely speed up the code significantly. I understand the logic for wanting to do different dilation for different flags, hence happy to keep the option to provide multiple flag and filtering sets and wanted to make sure the landsat config still works.

I'll try keep_good_only(..., invert=True).

fangfy commented 1 year ago

I don't think I can make the code to run faster at this point, so I'm merging the PR.