pytroll / trollflow2

Next generation Trollflow. Trollflow is for batch-processing satellite data using Satpy
https://trollflow2.readthedocs.org/
GNU General Public License v3.0

Skip products when data contain too many fill values #112

Closed: gerritholl closed this 3 years ago

gerritholl commented 3 years ago

Plugin to check the validity of channel data before continuing processing. This will help us when AVHRR switches between 3A and 3B, so we don't generate empty channels and products.
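A minimal sketch of the idea, not the exact implementation in this PR: a trollflow2-style plugin receiving the usual job dictionary, measuring the fraction of valid (finite) pixels per loaded dataset, and dropping datasets that are mostly fill values. The function name and the `min_valid_fraction` threshold are assumptions for illustration.

```python
# Illustrative sketch only; check_data_validity and min_valid_fraction
# are assumed names, not the actual API added by this PR.
import logging

import numpy as np

logger = logging.getLogger(__name__)


def check_data_validity(job, min_valid_fraction=0.1):
    """Drop datasets from the scene that are mostly fill values/NaN."""
    scene = job["scene"]
    for ds_id in list(scene.keys()):
        # For dask-backed data this triggers a computation of the dataset.
        valid_fraction = float(np.isfinite(scene[ds_id].data).mean())
        if valid_fraction < min_valid_fraction:
            logger.warning("Skipping %s: only %.1f%% of the data is valid",
                           ds_id, 100 * valid_fraction)
            del scene[ds_id]
```

Such a plugin would then be listed in the product list between the loading and saving steps.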

codecov[bot] commented 3 years ago

Codecov Report

Merging #112 (e8daef1) into main (874427e) will increase coverage by 0.46%. The diff coverage is 98.65%.


@@            Coverage Diff             @@
##             main     #112      +/-   ##
==========================================
+ Coverage   94.63%   95.10%   +0.46%     
==========================================
  Files           9        9              
  Lines        1977     2125     +148     
==========================================
+ Hits         1871     2021     +150     
+ Misses        106      104       -2     
Flag        Coverage Δ
unittests   95.10% <98.65%> (+0.46%) ↑

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
trollflow2/plugins/__init__.py        91.82% <97.26%> (+1.74%) ↑
trollflow2/tests/test_trollflow2.py   99.39% <100.00%> (+0.05%) ↑
trollflow2/launcher.py                89.24% <0.00%> (+0.03%) ↑
trollflow2/tests/test_launcher.py     98.66% <0.00%> (+0.04%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 874427e...e8daef1.

gerritholl commented 3 years ago

On hold, pending the question of the optimal approach here. As implemented at present, the data are calculated twice, which almost doubles the total time to generate products. That is not acceptable.

Alternatives:

The second alternative seems cleaner, but is also more work. The first alternative would also cover cases with excessive bad data beyond absent channels, but I don't know if that use case is needed.

gerritholl commented 3 years ago

I think I need a combination of both. Although https://github.com/pytroll/satpy/pull/1653 works now and could be used in a new or adapted trollflow2 plugin, there are remaining issues. First, the start and end times are valid only for the original (set of) files read; they are not updated after resampling, and updating them after resampling would be somewhat involved. Second, some data files (such as hrpt_noaa19_20210423_1411_62891.l1b) lie about the presence of 3A/3B: even with this satpy PR, the data are claimed to be present all the time (according to the 3a/3b bit), but the data contents are empty/fill values.

A solution could be a combined approach: first, check how much of the pass a certain channel is actually present for, using the newly updated start_time and end_time attributes; then, after or inside the save_datasets plugin, reject products with too much invalid data. A rough sketch of the first step follows below.
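A possible shape for that first step, assuming per-dataset start_time and end_time attributes as introduced by the satpy PR above; the helper name and return convention are illustrative, not part of this PR:

```python
# Illustrative helper, assuming each dataset carries its own
# start_time/end_time attributes (per the linked satpy PR).
def channel_presence_fraction(data_arr, scene_start, scene_end):
    """Estimate what fraction of the overpass a channel covers."""
    ch_start = data_arr.attrs["start_time"]
    ch_end = data_arr.attrs["end_time"]
    covered = (ch_end - ch_start).total_seconds()
    total = (scene_end - scene_start).total_seconds()
    return covered / total if total else 0.0
```

A plugin could then skip any product whose channels cover less than some configured fraction of the scene.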

gerritholl commented 3 years ago

Despite the passing unit tests, this is not yet functional. I don't understand how the unit tests pass, because in practice the new log message reveals "Found 0 unique start/end time pair(s) in scene". Needs work; next week.

gerritholl commented 3 years ago

I've given up on trying to use product-specific start_time and end_time, as it is too complicated and an insufficient solution. Currently, covers() can be done before loading any datasets. A product-specific covers() would need to be done after loading datasets, because it needs the start_time and end_time for each dataset. That means the plugin should be a separate function in addition to covers(). But the start_time and end_time are not updated when resampling, and some products may not have a start_time and end_time at all. And even if I could get it to work, it wouldn't remove the need for content filtering, because the data source isn't always truthful about which channel (3A or 3B) is active.

So, back to the content filtering solution: in check_valid() I'm now replacing the dataset stored in the product list by the result of its .persist() method, which should prevent double calculation. Nevertheless, using check_valid() is still much slower than not using it: for one example file (20210503135536_NOAA_19.hmf), producing 36 files takes 01:02 when skipping check_valid() and 02:35 when using it. I don't understand why.
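The persisting step described above could look roughly like this; a sketch only, with persist_and_check as a hypothetical helper name:

```python
# Hypothetical helper illustrating the persist-based fix: compute once,
# keep the computed data, and let saving reuse it.
import numpy as np


def persist_and_check(scene, ds_id, min_valid_fraction=0.1):
    """Compute a dataset once via dask and drop it if mostly fill values."""
    persisted = scene[ds_id].persist()  # triggers the computation once
    valid_fraction = float(np.isfinite(persisted.data).mean())
    if valid_fraction < min_valid_fraction:
        del scene[ds_id]          # skip all products based on this dataset
    else:
        scene[ds_id] = persisted  # later saving reuses the computed data
```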

gerritholl commented 3 years ago

With the latest commit, producing 36 files for 20210503135536_NOAA_19.hmf takes 01:21. Without check_valid() it takes 01:02; with the old check_valid() from earlier commits it took 02:35.

gerritholl commented 3 years ago

I don't know how to solve the CodeFactor and codecov/project issues. Any hints?

mraspaud commented 3 years ago

One solution might be to start having each plugin in a separate file

gerritholl commented 3 years ago

> One solution might be to start having each plugin in a separate file

Good idea (I don't like functional code in __init__.py anyway). Can we postpone that to another PR?
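For reference, the split could look something like this; a hypothetical layout, with __init__.py reduced to a thin re-export layer so existing imports keep working:

```python
# Hypothetical layout (not what the repo does at this point):
#   trollflow2/plugins/check_valid.py    -> defines check_valid(job)
#   trollflow2/plugins/save_datasets.py  -> defines save_datasets(job)
#
# trollflow2/plugins/__init__.py then only re-exports:
from trollflow2.plugins.check_valid import check_valid  # noqa: F401
from trollflow2.plugins.save_datasets import save_datasets  # noqa: F401
```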

gerritholl commented 3 years ago

After the last three commits the tests still seem to pass, but I'd like to let this run for a bit on a test system to make sure it's still doing what I expect.