pytroll / trollflow2

Next generation Trollflow. Trollflow is for batch-processing satellite data using Satpy
https://trollflow2.readthedocs.org/
GNU General Public License v3.0

Skip products when data contain too many fill values #112

Closed: gerritholl closed this 3 years ago

gerritholl commented 3 years ago

Plugin to check the validity of channel data before continuing processing. This will help us when AVHRR switches between 3A and 3B, so we don't generate empty channels and products.
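A minimal sketch of the idea, not the exact implementation in this PR: a trollflow2-style plugin receiving the usual job dictionary, measuring the fraction of valid (finite) pixels per loaded dataset, and dropping datasets that are mostly fill values. The function name and the `min_valid_fraction` threshold are assumptions for illustration.

```python
# Illustrative sketch only; check_data_validity and min_valid_fraction
# are assumed names, not the actual API added by this PR.
import logging

import numpy as np

logger = logging.getLogger(__name__)


def check_data_validity(job, min_valid_fraction=0.1):
    """Drop datasets from the scene that are mostly fill values/NaN."""
    scene = job["scene"]
    for ds_id in list(scene.keys()):
        # For dask-backed data this triggers a computation of the dataset.
        valid_fraction = float(np.isfinite(scene[ds_id].data).mean())
        if valid_fraction < min_valid_fraction:
            logger.warning("Skipping %s: only %.1f%% of the data is valid",
                           ds_id, 100 * valid_fraction)
            del scene[ds_id]
```

Such a plugin would then be listed in the product list between the loading and saving steps.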

codecov[bot] commented 3 years ago

Codecov Report

Merging #112 (e8daef1) into main (874427e) will increase coverage by 0.46%. The diff coverage is 98.65%.


@@            Coverage Diff             @@
##             main     #112      +/-   ##
==========================================
+ Coverage   94.63%   95.10%   +0.46%     
==========================================
  Files           9        9              
  Lines        1977     2125     +148     
==========================================
+ Hits         1871     2021     +150     
+ Misses        106      104       -2     
Flag        Coverage Δ
unittests   95.10% <98.65%> (+0.46%) ↑

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
trollflow2/plugins/__init__.py        91.82% <97.26%> (+1.74%) ↑
trollflow2/tests/test_trollflow2.py   99.39% <100.00%> (+0.05%) ↑
trollflow2/launcher.py                89.24% <0.00%> (+0.03%) ↑
trollflow2/tests/test_launcher.py     98.66% <0.00%> (+0.04%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 874427e...e8daef1.

gerritholl commented 3 years ago

On hold, pending the question of the optimal approach here. As implemented at present, the data are calculated twice, which almost doubles the total time to generate products. That is not acceptable.

Alternatives:

The second alternative seems cleaner, but is also more work. The first alternative would also cover cases with excessive bad data beyond absent channels, but I don't know if that use case is needed.

gerritholl commented 3 years ago

I think I need a combination of both. Although https://github.com/pytroll/satpy/pull/1653 works now and could be used in a new or adapted trollflow2 plugin, there are remaining issues. First, the start and end times are valid only for the original (set of) files read; they are not updated after resampling, and updating them after resampling would be somewhat involved. Second, some data files (such as hrpt_noaa19_20210423_1411_62891.l1b) lie about the presence of 3A/3B: even with this satpy PR, the data are claimed to be present all the time (according to the 3a/3b bit), but the data contents are empty/fill values.

A solution could be a combined approach: first, check how much of the pass a certain channel is actually present for, using the newly updated start_time and end_time attributes; then, after or inside the save_datasets plugin, reject products with too much invalid data. A rough sketch of the first step follows below.
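A possible shape for that first step, assuming per-dataset start_time and end_time attributes as introduced by the satpy PR above; the helper name and return convention are illustrative, not part of this PR:

```python
# Illustrative helper, assuming each dataset carries its own
# start_time/end_time attributes (per the linked satpy PR).
def channel_presence_fraction(data_arr, scene_start, scene_end):
    """Estimate what fraction of the overpass a channel covers."""
    ch_start = data_arr.attrs["start_time"]
    ch_end = data_arr.attrs["end_time"]
    covered = (ch_end - ch_start).total_seconds()
    total = (scene_end - scene_start).total_seconds()
    return covered / total if total else 0.0
```

A plugin could then skip any product whose channels cover less than some configured fraction of the scene.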

gerritholl commented 3 years ago

Despite the passing unit tests, this is not yet functional. I don't understand how the unit tests pass, because in practice the new log message reveals "Found 0 unique start/end time pair(s) in scene". Needs work; next week.

gerritholl commented 3 years ago

I've given up on trying to use product-specific start_time and end_time, as it is too complicated and an insufficient solution. Currently, covers() can be done before loading any datasets. A product-specific covers() would need to be done after loading datasets, because it needs the start_time and end_time for each dataset. That means the plugin should be a separate function in addition to covers(). But the start_time and end_time are not updated when resampling, and some products may not have a start_time and end_time at all. And even if I could get it to work, it wouldn't remove the need for content filtering, because the data source isn't always truthful about which channel (3A or 3B) is active.

So, back to the content filtering solution: in check_valid() I'm now replacing the dataset stored in the product list by the result of its .persist() method, which should prevent double calculation. Nevertheless, using check_valid() is still much slower than not using it: for one example file (20210503135536_NOAA_19.hmf), producing 36 files takes 01:02 when skipping check_valid() and 02:35 when using it. I don't understand why.
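The persisting step described above could look roughly like this; a sketch only, with persist_and_check as a hypothetical helper name:

```python
# Hypothetical helper illustrating the persist-based fix: compute once,
# keep the computed data, and let saving reuse it.
import numpy as np


def persist_and_check(scene, ds_id, min_valid_fraction=0.1):
    """Compute a dataset once via dask and drop it if mostly fill values."""
    persisted = scene[ds_id].persist()  # triggers the computation once
    valid_fraction = float(np.isfinite(persisted.data).mean())
    if valid_fraction < min_valid_fraction:
        del scene[ds_id]          # skip all products based on this dataset
    else:
        scene[ds_id] = persisted  # later saving reuses the computed data
```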

gerritholl commented 3 years ago

With the latest commit, producing 36 files for 20210503135536_NOAA_19.hmf takes 01:21. Without check_valid() it takes 01:02; with the old check_valid() from earlier commits it took 02:35.

gerritholl commented 3 years ago

I don't know how to solve the CodeFactor and codecov/project issues. Any hints?

mraspaud commented 3 years ago

One solution might be to start having each plugin in a separate file

gerritholl commented 3 years ago

> One solution might be to start having each plugin in a separate file

Good idea (I don't like functional code in __init__.py anyway). Can we postpone that to another PR?
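For reference, the split could look something like this; a hypothetical layout, with __init__.py reduced to a thin re-export layer so existing imports keep working:

```python
# Hypothetical layout (not what the repo does at this point):
#   trollflow2/plugins/check_valid.py    -> defines check_valid(job)
#   trollflow2/plugins/save_datasets.py  -> defines save_datasets(job)
#
# trollflow2/plugins/__init__.py then only re-exports:
from trollflow2.plugins.check_valid import check_valid  # noqa: F401
from trollflow2.plugins.save_datasets import save_datasets  # noqa: F401
```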

gerritholl commented 3 years ago

After the last three commits the tests still seem to pass, but I'd like to let this run for a bit on a test system to make sure it's still doing what I expect.