pytroll / pygac

A Python package to read and calibrate NOAA and Metop AVHRR GAC and LAC data
https://pygac.readthedocs.org/
GNU General Public License v3.0

Change in masked scanlines #90

Closed: sfinkens closed this issue 3 years ago

sfinkens commented 3 years ago

The CLARA team noticed that the masking of scanlines in pygac has changed since an earlier version. I suspect this didn't happen on purpose. In the PR history I only found a refactoring (#72) that shouldn't have changed the behaviour.

@carloshorn @abhaydd IIUC you have already started discussing this. If that is indeed a bug, could you please continue your discussion here? Then we have a proper record of it.

Left: older version. Right: current version.

[Image: problematic_corrupt_lines_without_qualityflagging]

mraspaud commented 3 years ago

IMO, the right side looks better...

abhaydd commented 3 years ago

[Images pasted at 2020-11-25 15-37, 15-48 and 16-00]

abhaydd commented 3 years ago

Here are a few more examples I received from Karl-Göran (CM SAF). They are related to the following L1b files:

NSS.GHRR.NN.D08211.S1251.E1437.B1644445.GC
NSS.GHRR.NP.D11311.S1930.E2125.B1416365.GC
NSS.GHRR.NN.D15213.S2020.E2142.B5255253.WI
NSS.GHRR.NN.D10190.S1525.E1650.B2646060.WI

@carloshorn Would it be possible for you to take a look at these scenes? The differences in quality flags are puzzling.

carloshorn commented 3 years ago

Yes, I could investigate which PR introduced the change. Could you tell me which version/commit was used to create the "old" images?

ninahakansson commented 3 years ago

I think the difference between masking and not masking is not the real issue here. (That caused some problems too, but as long as the data are flagged in qual_flags this is OK!) Instead, the issue is that even in the case (left) where a lot of data are masked, there are still broken data present. So I think this is a new feature that CM SAF wants: can pygac find and flag additional lines as potentially containing broken data?
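
For illustration, here is a minimal sketch of how downstream code can use such per-line flags to mask scanlines (plain numpy; the array names and shapes are made up, not pygac's actual output):

    import numpy as np

    # Hypothetical inputs: channel data (scanlines x pixels) and one
    # quality flag value per scanline, nonzero meaning "problematic".
    channels = np.random.rand(100, 409).astype(np.float32)
    qual_flags = np.zeros(100, dtype=np.int32)
    qual_flags[[10, 11, 57]] = 1  # pretend these lines were flagged

    # Mask every pixel of a flagged scanline, keep the rest untouched.
    masked = np.where(qual_flags[:, np.newaxis] != 0, np.nan, channels)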

mraspaud commented 3 years ago

Here is a very simple formula I use in trollcast to check how corrupt a LAC scanline is, in case it helps:

    # Penalise the quality score for every jump of more than 200 counts.
    qual -= np.sum(abs(np.diff(line['image_data'][:, 4].astype(np.int16))) > 200)

It's just counting the number of unnatural jumps in pixel values within the scanline.
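
For reference, a self-contained sketch of the same idea with synthetic data (the 200-count threshold comes from the snippet above; everything else is made up for illustration):

    import numpy as np

    def count_unnatural_jumps(scanline, threshold=200):
        """Count pixel-to-pixel jumps larger than `threshold` counts."""
        # Signed type so the differences of raw counts cannot wrap around.
        diffs = np.diff(scanline.astype(np.int16))
        return int(np.sum(np.abs(diffs) > threshold))

    # Synthetic scanline: flat background with two corrupt pixels.
    line = np.full(409, 500, dtype=np.uint16)
    line[100], line[300] = 5000, 10
    print(count_unnatural_jumps(line))  # 4: each bad pixel makes two jumps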

carloshorn commented 3 years ago

@ninahakansson, anomaly detection is serious business. How much time it would take to implement depends on the level of sophistication. Furthermore, it would increase the processing time, so I would switch it off by default. Some simple checks, like the one proposed by @mraspaud, could be a good start. Another question that @mraspaud, @sfinkens and @abhaydd would need to discuss is whether anomaly detection is within the scope of pygac, or whether it should be handled by another piece of software.

ninahakansson commented 3 years ago

@carloshorn that is a good point. Especially if the EUMETSAT GAC FDR data is already processed; in that case, anomaly detection in pygac would not help CM SAF with their final processing.

mraspaud commented 3 years ago

Looking at the bigger picture, noise detection/removal is certainly useful for more data than just AVHRR GAC, so I wouldn't tie such an algorithm to pygac. I do think adding a new field flagging suspicious data to the FDR files would be a great addition, though. So we need the noise detection somewhere in between: it could live either in satpy (or a satpy plugin) or in pygac-fdr.
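
To make the idea concrete, a minimal sketch of such a flag field using xarray (the variable names and the jump score are invented, not an actual pygac-fdr schema):

    import numpy as np
    import xarray as xr

    # Hypothetical dataset with one channel (scanlines x pixels).
    ds = xr.Dataset(
        {"brightness_temperature": (("line", "pixel"), np.random.rand(100, 409))}
    )

    # Pretend per-line jump counts from a simple check like the one above.
    jumps = np.random.randint(0, 5, size=100)

    # Per-scanline flag marking lines that look suspicious.
    ds["suspicious_line"] = ("line", (jumps > 2).astype(np.uint8))
    ds["suspicious_line"].attrs["long_name"] = "scanline failed a simple jump-count check"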

kgkarl commented 3 years ago

> IMO, the right side looks better...

No doubt, but the problem is that there are still pixels with quality issues!

kgkarl commented 3 years ago

> Here are a few more examples I received from Karl-Göran (CM SAF). They are related to the following L1b files:
>
> NSS.GHRR.NN.D08211.S1251.E1437.B1644445.GC
> NSS.GHRR.NP.D11311.S1930.E2125.B1416365.GC
> NSS.GHRR.NN.D15213.S2020.E2142.B5255253.WI
> NSS.GHRR.NN.D10190.S1525.E1650.B2646060.WI
>
> @carloshorn Would it be possible for you to take a look at these scenes? The differences in quality flags are puzzling.

Just want to say that the last example in the list is probably quite OK. It is just an example of a large contrast in brightness temperatures that may actually appear naturally (a sharp temperature contrast between the ocean and the Galapagos Islands). This kind of very warm target can also be caused by corrupt pixels, and then they might be difficult to separate from true very warm targets.
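
As a toy illustration of that ambiguity (purely synthetic data): a genuine warm island and a single corrupt pixel can produce exactly the same number of large jumps, so a jump count alone cannot separate them.

    import numpy as np

    def count_unnatural_jumps(scanline, threshold=200):
        diffs = np.diff(scanline.astype(np.int16))
        return int(np.sum(np.abs(diffs) > threshold))

    background = np.full(409, 500, dtype=np.uint16)

    natural = background.copy()
    natural[200:220] = 900   # warm island: a real, sharp contrast

    corrupt = background.copy()
    corrupt[200] = 900       # single corrupt pixel

    print(count_unnatural_jumps(natural), count_unnatural_jumps(corrupt))  # 2 2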

kgkarl commented 3 years ago

I just want to give you some more background on why this issue was raised. We have a few algorithms used to create the CM SAF CLARA data record that are based on statistical methods. One such method is the probabilistic cloud detection method (denoted CMa-prob), based on Bayesian discrimination theory. Such a method needs to be trained with real data. This was done earlier for the CLARA-A2 data record (in principle based on data from the old pygac versions). Now, in preparation for CLARA-A3, the method has been trained with data from the latest pygac version. Since the masking of data has been removed (as the images show), the contribution of corrupt data to the statistics increased considerably in the latest training sessions. That is how this story started.

So the main problem is that problematic AVHRR data is not always accompanied by quality flags, which means that methods based on statistical training are vulnerable to such data. This vulnerability increased a lot with the introduction of the latest pygac version (without the masking of data which we had before).

I am definitely in favour of the solution you have presented for the latest pygac version (defining the FDR dataset), but we did not realize in time how seriously this impacted the statistical methods we use. We now need to bring back the previous masking for the CLARA-A3 processing (to not be too different from CLARA-A2, and to protect all the other partners also using level1c data for their CLARA-A3 products; we have no time now to adapt all partners' code for reading level1c). This issue is raised more for future improvements in the data handling. I am sure it can be improved and I look forward to pushing this work forward (even in the SAF CDOP-4 framework).

kgkarl commented 3 years ago

> Yes, I could investigate which PR introduced the change. Could you tell me which version/commit was used to create the "old" images?

The old version was pygac-1.2.0.dist-info (if I have understood it correctly).

carloshorn commented 3 years ago

> > Yes, I could investigate which PR introduced the change. Could you tell me which version/commit was used to create the "old" images?
>
> The old version was pygac-1.2.0.dist-info (if I have understood it correctly).

Thanks @kgkarl!

> Looking at the bigger picture, noise detection/removal is certainly useful for more data than just AVHRR GAC, so I wouldn't tie such an algorithm to pygac. I do think adding a new field flagging suspicious data to the FDR files would be a great addition, though. So we need the noise detection somewhere in between: it could live either in satpy (or a satpy plugin) or in pygac-fdr.

@mraspaud, this was also my intention. I would vote for a general noise-handling plugin in satpy. As a simple use case, think about someone who just wants a picture without noise or missing lines, not necessarily in a scientific context. It would be nice to have a noise removal and maybe also an inpainting library attached to satpy, which as a by-product would provide some level of noise detection. However, there might still be some more sophisticated, but at the same time more detector-specific, algorithms that would be well placed in pygac-fdr (if we have the time and budget to develop them during the FDR validation).
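
For what it's worth, a toy sketch of the inpainting idea in plain numpy (not a satpy API; the function and names are made up): fill flagged scanlines column by column by interpolating between the neighbouring good lines.

    import numpy as np

    def inpaint_lines(data, bad_lines):
        """Replace flagged scanlines by interpolating between good ones."""
        filled = np.asarray(data, dtype=np.float64).copy()
        lines = np.arange(filled.shape[0])
        good = ~np.isin(lines, bad_lines)
        for col in range(filled.shape[1]):
            filled[~good, col] = np.interp(lines[~good], lines[good], filled[good, col])
        return filled

    data = np.arange(20, dtype=np.float64).reshape(5, 4)
    data[2] = -999.0  # pretend line 2 is corrupt
    print(inpaint_lines(data, bad_lines=[2]))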

abhaydd commented 3 years ago

@kgkarl KG, thanks for chipping in and explaining the background. I also agree with @carloshorn and @mraspaud that how to apply quality control is in itself a big, separate issue with many layers involved. We would probably need a separate meeting and activity for that. How to handle solar contamination in Ch3b, Ch4 and Ch5 is another sword that has been hanging over our heads for a long time :)

carloshorn commented 3 years ago

While refactoring the creation of the corrupt mask, I introduced a bug. However, I am glad that we looked at it, because it shows that we should probably not just discard every flagged line (set it to NaN); there seems to be some usable data inside. If, for example, the geolocation of a line is missing, we can interpolate it. Especially in the context of noise reduction, we may want to keep all the information that we extract from the file and use masked arrays to highlight lines and pixels under consideration.
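
A minimal sketch of that masked-array idea with synthetic data (nothing pygac-specific): the raw values stay available for later repair, while statistics automatically skip the pixels under consideration.

    import numpy as np

    channels = np.random.rand(100, 409)
    suspect = np.zeros(100, dtype=bool)
    suspect[[10, 57]] = True  # pretend these lines are under consideration

    # Broadcast the per-line flag to a full pixel mask; the underlying
    # data is kept intact instead of being overwritten with NaN.
    masked = np.ma.masked_array(channels, mask=np.tile(suspect[:, None], (1, 409)))
    print(masked.mean())  # ignores the masked scanlines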

mraspaud commented 3 years ago

Closing this as it has been fixed and merged in #91