ubarsc / pyshepseg

Python implementation of image segmentation algorithm of Shepherd et al (2019) Operational Large-Scale Segmentation of Imagery Based on Iterative Elimination. Remote Sensing 11(6).
https://www.pyshepseg.org
MIT License
10 stars 4 forks source link

Allow np.nan as imgNullVal #45

Closed eikeschuett closed 5 months ago

eikeschuett commented 1 year ago

I just found your implementation of the Shepherd algorithm and from my first tests I'm super happy that you already invested your time and effort for this flexible solution. However, I usually don't work with GeoTiffs but with netCDF files (and xarray), where np.nan is often used to indicate null values. If they are present, your algo crashed in the KMeans clustering with a ValueError: Input X contains NaN., even if imgNullVal = np.nan.

This is because np.nans are not filtered in fitSpectralClusters() (because np.nan != np.nan). One solution would be to add a condition, that if imgNullVal=np.nan, a filtering is done with xNonNull = xFull[~np.isnan(xFull)]. But I guess that this is not the only part of the code where an adjustment would be necessary.

Do you think such an addition would make sense?

neilflood commented 1 year ago

Thanks @eikeschuett

I think this does make sense. I will have a look, and see if there are any strong reasons not to do it. You are probably right that there will be other places which need to be changed, so I won't promise anything before I check through and do some testing, etc., but in my head, it sounds quite feasible.

Historically, the use of NaN used to be reserved for calculations which had no answer, and in the early days it incurred a large performance penalty, so was not used for common things like a null value. However, hardware now implements these exception handlers directly, so the performance is not usually a problem. Most of the common image file formats don't really support it, but as you say, netCDF and similar formats do, so we probably should support it within pyshepseg.

I will let you know how it looks.

neilflood commented 1 year ago

One question - are you using this on a float data type?

neilflood commented 1 year ago

Ping @eikeschuett @gillins @petescarth

After reviewing much of the code, it seems we need to deal with this quite independently from the null value question. I had not really thought much about float imagery as input for the segmentation, but I can't find any reason why that should not work. If we accept that people may wish to do this, then we must also accept that sometimes their images will contain NaN or Inf values, even if that is not explicitly the null value. We should deal with these directly, and exclude them from calculations. Any such pixels should be placed in the null segment, in the same way as pixels with the null value. Currently they would be dragged into the calculations and behave badly.

As @eikeschuett points out, we should also deal with the special case where the null value is NaN, but it is a wider problem than just that case.

I do vaguely recall that @petescarth was running this with float inputs ages ago - is that right? Were there any other problems?

Any thoughts you have are welcome, please speak up. I will start a pull request with my attempt to address this.

I note that currently all our doc strings say that the input image array is of integers, so I guess we should change that, too.

gillins commented 1 year ago

I think the problem that @petescarth has was calculating per segment stats on a float image. Our "collect the histogram" approach fell down in this case. We now refuse to process stats for float images.

Regarding using a float image as input to segmentation - yes we should probably cope. Happy to work through this with you when I'm back on deck.

neilflood commented 1 year ago

Thanks @gillins ! That explains - I had a feeling he had run into problems with something, but could not put my finger on it. That makes sense :-)

petescarth commented 1 year ago

That's right - for running the segmentation and for calculating statistics, I have always scaled any float inputs to integer.

https://github.com/ubarsc/pyshepseg/issues/19 was about handling no data in the per-segment calculations, but still for the integer case.

Floating point data as an input to the segmentation makes sense, but I can't quite see how the stats would work given the current speed histogram method.

neilflood commented 1 year ago

Yes, which means that one could calculate a segmentation on a float input, but then be unable to calculate per-segment statistics on the same input image. That might still be useful sometimes, but would also be quite annoying.

Hmmm.... let's wait for comment from @eikeschuett before we get too enthusiastic.

gillins commented 1 year ago

I've just realised that the "spatial" stats option doesn't need the integer restriction as I store all the pixel values for the segment (plus the coords). So this could be a way forward for float images (although I suspect there is currently a check copied from the non-spatial stats that stops this). Uses more memory though...

eikeschuett commented 1 year ago

Wow, thanks for your enthusiasm! And yes, I'm using pyshepseg on float data. To be honest, I didn't realized that only integer arrays are fully supported. In fact, my results on some Pleiades images look very promising, but I didn't tested everything in detail..

neilflood commented 1 year ago

Thanks @eikeschuett

After that discussion above, it does look like a full fix for these issues around using float data in pyshepseg will require a bit of work, and some significant re-factoring of the code. So, I am going to put this on hold for the moment, while we discuss:

I don't think we will be rushing into any of this, so it may take a while. In the meantime, could I ask you to just convert your input files to integer, with a suitable null value, and see how useful you find the whole package? I would be interested to hear more when you have had a chance to do some more serious experimentation, and get your feedback on whether there would be a large benefit to being able to use float inputs, or just a small saving in convenience.

Is that OK?

eikeschuett commented 1 year ago

@neilflood, that sound like a good idea. I'll convert my data to integers and will keep you posted how it works!