Re-run canopy cover - Githubissues

max-zilla commented 5 years ago

Discussions with @ZongyangLi and myself yielded several updates to canopy cover:

add thresholding so images below quality threshold are not processed (avoid bad values e.g. May 7)
on fieldmosaic step, set "NoData" value separate from the 0 that denotes soil in soil masked images
rerun RGB images for 05/07 and RGB_mask
regenerate fieldmosaic using the NoData fix to generate a % of the visible area for partial plot scans
resubmit canopycover to bety with fixed values

max-zilla commented 5 years ago

check if geotiff supports NoData or null, if not can we encode -99 as standard NoData value

max-zilla commented 5 years ago

alpha band for opacity, or a synthetic band 0/1 representing soil, categorized band with multiple settings (NoData, Soil, Mask)

Can we assign a NoData value to VRT before translating to geotiff? it's possible that the source file not having NoData is resulting in (0,0,0)

dlebauer commented 5 years ago

Add documentation, tests, if an OGC standard for encoding missing data

max-zilla commented 5 years ago

-a_nodata NoData works for gdal_translate.

max-zilla commented 5 years ago

After lots of different experimentation, gdal_translate seems to conflate NoData and 0,0,0 pixel values during the VRT -> TIF conversion process regardless of VRT NoData settings given to gdalbuildvrt command, or parameters given to gdal_translate.

I set that aside in order to get fundamental process working, and am using -add_alpha flag in gdalbuildvrt to make the fullfield mask an RGBA image instead of RGB, with alpha=255 where the photos exist in the image and alpha=0 where no data exists (between rows) leaving the 0s for soil removed from photos intact. Just modified and tested cc algorithm on a small fieldmosaic of 5 images:

...got a cc value of 98.84% for the whole image. Deploying test on actual fullfield date next, then can rerun all cc data over weekend if it looks correct.

max-zilla commented 5 years ago

Currently still running but spot checks are looking good: range 52 (top) and 51 (bottom), columns 9-14

          9 10 11 12 13 14
    ------------------
ROW 52 - 29 36 91 89 75 29
ROW 51 - 87 87 90 84 86 87

These percentages are much closer to what one would expect. I've applied a NoData maximum of 75% (larger than it was before) to push partialplots scans through the pipe, so things like Column 16 will be omitted in those cases:

Expect it to finish Friday or Saturday.

max-zilla commented 5 years ago

Some more QA tidbits...

on 2018-07-01, 3 of the 5 scans run were full field. The same scan with 'shade' in the name was run twice, and a third scan with 'sun' in the name.
```
shade  average CC, all plots - 86%
shade2 average CC, all plots - 84%
sun    average CC, all plots - 97%
```
The sunlit scans are around ~10% higher consistently: The two shade scans were fairly consistent on average, but a small handful of plots (19/766) had differences above 10% for the shade scans. This is likely attributable to the sensitivity in our rgb_mask algorithm:

Shade2 Mask - 57.8% Screen Shot 2019-05-23 at 10 46 18 AM

Shade1 Mask - 81% , this was also a 2-pass partialplot scan instead of 1-pass (more coverage) Screen Shot 2019-05-23 at 11 00 08 AM

Sun Mask - 85%, nice and bright, more pixels retained here. Screen Shot 2019-05-23 at 10 46 37 AM

I would argue that this doesn't merit further delays for more reprocessing, but it'll be important for data consumers to understand this kind of phenomena when we have multiple differing CC values per-day.

Maybe simplest suggestion is just use the maximum observation for a given day, I doubt over-estimation will be a common problem except in rare cases where e.g. a reflectance test panel or something is on the dirt field and reads as > 0 canopy cover.

dlebauer commented 5 years ago

We shouldn't use a 'max' per day as a workaround for an algorithm that doesn't function as expected.

The best way to fix the problem is to fix the algorithm. But that may take a while to fix.

Otherwise, if the data are known to be in error, e.g. if the algorithm can't handle sunlit scans, then we shouldn't include that data in the database.

In the end, If we have three measurements from a day then having a single that under-estimates by 10% is a small issue.

max-zilla commented 5 years ago

That sounds good. To be clear, there are 2 shady scans and 1 sunny - the sunny seems to correctly report the higher value, and the 2 shady scans seem to be under-estimating. Adjusting the RGB mask thresholds could address this case, but it could have other repercussions. Not sure if I'd go so far as to call it an error.

dlebauer commented 5 years ago

sorry I got that backward. If it isn't as far as an error, I think adding this caveat to the documentation (README) under known limitations would be okay.

dlebauer commented 5 years ago

@ZongyangLi and @abby621 can these exceptions be added as test cases to the extractor?

max-zilla commented 5 years ago

I started uploading the CSVs to bety and noticed a small number of files from May were being omitted from the field mosaics so I paused the upload. Closer examination revealed the omitted mask images had a different TIF header than the majority:

Band 1 Block=2472x1 Type=Float32, ColorInterp=Gray
Band 2 Block=2472x1 Type=Float32, ColorInterp=Undefined
Band 3 Block=2472x1 Type=Float32, ColorInterp=Undefined

The data type was Float32 and the RGB color bands aren't properly indicated (the data itself is fine). But these headers meant GDAL rejected them from the VRT creation because they differed from the other expected header data:

Band 1 Block=2472x1 Type=Byte, ColorInterp=Red
Band 2 Block=2472x1 Type=Byte, ColorInterp=Green
Band 3 Block=2472x1 Type=Byte, ColorInterp=Blue

I'm not sure why the headers are different - perhaps the small number of May files were generated with an older version of the extractor and didn't get re-run properly. The good news is that the fix to data type and RGB header is a simple GDAL command:

gdal_translate -ot Byte -colorinterp red,green,blue source_file out_file
rm source_file
mv out_file source_file

This forces the output file to have properly registered RGB channels and data type.

I'm running a small script to correct these, but it looks like the issue doesn't occur later so I will proceed to upload the remaining CSVs in the meantime. I don't anticipate this will impact the results being sent to bety, we just might get some more plots from early May scans once done.

Should be able to close this issue then, and upload a few of the test images I've been using for future checks.

max-zilla commented 5 years ago

created #590 to follow on QA process for this.

terraref / computing-pipeline

Re-run canopy cover #572