aazuspan closed this pull request 1 year ago.
@grovduck If you have some time and want to try running through all the updated notebooks, that would be great, but if not, no worries! Fair warning that with the updated 150x150m image sizes, the exports take much longer and produce much larger files (the 2016 training CSV used here is ~7GB).
It will take some more experimentation to decide if the larger label size has a benefit outside of the increased amount of training data, but I think that regardless we'll want to use this updated workflow with 2D label data, whether we go back to just 1x1 labels at 30m or go all the way to 150x150 labels with 1m LiDAR data.
p.s. Sorry for the terribly messy commit history! Once again I accidentally included a lot of old commits that got squashed into `main` in this PR...
@aazuspan, I think I have a silly first question. When running `01_sampling.ipynb`, I was expecting that I wouldn't need to authenticate because of the fix you made in #10. But is this code forcing reauthentication?
```python
import ee
ee.Authenticate()
ee.Initialize()
```
Should `ee.Authenticate` pass silently if the local credentials are already in place? The rest of the notebook was a dream :smile:
Good question! `ee.Authenticate` does force re-authentication even if you already have credentials.
Those calls are mostly just left over from before #10, but I was also a little hesitant to remove them in case a user tried to run the notebooks without local credentials. Now that I think about it, though, I'm leaning towards removing them all and just making a note in the first notebook to run `ee.Authenticate` if necessary. What do you think? In the meantime I've just been commenting them out whenever I run those cells, which definitely isn't the right solution 😅
In the meantime I've just been commenting them out whenever I run those cells.
Heehee, me too!
I like your proposed solution. Get authentication out of the way at the beginning (assuming that a user will run the notebooks in order) and then just have `ee.Initialize()` calls. I guess the other option is a helper function used everywhere that tests for initialization and then authenticates if an exception is thrown? But I'm not sure it's worth the effort, especially if most users will already be authenticated.
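For what it's worth, the helper I'm imagining is just a thin wrapper like this (an untested sketch, not something that exists in the package):

```python
import ee

def initialize_earth_engine():
    """Initialize Earth Engine, prompting for authentication only when needed."""
    try:
        ee.Initialize()
    except Exception:
        # No valid local credentials; authenticate once, then retry.
        ee.Authenticate()
        ee.Initialize()
```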
The helper function is a cool idea, but I agree it's probably not worth the effort here. I removed the authenticate calls from all notebooks and added a note to the sampling notebook. I also think that trying to initialize without authenticating will give a helpful error, so it should be easy enough to get running on the off chance someone uses this who isn't already authenticated.
I think that regardless we'll want to use this updated workflow with 2D label data, whether we go back to just 1x1 labels at 30m or go all the way to 150x150 labels with 1m LiDAR data
I definitely agree with this. So much flexibility here.
@aazuspan, just noticed something with the output window of the predicted GeoTIFFs that may or may not be an issue. Presumably, when you're collecting the training data, you're using the `crsTransform` of the lidar data to set the window for the collection of the chips. When you export the NAIP data for inference, because it doesn't look like a `crsTransform` is explicitly set, it will use the top left corner(?) to set it. So the predicted raster will very likely have a different `crsTransform` than the lidar (this is the case with the 2016 sample data).
There is nothing at all wrong with this - the model should be equally applicable to any set of NAIP pixels - but I wonder if you want to offer the option of setting a `crsTransform` alongside the `CRS` such that predicted rasters will be snapped to either a custom window or the input lidar data. For example, if we were using these outputs in further modeling, it would be helpful to specify the `crsTransform` to align with our other covariates. This might be worth a separate issue or might just be a post-processing step.
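If it did end up as a post-processing step, I'm picturing something like this rough rasterio sketch to snap a prediction onto the lidar grid after the fact (file paths are placeholders, and none of this exists in the repo):

```python
import rasterio
from rasterio.warp import reproject, Resampling

# Placeholder paths: a lidar raster defining the target grid and a predicted GeoTIFF.
with rasterio.open("lidar_canopy_cover.tif") as ref, \
     rasterio.open("predicted_canopy_cover.tif") as src:
    # Copy the prediction's profile but swap in the lidar grid and CRS.
    profile = src.profile.copy()
    profile.update(crs=ref.crs, transform=ref.transform,
                   width=ref.width, height=ref.height)

    with rasterio.open("predicted_snapped.tif", "w", **profile) as dst:
        # Resample the prediction onto the lidar-aligned grid.
        reproject(
            source=rasterio.band(src, 1),
            destination=rasterio.band(dst, 1),
            src_transform=src.transform,
            src_crs=src.crs,
            dst_transform=ref.transform,
            dst_crs=ref.crs,
            resampling=Resampling.bilinear,
        )
```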
@grovduck This is a good point. I haven't put a lot of thought into the exporting process yet, and there will certainly be some big changes needed before this is ready for production.
With the `crsTransform` issue in particular, am I understanding right that this could be fixed just by setting the `crsTransform` when starting the export in `03_export_naip.ipynb`? That should store the transform in the `mixer.json`, which would then get applied during the inference step. Do you think there's something we can add to the package to make that easier, or do we just need to modify the notebook?
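For concreteness, something along these lines is what I'm picturing for the export; the collection filtering, transform values, and patch dimensions are all placeholders rather than the notebook's actual settings:

```python
import ee

ee.Initialize()

# Placeholder study area; the real export presumably uses the acquisition bounds.
region = ee.Geometry.Rectangle([-123.5, 44.0, -123.4, 44.1])

naip = (
    ee.ImageCollection("USDA/NAIP/DOQQ")
    .filterBounds(region)
    .filterDate("2016-01-01", "2016-12-31")
    .mosaic()
)

task = ee.batch.Export.image.toDrive(
    image=naip,
    description="naip_2016_inference",
    region=region,
    crs="EPSG:5070",
    # Affine transform [xScale, xShear, xTranslation, yShear, yScale, yTranslation]
    # copied from the lidar grid (these numbers are made up).
    crsTransform=[1, 0, -2271315, 0, -1, 2594715],
    fileFormat="TFRecord",
    # The TFRecord export also writes the mixer.json that the inference step reads.
    formatOptions={"patchDimensions": [256, 256], "compressed": True},
)
task.start()
```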
With the `crsTransform` issue in particular, am I understanding right that this could be fixed just by setting the `crsTransform` when starting the export in `03_export_naip.ipynb`?
I guess my thought would be this cascade:

1. If `CRS_TRANSFORM` is defined in `config.py`, use that preferentially (this is probably what I would do to ensure that all predictions lined up with other covariates in further modeling).
2. Otherwise (e.g. when exporting for a specific acquisition in `03_export.ipynb`), use the `crsTransform` of the lidar data that was used to fit the model. This ensures the same snap to the lidar data for prediction and comparison.
3. Otherwise, fall back on `BBOX` to set the transform.

If you can think of a helper function that could apply this logic (a rough sketch is below), that would be great to add to the package, but this is definitely not necessary and if I want it, I can put it in the notebook. I don't want to hang up this PR on this tiny issue.
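To make that concrete, here's the kind of helper I mean; the `config` attributes and the lidar argument are assumptions about how things are exposed, not existing package code:

```python
def resolve_crs_transform(config, lidar_image=None):
    """Choose the affine transform for prediction exports.

    1. Prefer an explicit CRS_TRANSFORM from config.py.
    2. Otherwise snap to the lidar acquisition used to fit the model.
    3. Otherwise return None and let CRS/BBOX define the export window.
    """
    transform = getattr(config, "CRS_TRANSFORM", None)
    if transform is not None:
        return transform

    if lidar_image is not None:
        # The client-side info for an ee.Projection includes the grid's transform.
        return lidar_image.projection().getInfo()["transform"]

    # No explicit transform and no lidar grid: fall back to BBOX-based export.
    return None
```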
Good thoughts on CRS transforms! In the interest of getting this merged, I opened #17 to track that separately. I'll respond to your comments there once I've spent a little more time thinking them through.
With the basic shuffling added in f5cbd08, it looks like everything here is resolved, so I'm going to go ahead and merge! I think we may want to rework the shuffling approach for cleanliness/performance in the future, but this should work for a first pass.
Looks great to me, @aazuspan. Sorry for the delay in responding.
This would close #5 by updating the sampling and training workflows to use 2D label data instead of single values. This adds flexibility by allowing us to work with arbitrary NAIP image sizes instead of being tied to a single 30m pixel. To accomplish this, there were some major changes made:

- An `Acquisition` class was added to handle the loading of LiDAR and NAIP images from Earth Engine. This also utilizes LCMS annual forest loss maps to mask any loss that occurred between LiDAR and NAIP acquisitions.
- The `sampling` module was updated to sample from Earth Engine `Point` geometries instead of x, y coordinates.
- Sampled data can no longer be retrieved directly with `getInfo` calls. Now that we're extracting exponentially more data at each sample, this isn't possible as even small requests lead to memory errors. Instead, we now have to export sampled results to Drive via CSV and reformat the CSVs into HDFs (see the sketch below).
- The `Dataset` class (now `NAIPDataset`) was refactored to work with 2D label data. It was also simplified such that the acquisition year is now part of the identifying dataset name, and the number of samples is no longer an identifier because it isn't a user-defined parameter when building the dataset.
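For reference, the CSV-to-HDF reformatting amounts to something like this minimal sketch (pandas-based, with placeholder file names, key, and chunk size; not necessarily the exact code used here):

```python
import pandas as pd

csv_path = "MAL2010_training_samples.csv"  # placeholder CSV exported to Drive
hdf_path = "MAL2010_training_samples.h5"

with pd.HDFStore(hdf_path, mode="w") as store:
    # Stream the multi-GB CSV in chunks so it never has to fit in memory at once.
    for chunk in pd.read_csv(csv_path, chunksize=100_000):
        store.append("samples", chunk, index=False)
```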
Preliminary results

This was far from a controlled experiment, but results look promising. The table below compares canopy cover results from the baseline `main` branch model and dataset, and with the addition of the new model and new data. Note this comparison was originally run between 2010 and 2016[^2] data, but has been updated to use the MAL2010 acquisition for everything.

The takeaway seems to be that the new model and dataset alone didn't improve performance, but the combination of the two did. The new dataset improved performance after accounting for the larger sample size[^1], suggesting that the increased dimensionality contributed to the improvement.
[^1]: The updated 2D dataset contains only 18,000 samples, but each sample contains 25 measurements, giving us ~15x as much data as the old dataset. The subsampled dataset contains only 1,200 samples with 25 measurements each, giving us 30,000 total LiDAR measurements.
[^2]: With the MAL2016 dataset and the proposed model and data structure, we got MSE 139 and R^2 0.78, suggesting that performance varies quite a bit between datasets.