aazuspan closed this pull request 1 year ago.
@grovduck If you have some time and want to try running through all the updated notebooks, that would be great, but if not, no worries! Fair warning that with the updated 150x150m image sizes, the exports take much longer and produce much larger files (the 2016 training CSV used here is ~7GB).
It will take some more experimentation to decide if the larger label size has a benefit outside of the increased amount of training data, but I think that regardless we'll want to use this updated workflow with 2D label data, whether we go back to just 1x1 labels at 30m or go all the way to 150x150 labels with 1m LiDAR data.
p.s. Sorry for the terribly messy commit history! Once again I accidentally included a lot of old commits that got squashed into `main` in this PR...
@aazuspan, I think I have a silly first question. When running `01_sampling.ipynb`, I was expecting that I wouldn't need to authenticate because of the fix you made in #10. But is this code forcing reauthentication?
```python
import ee
ee.Authenticate()
ee.Initialize()
```
Should `ee.Authenticate` pass silently if the local credentials are already in place? The rest of the notebook was a dream :smile:
Good question! `ee.Authenticate` does force re-authentication even if you already have credentials.
Those calls are mostly just left over from before #10, but I was also a little hesitant to remove them in case a user tried to run the notebooks without local credentials. Now that I think about it, though, I'm leaning towards removing them all and just making a note in the first notebook to run `ee.Authenticate` if necessary. What do you think? In the meantime I've just been commenting them out whenever I run those cells, which definitely isn't the right solution 😅
In the meantime I've just been commenting them out whenever I run those cells.
Heehee, me too!
I like your proposed solution. Get authentication out of the way at the beginning (assuming that a user will run the notebooks in order) and then just have `ee.Initialize()` calls. I guess the other option is a helper function used everywhere that tests for initialization and then authenticates if an exception is thrown? But I'm not sure it's worth the effort, especially if most users will already be authenticated.
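For what it's worth, the helper I'm imagining is just a thin wrapper like this (an untested sketch, not something that exists in the package):

```python
import ee

def initialize_earth_engine():
    """Initialize Earth Engine, prompting for authentication only when needed."""
    try:
        ee.Initialize()
    except Exception:
        # No valid local credentials; authenticate once, then retry.
        ee.Authenticate()
        ee.Initialize()
```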
The helper function is a cool idea, but I agree it's probably not worth the effort here. I removed the authenticate calls from all notebooks and added a note to the sampling notebook. I also think that trying to initialize without authenticating will give a helpful error, so it should be easy enough to get running on the off chance someone uses this who isn't already authenticated.
I think that regardless we'll want to use this updated workflow with 2D label data, whether we go back to just 1x1 labels at 30m or go all the way to 150x150 labels with 1m LiDAR data
I definitely agree with this. So much flexibility here.
@aazuspan, just noticed something with the output window of the predicted GeoTIFFs that may or may not be an issue. Presumably, when you're collecting the training data, you're using the `crsTransform` of the lidar data to set the window for the collection of the chips. When you export the NAIP data for inference, because it doesn't look like a `crsTransform` is explicitly set, it will use the top left corner(?) to set it. So the predicted raster will very likely have a different `crsTransform` than the lidar (this is the case with the 2016 sample data).
There is nothing at all wrong with this - the model should be equally applicable to any set of NAIP pixels - but I wonder if you want to offer the option of setting a `crsTransform` alongside the `CRS` such that predicted rasters will be snapped to either a custom window or the input lidar data. For example, if we were using these outputs in further modeling, it would be helpful to specify the `crsTransform` to align with our other covariates. This might be worth a separate issue or might just be a post-processing step.
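If it did end up as a post-processing step, I'm picturing something like this rough rasterio sketch to snap a prediction onto the lidar grid after the fact (file paths are placeholders, and none of this exists in the repo):

```python
import rasterio
from rasterio.warp import reproject, Resampling

# Placeholder paths: a lidar raster defining the target grid and a predicted GeoTIFF.
with rasterio.open("lidar_canopy_cover.tif") as ref, \
     rasterio.open("predicted_canopy_cover.tif") as src:
    # Copy the prediction's profile but swap in the lidar grid and CRS.
    profile = src.profile.copy()
    profile.update(crs=ref.crs, transform=ref.transform,
                   width=ref.width, height=ref.height)

    with rasterio.open("predicted_snapped.tif", "w", **profile) as dst:
        # Resample the prediction onto the lidar-aligned grid.
        reproject(
            source=rasterio.band(src, 1),
            destination=rasterio.band(dst, 1),
            src_transform=src.transform,
            src_crs=src.crs,
            dst_transform=ref.transform,
            dst_crs=ref.crs,
            resampling=Resampling.bilinear,
        )
```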
@grovduck This is a good point. I haven't put a lot of thought into the exporting process yet, and there will certainly be some big changes needed before this is ready for production.
With the `crsTransform` issue in particular, am I understanding right that this could be fixed just by setting the `crsTransform` when starting the export in `03_export_naip.ipynb`? That should store the transform in the `mixer.json`, which would then get applied during the inference step. Do you think there's something we can add to the package to make that easier, or do we just need to modify the notebook?
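For concreteness, something along these lines is what I'm picturing for the export; the collection filtering, transform values, and patch dimensions are all placeholders rather than the notebook's actual settings:

```python
import ee

ee.Initialize()

# Placeholder study area; the real export presumably uses the acquisition bounds.
region = ee.Geometry.Rectangle([-123.5, 44.0, -123.4, 44.1])

naip = (
    ee.ImageCollection("USDA/NAIP/DOQQ")
    .filterBounds(region)
    .filterDate("2016-01-01", "2016-12-31")
    .mosaic()
)

task = ee.batch.Export.image.toDrive(
    image=naip,
    description="naip_2016_inference",
    region=region,
    crs="EPSG:5070",
    # Affine transform [xScale, xShear, xTranslation, yShear, yScale, yTranslation]
    # copied from the lidar grid (these numbers are made up).
    crsTransform=[1, 0, -2271315, 0, -1, 2594715],
    fileFormat="TFRecord",
    # The TFRecord export also writes the mixer.json that the inference step reads.
    formatOptions={"patchDimensions": [256, 256], "compressed": True},
)
task.start()
```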
With the `crsTransform` issue in particular, am I understanding right that this could be fixed just by setting the `crsTransform` when starting the export in `03_export_naip.ipynb`?
I guess my thought would be this cascade:

1. If `CRS_TRANSFORM` is defined in `config.py`, use that preferentially (this is probably what I would do to ensure that all predictions lined up with other covariates in further modeling).
2. Otherwise (e.g. when exporting for a specific acquisition in `03_export.ipynb`), use the `crsTransform` of the lidar data that was used to fit the model. This ensures the same snap to the lidar data for prediction and comparison.
3. Otherwise, fall back on `BBOX` to set the transform.

If you can think of a helper function that could apply this logic (a rough sketch is below), that would be great to add to the package, but this is definitely not necessary and if I want it, I can put it in the notebook. I don't want to hang up this PR on this tiny issue.
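To make that concrete, here's the kind of helper I mean; the `config` attributes and the lidar argument are assumptions about how things are exposed, not existing package code:

```python
def resolve_crs_transform(config, lidar_image=None):
    """Choose the affine transform for prediction exports.

    1. Prefer an explicit CRS_TRANSFORM from config.py.
    2. Otherwise snap to the lidar acquisition used to fit the model.
    3. Otherwise return None and let CRS/BBOX define the export window.
    """
    transform = getattr(config, "CRS_TRANSFORM", None)
    if transform is not None:
        return transform

    if lidar_image is not None:
        # The client-side info for an ee.Projection includes the grid's transform.
        return lidar_image.projection().getInfo()["transform"]

    # No explicit transform and no lidar grid: fall back to BBOX-based export.
    return None
```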
Good thoughts on CRS transforms! In the interest of getting this merged, I opened #17 to track that separately. I'll respond to your comments there once I've spent a little more time thinking them through.
With the basic shuffling added in f5cbd08, it looks like everything here is resolved, so I'm going to go ahead and merge! I think we may want to rework the shuffling approach for cleanliness/performance in the future, but this should work for a first pass.
Looks great to me, @aazuspan. Sorry for the delay in responding.
This would close #5 by updating the sampling and training workflows to use 2D label data instead of single values. This adds flexibility by allowing us to work with arbitrary NAIP image sizes instead of being tied to a single 30m pixel. To accomplish this, there were some major changes made:

- An `Acquisition` class was added to handle the loading of LiDAR and NAIP images from Earth Engine. This also utilizes LCMS annual forest loss maps to mask any loss that occurred between LiDAR and NAIP acquisitions.
- The `sampling` module was updated to sample from Earth Engine `Point` geometries instead of x, y coordinates.
- Sampled data can no longer be retrieved directly with `getInfo` calls. Now that we're extracting exponentially more data at each sample, this isn't possible as even small requests lead to memory errors. Instead, we now have to export sampled results to Drive via CSV and reformat the CSVs into HDFs (see the sketch below).
- The `Dataset` class (now `NAIPDataset`) was refactored to work with 2D label data. It was also simplified such that the acquisition year is now part of the identifying dataset name, and the number of samples is no longer an identifier because it isn't a user-defined parameter when building the dataset.
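For reference, the CSV-to-HDF reformatting amounts to something like this minimal sketch (pandas-based, with placeholder file names, key, and chunk size; not necessarily the exact code used here):

```python
import pandas as pd

csv_path = "MAL2010_training_samples.csv"  # placeholder CSV exported to Drive
hdf_path = "MAL2010_training_samples.h5"

with pd.HDFStore(hdf_path, mode="w") as store:
    # Stream the multi-GB CSV in chunks so it never has to fit in memory at once.
    for chunk in pd.read_csv(csv_path, chunksize=100_000):
        store.append("samples", chunk, index=False)
```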
Preliminary results

This was far from a controlled experiment, but results look promising. The table below compares canopy cover results from the baseline `main` branch model and dataset, and with the addition of the new model and new data. Note this comparison was originally run between 2010 and 2016[^2] data, but has been updated to use the MAL2010 acquisition for everything.

The takeaway seems to be that the new model and dataset alone didn't improve performance, but the combination of the two did. The new dataset improved performance after accounting for the larger sample size[^1], suggesting that the increased dimensionality contributed to the improvement.
[^1]: The updated 2D dataset contains only 18,000 samples, but each sample contains 25 measurements, giving us ~15x as much data as the old dataset. The subsampled dataset contains only 1,200 samples with 25 measurements each, giving us 30,000 total LiDAR measurements.
[^2]: With the MAL2016 dataset and the proposed model and data structure, we got MSE 139 and R^2 0.78, suggesting that performance varies quite a bit between datasets.