Refactor - Githubissues

aazuspan commented 1 year ago

This is a large refactor of the sampling, training, and prediction workflows. The goal is to move most of the reusable code out of notebooks and into Python modules to clean things up. In the process, we should be able to add some additional flexibility to make it easier to predict different LiDAR attributes, train with different models and band combos, etc.

Very much in progress!

TODO

[x] Split out sampling into a Python module.
[x] Refactor the sampling notebook.
[x] Build data loading API.
- ~I was planning to use tfio.IODataset.from_parquet to load PNG byte strings and parse them back into images, but that is currently blocked by https://github.com/tensorflow/io/issues/1575. I may need to consider other options. Other possibilities are 1) eager loading Parquet with pandas (slow and won't scale well), 2) going back to TFRecords (complex), 3) using Dask dataframes (more complex), 4) trying another format with TFIO like Feather (worse compression), or 5) using image chips (potentially slow).~ Currently using HDF5 via tfio.IODataset.from_hdf5. This maintains the advantage of lazy evaluation, avoids the corrupted image issue with Parquet, and has the benefit of being able to store n-dimensional array features so that we don't need to encode our NAIP footprints into PNGs before storing them.
[x] Refactor the modeling notebook for the new data API.
[x] Refactor the inference workflow.

aazuspan commented 1 year ago

@grovduck Just a heads up in case you wanted to follow along with progress, this is where I'll be focusing development for now.

I wrote up a quick contributing guide (mostly so I don't forget how I got Docker working) that might be helpful if you feel like running things. The sampling module and notebook should be relatively stable now, but everything else is still deeply in progress!

Feel free to engage at whatever level of interest/amount of free time you have! I'm always open to feedback on design choices, but just knowing someone else might peruse the code should motivate me to write things a little more cleanly out of shame 😉

grovduck commented 1 year ago

@aazuspan, definitely will want to follow along! I'll try to get Docker set up on my machine and see what you've done here so far. I'll try to provide some feedback here soon. Thanks for including me!

aazuspan commented 1 year ago

Awesome, let me know if you run into any issues with the setup. This is my first attempt at using Docker to manage a project, and I'm about 60% sure I'm doing it right...

grovduck commented 1 year ago

Hey @aazuspan, stepping through the installation of Docker (all new to me) and there are a couple of places where I think some additional detail in CONTRIBUTING.md might help:

On step 2 (installing Docker Desktop), I think it would be helpful for the user to know that Docker Desktop will go through an installation process (outside VSCode). VSCode provides a message that this is going to happen, but it takes a while before the actual installation started for me. Once installed, I had to log out of Windows and accept the subscription service agreement on restart. Most people will likely figure this out, but being explicit might help.
On step 3 (setting up WSL2 on Windows), it might be helpful to add a note about what the user should do if they are already using WSL, but Ubuntu is not their default distro. This was my situation - I have a Debian distro (for the GNN download site) that was set as default and I switched it using wsl --set-default Ubuntu. This is spelled out in the guide that you point to, so maybe not essential, but I had to think about this for a bit.
On step 3 (and actually in the Docker Desktop instructions), it wasn't immediately clear (to Docker newbs like me) where you were supposed to run the wsl commands, through a normal Windows command shell or somehow through the Docker Desktop application. Eventually, I figured out that it was the former and that the setup of WSL is outside of Docker Desktop.
On step 5 (reopening the project in the container), it may be helpful for the user to know that the process can take a little while to complete.
When checking the setup, it may freak users out to see that they are running as root. I assume this is OK, but might be useful to let users know
I'm pretty sure this last one is a corner case, but I actually have two GPUs on my machine, one of which is not compatible with CUDA, I believe. When I run assert tf.config.list_physical_devices('GPU'), I'll get a warning that one of my GPUs is being ignored. It looks like I can run export CUDA_VISIBLE_DEVICES=<list of acceptable GPU device numbers> at the linux prompt and it won't give me this warning
Even when I do filter the GPUs, I still get warning messages that "Your kernel may have been built without NUMA support" on first run. It looks like this may just be a warning that can be ignored, but I haven't dug too deep.

Please take all these suggestions with a grain of salt. There probably won't be a lot of users trying to do this and it may not be worth the effort. Most of these suggestions are extra "hand-holding" for the user.

aazuspan commented 1 year ago

Thanks for the code review and the notes on the Docker setup! I'll start incorporating those. As you found out, things are pretty broken at the moment beyond the sampling stage, so I'll keep you posted as new pieces are ready for review.

Also, let me know if you have any thoughts/ideas on the topic of storing training data. I mostly moved away from TFRecords because I find them very confusing to work with and wanted to avoid the hassle of parsing data into and out of them, and my understanding is that their performance benefits are negligible at the scale we're likely to be working (or even negative as they have some overhead). I doubt the tensorflow_io issue is going to be resolved any time soon though, so I'm guessing we're going to need an alternative, whether that's another supported format, a different method for reading in Parquet, just storing chips, or going back to TFRecord.

aazuspan commented 1 year ago

Also, a general question. You're using hatch here - I assume I should set that up within the container on Ubuntu? Is that something that can be automated as part of the Dockerfile? I've never worked with Docker files before, so newbie questions.

At this point I don't think there's any need to use a Hatch environment, since we're already getting our isolated environment and dependencies from Docker. I used it because I figured we might want to add some Hatch scripts later, at which point we could install it through the Dockerfile.

On step 3 (setting up WSL2 on Windows), it might be helpful to add a note about what the user should do if they are already using WSL, but Ubuntu is not their default distro. This was my situation - I have a Debian distro (for the GNN download site) that was set as default and I switched it using wsl --set-default Ubuntu.

I actually only specified Ubuntu because that's what I used... You didn't happen to try Docker with your Debian install, did you? I imagine that would work, but not sure. In any case, explicitly recommending Ubuntu for consistency is probably a good idea.

Even when I do filter the GPUs, I still get warning messages that "Your kernel may have been built without NUMA support" on first run. It looks like this may just be a warning that can be ignored, but I haven't dug too deep.

I get this too and came to the same conclusion, but also didn't dig too deep.

grovduck commented 1 year ago

Also, let me know if you have any thoughts/ideas on the topic of storing training data. I mostly moved away from TFRecords because I find them very confusing to work with and wanted to avoid the hassle of parsing data into and out of them, and my understanding is that their performance benefits are negligible at the scale we're likely to be working (or even negative as they have some overhead). I doubt the tensorflow_io issue is going to be resolved any time soon though, so I'm guessing we're going to need an alternative, whether that's another supported format, a different method for reading in Parquet, just storing chips, or going back to TFRecord.

You're way ahead of me on data formats - I barely know the TFRecord format and next to nothing about Parquet. The only possible advantage to TFRecords that I can think of is the tight integration with GEE, but I think you're moving away from that possibly as well. I thought the serialization of the arrays into PNG was pretty clever, so that would be great if that works, but I understand you may need to reevaluate based on the outstanding issue.

grovduck commented 1 year ago

You didn't happen to try Docker with your Debian install, did you?

I didn't try to set up with Debian, but I can do this if you'd like to test. I think just using Ubuntu is fine though.

aazuspan commented 1 year ago

@grovduck, I just pushed the updated data module and training notebook! I switched from Parquet to HDF5, which seems to be fully functional in tfio and has the added advantage of being able to store our NAIP arrays as-is without encoding, which saves a step when reading in datasets that should reduce training time a bit.

If you get a chance to run everything and train a model, I'm curious to hear if training time is much faster for you now. I'm still wrapping my head around Tensorflow datasets, but from the official examples I saw (e.g. this one), I don't believe there's a reason to use repeat(NUM_EPOCHS) on the dataset as we were previously doing, so I removed that. That change caused each epoch to train for 375 steps (24,000 training samples // 64 samples per batch) instead of 7500 steps (375 batches * 20 epochs), which obviously runs much faster. I never ran training with the original TFRecord training data, so maybe that repeat was necessary with that solution.

Note that you'll need to re-run the sampling notebook to store the training data in the new format.

The only possible advantage to TFRecords that I can think of is the tight integration with GEE

I didn't notice this until I started refactoring, but we weren't really taking advantage of the GEE support for TFRecords during sampling since we were just building the TFRecords client-side after grabbing the NAIP data from a collection of ee.Dictionary objects, so I don't think we lost any compatibility there.

We definitely may still want to use TFRecords for the mapping workflow, since GEE can chunk images into TFRecords for us, which might save some hassle over downloading GeoTIFFs and chunking ourselves (although I'm also curious to try xbatcher). If we ever wanted to operationalize the model and run inference in the cloud, I think you're right that there would be an advantage to using a TFRecord system for inference.

Also, just FYI, I experimented with the Planetary Computer idea that I floated and found it was MUCH slower than this GEE solution. Lesson learned!

I think just using Ubuntu is fine though.

Agreed! I added a quick disclaimer to the contributing guide that other distros may work, but are untested.

grovduck commented 1 year ago

If you get a chance to run everything and train a model, I'm curious to hear if training time is much faster for you now

Yes, I did test both the sampling and training notebooks after you had made the HDF change and training was indeed much faster for me (the cell took 1m30s to evaluate - I think previously it was at least 15 minutes?).

I'm still wrapping my head around Tensorflow datasets, but from the official examples I saw (e.g. this one), I don't believe there's a reason to use repeat(NUM_EPOCHS) on the dataset as we were previously doing, so I removed that.

This would definitely be a question for Adam. Based on his original comment, he thought it was "mysterious" as well, but I think he was referencing some tutorials that he had found. The only thing I've found is that repeat ensures that the model trains on the entire dataset for NUM_EPOCHS, but based on your explanation, it sounds like each epoch sees the full dataset with your change.

We definitely may still want to use TFRecords for the mapping workflow, since GEE can chunk images into TFRecords for us

Yes, this is really what I meant, as I realized Adam didn't use GEE for the initial sampling scripts, but only during the inference step. xbatcher looks pretty interesting as well (I really need to understand xarray better!).

Also, just FYI, I experimented with the Planetary Computer idea that I floated and found it was MUCH slower than this GEE solution. Lesson learned!

Was this part of #1 or some other experimentation that you did?

aazuspan commented 1 year ago

I think he was referencing some tutorials that he had found. The only thing I've found is that repeat ensures that the model trains on the entire dataset for NUM_EPOCHS, but based on your explanation, it sounds like each epoch sees the full dataset with your change.

@sibley Any insights you can share on the use of repeat before training with the dataset?

Was this part of https://github.com/lemma-osu/naip-cnn/issues/1 or some other experimentation that you did?

This was an earlier undocumented experiment. If I remember right, my Planetary Computer workflow took a few hours to pull the 5k chips that we can grab from EE in about a minute. There's probably a more efficient PC solution than what I came up with, but having to make a separate network request for each chip killed any speed advantage that came from being able to access the GeoTIFFs directly.

sibley commented 1 year ago

looks like a lot has gone on here! this is awesome. what if we did an AMA zoom call and you guys could point me to spots in the code where you have questions?

aazuspan commented 1 year ago

I finally figured out the repeat question I was confused about, but thanks for the offer, Adam! I haven't dug back into the mapping code yet and I'm expecting to get confused by the TFRecord parsing, so I might reach out again for that AMA.

@grovduck if you're curious, it looks like repeat used in combination with steps_per_epoch used to be a necessary workaround due to an old Tensorflow bug. I found this response that sheds some light on the issue. It's talking about using an infinite repeat, but the same logic would apply to a finite repeat:

...using repeat(None) for train_set means that the training set will be an infinitely repeating dataset. Since we don't training to last forever, we must specify the number of steps_per_epoch and the number of epochs when calling the fit() method.

A simpler way to do this nowadays is to avoid repeating the training set (don't call repeat() at all), then avoid using steps_per_epoch. Indeed Keras will train the model perfectly well: it will just go through the entire dataset at each epoch. The reason why I used the first approach in the book is that there used to be a bug in Keras in TF 2.0 which made it impossible to use the simpler approach: you just had to use the first approach.

We could keep the repeat and just specify steps_per_epoch when we fit, but it sounds like omitting both is now best practice.

grovduck commented 1 year ago

LGTM!

lemma-osu / naip-cnn

Refactor #2

TODO