microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

Jupyter Notebook tutorials #38

Closed adamjstewart closed 3 years ago

adamjstewart commented 3 years ago

We need to figure out how to render Jupyter Notebooks in our documentation so that we can provide easy-to-use tutorials for new users. This should work similarly to https://pytorch.org/tutorials/.

Ideally I would like to be able to test these tutorials so that they stay up-to-date.

adamjstewart commented 3 years ago

Here is another example of how to do this: https://github.com/PyTorchLightning/lightning-tutorials

adamjstewart commented 3 years ago

Started looking into this. You can directly render a notebook using nbsphinx. pandoc is required if you have any markdown in your notebook. This is what Lightning does for their tutorials.
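
For reference, a minimal sketch of the Sphinx configuration this implies. The option names `nbsphinx_execute` and the checkpoint entry in `exclude_patterns` come from the nbsphinx documentation; the file location and the chosen values are assumptions, not TorchGeo's actual `conf.py`.

```python
# docs/conf.py (hypothetical location): minimal Sphinx settings for nbsphinx
extensions = [
    "nbsphinx",  # renders .ipynb files that are listed in a toctree
]

# Render the outputs stored in each notebook instead of re-running it during
# the docs build (an assumption; "auto" and "always" are the other values).
nbsphinx_execute = "never"

# Keep Jupyter checkpoint copies out of the build.
exclude_patterns = ["_build", "**.ipynb_checkpoints"]
```

With `nbsphinx_execute = "never"`, the docs build stays fast and execution can be left to a separate test job.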

However, PyTorch does something completely different. They instead store the file as a .py file and encode the rst in comments. I'm guessing this makes it easier to test? Not sure how this gets automatically converted to a notebook when you open in Google Colab/MS Learn.
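
The PyTorch tutorials are built with Sphinx-Gallery, which generates a notebook from each script. The sketch below is not Sphinx-Gallery's code; it only illustrates the idea with the standard library. It assumes text blocks start with a `# %%` line and copies the comment text verbatim into a Markdown cell (a real converter would also translate the rST). The function name and sample script are hypothetical.

```python
import json


def script_to_notebook(source: str) -> dict:
    """Turn a script with '# %%' comment blocks into a notebook-shaped dict."""
    cells = []
    kind = "code"  # type of the cell currently being collected
    buf = []       # lines of the cell currently being collected

    def flush():
        text = "\n".join(buf).strip("\n")
        if text:
            cell = {"cell_type": kind, "metadata": {}, "source": text}
            if kind == "code":
                cell.update(execution_count=None, outputs=[])
            cells.append(cell)
        buf.clear()

    for line in source.splitlines():
        if line.startswith("# %%"):
            flush()                      # close whatever came before
            kind = "markdown"
        elif kind == "markdown" and line.startswith("#"):
            # Strip the comment marker and keep the text.
            buf.append(line[2:] if line.startswith("# ") else line[1:])
        else:
            if kind == "markdown":       # first non-comment line ends the text
                flush()
                kind = "code"
            buf.append(line)
    flush()
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


SCRIPT = """\
# %%
# Load data
# =========
import math
x = math.sqrt(16)

# %%
# Print it
print(x)
"""

notebook = script_to_notebook(SCRIPT)
notebook_json = json.dumps(notebook, indent=1)  # what would be saved as .ipynb
```

Keeping the tutorial as a plain `.py` file means it can be linted, diffed, and run like any other script, which may be part of why that layout is easier to test.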

TomAugspurger commented 3 years ago

Another option is nbsphinx (https://nbsphinx.readthedocs.io/en/0.8.7/). That's what's used in dask-examples (https://github.com/dask/dask-examples), which is rendered at https://examples.dask.org/.

adamjstewart commented 3 years ago

On the thread of running and testing notebooks to make sure they remain up-to-date, nbmake seems like a good way to integrate things with pytest: https://semaphoreci.com/blog/test-jupyter-notebooks-with-pytest-and-nbmake

However, that requires a specific conda environment to be active, which I don't want. Also, we'll need all dependencies installed and have data available. Some of these training loops could be very time-intensive to run.
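
With nbmake, the check is run as `pytest --nbmake` on the notebook paths. To make explicit what such a test verifies, here is a simplified stand-in written with the standard library only. It is an assumption-laden sketch, not nbmake's implementation: nbmake runs notebooks in a Jupyter kernel, whereas this just executes code cells in order in one namespace, so notebook magics and `!` shell commands would not work. The helper name and the sample notebook are hypothetical.

```python
def run_notebook_cells(nb: dict) -> dict:
    """Execute the code cells of a notebook dict in order; fail on any error."""
    namespace = {}
    for index, cell in enumerate(nb["cells"]):
        if cell["cell_type"] != "code":
            continue  # Markdown and raw cells are not executed
        source = cell["source"]
        if isinstance(source, list):  # the format allows a list of lines
            source = "".join(source)
        try:
            exec(compile(source, f"<cell {index}>", "exec"), namespace)
        except Exception as err:
            raise AssertionError(f"cell {index} failed: {err!r}") from err
    return namespace


# A tiny in-memory notebook; a real test would json.load() an .ipynb file.
example = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "code", "source": "x = 2"},
        {"cell_type": "code", "source": ["y = x ** 3\n", "assert y == 8"]},
    ]
}
namespace = run_notebook_cells(example)
```

Because every cell really runs, the cost concerns above still apply: dependencies must be installed, data must be reachable, and long training loops make the test slow.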

adamjstewart commented 3 years ago

Okay, here's what I've decided. We'll use nbsphinx to render the tutorial notebooks and nbmake to test them. Tests will be split into:

This will allow us to iterate quickly on PRs without inundating CI but still make sure that the entire stack including data download and model training works as expected before each release. We'll move testing of setup.py and train.py to the integration/functional tests, which will greatly speed those up as well.

adamjstewart commented 3 years ago

Another possibility instead of downloading the data ourselves is to use existing datasets in the cloud. I don't think Google Colab has access to any satellite imagery, and the Planetary Computer is not yet available to the general public. Are there any other cloud services that could work?

Geethen commented 2 years ago

Google Earth Engine. A few ML datasets are available there as of now: BigEarthNet and LandCoverNet.

I would be keen on assisting with this at some point.

This also gave me an idea to contribute more datasets to the community catalog.

adamjstewart commented 2 years ago

Does GEE support running Jupyter notebooks? I've only ever used JavaScript in their code editor. It's hard to make any assumptions about data availability since the notebook needs to run on Colab, PC, and CI.

Geethen commented 2 years ago

Yes, it does via the GEE Python API.

Some drawbacks:

  1. The interactive leafmap/folium map will not stay alive, so you would have to opt for static images (which will need to be downloaded).
  2. Perhaps more seriously, the user would need to download the data to their local drive (or to Google Drive or Google Cloud Storage). geedim has made this easier for image data (it is the workflow I currently use), but I do not know of a case where the download can be avoided. I wonder whether streaming the data in batches from GEE would be fast enough, or how much delay that would introduce.

adamjstewart commented 2 years ago

Okay, so this would be no different than our current approach of downloading data from Planetary Computer. Just another source of data.

Geethen commented 2 years ago

My apologies. I think it is going to be the case on all platforms for the foreseeable future (until GEE directly supports NNs, which is likely not any time soon).

Side note: in the geedim package, the author used an approach based on rasterio to write image patches in chunks. That may be useful for inference.
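
To illustrate the chunking idea without claiming anything about geedim's internals, here is a small sketch that splits a raster into tile windows. The function name and tile size are hypothetical. The tuples use the order `(col_off, row_off, width, height)`, which matches the arguments of `rasterio.windows.Window`, so each one could be passed to a dataset's `read` or `write` through its `window=` argument.

```python
def iter_windows(height: int, width: int, tile: int):
    """Yield (col_off, row_off, width, height) tiles covering the whole raster.

    Tiles on the right and bottom edges are clipped to the raster bounds.
    """
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            yield (
                col_off,
                row_off,
                min(tile, width - col_off),
                min(tile, height - row_off),
            )


# A 5-row by 4-column raster split into tiles of at most 3 x 3 pixels.
windows = list(iter_windows(height=5, width=4, tile=3))

# Every pixel is covered exactly once, so the tile areas add up to 5 * 4.
covered = sum(w * h for _, _, w, h in windows)
```

Processing one window at a time keeps memory use bounded by the tile size, which is what makes the pattern attractive for inference over large scenes.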

adamjstewart commented 2 years ago

Even if GEE directly supported NNs, they wouldn't support TorchGeo, so they aren't really relevant to us other than a possible data source. It would be much more fruitful to be able to directly support data in Colab or Planetary Computer. There's some work in progress on the PC side, but I'm not sure what's available in Colab.

Geethen commented 2 years ago

Good point. Yeah, that is likely going to be the case.

At the same time, once it comes to taking algorithms to data, paywalls and restrictive quotas (Google Colab, including Pro, has become more restrictive) will hinder, if not prevent, the geospatial community from training models on global data. For the sake of TorchGeo, we may (at some point) get away with demos on small datasets in something like Colab + PC (hopefully sooner rather than later).

I haven't used PC much, but it felt like it still has a lot of ground to cover to catch up to GEE (understandable given that GEE has had a ~10-year head start). I am sure we all look forward to that day :)

Thanks for the exchange

adamjstewart commented 2 years ago

Yep, GEE used to have a lot more data, although I think PC might have already caught up on that front. GEE is still far more user-friendly and easier to scale, so it's winning for non-CS people. But GEE is also very limited because it doesn't support NNs. In that sense, GEE is ~10 years behind TorchGeo 😄

(The entire geospatial community is ~10 years behind the computer vision community; computer vision folks haven't used anything other than CNNs for over a decade.)

We're hoping to provide something as easy as possible for geospatial researchers who want to explore deep learning methods. Of course, TorchGeo isn't restricted to Colab or PC; you can use it on your laptop, on a supercomputer, or in the cloud (AWS, Azure, GCP, etc.). As long as you can get your hands on some data and can afford the compute time, you can use TorchGeo.

Geethen commented 2 years ago

I don't think so. Have you seen all the datasets on GEE? Even without the GEE community catalog, far more datasets are stored on GEE. But yes, Google is slow in adding new datasets. I think user-friendliness is welcome even for CS people (and non-geo people). There is (limited) support for NNs on GEE: you can upload a trained NN model to GEE and then run inference there. I have also seen a paper where someone coded an NN from the ground up in GEE.

TorchGeo is super useful for making it easier to build NNs (thanks for your work). Building traditional tabular models is still often the way to go because of how much easier it is. A lot of the time, the benefits of using a DNN are not worth the effort of gathering the data+labels required for these hungry models. My current mindset is to use an ensemble of both :)

There is one geospatial lab (Twitter handle: @xiaoxiang_zhu) I am aware of that is at the cutting edge of applying computer vision (+NLP) stuff for mapping.
