kvos / CoastSat

Global shoreline mapping tool from satellite imagery
http://coastsat.space
GNU General Public License v3.0

Why did you decide to download the images? #322

Open 3enedix opened 2 years ago

3enedix commented 2 years ago

Hi all,

first of all, thanks for your incredible work, this toolbox is exactly what I need.

Okay, almost exactly. I would like to extract (sandy, muddy and mangrovy) shorelines worldwide, over a period of approximately 30 years. To avoid having to buy loads of hard drives and run the code hundreds of times, I was hoping that shorelines could be extracted from images stored only temporarily in a variable, before looping on to the next timestep/place.

Is there a special reason why you decided to download the images to a local storage? Do you think it would be possible to extract shorelines without downloading the images?

I am quite new to GEE and would appreciate every hint!

(Hope this is the right place for this question and it is not already answered in the other 213 issues...) Best wishes, Bene

kvos commented 2 years ago

hi @CharliesWelt , it's a good question and the right channel to discuss this topic.

The CoastSat package uses GEE to filter the image collections, select the bands of interest and crop the images to the region of interest, then downloads the .tif files. The analysis is then done locally, with Python libraries like scikit-image, scikit-learn, shapely, GDAL etc. The advantage of this workflow is that we have full control over the image (pixel by pixel) and can extract the shoreline at sub-pixel resolution, optimise the thresholding algorithm, discard bad images, quality-control the shorelines, and more.
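
For anyone new to GEE, that first step looks roughly like the sketch below with the GEE Python API (illustrative only, not the actual CoastSat code; the collection ID, band names and polygon are placeholders):

```python
import ee

ee.Initialize()  # requires prior `earthengine authenticate`

# placeholder region of interest (a small lon/lat box) and date range
polygon = ee.Geometry.Polygon([[[151.29, -33.72], [151.33, -33.72],
                                [151.33, -33.75], [151.29, -33.75]]])

# filter the Landsat 8 TOA collection to the ROI and dates, keep a few bands
collection = (ee.ImageCollection('LANDSAT/LC08/C02/T1_TOA')
              .filterBounds(polygon)
              .filterDate('2020-01-01', '2020-12-31')
              .select(['B2', 'B3', 'B4', 'B5', 'B6']))

# crop one image to the ROI and request a GeoTIFF download URL;
# the downloaded .tif is then analysed locally (scikit-image, scikit-learn, ...)
image = ee.Image(collection.first()).clip(polygon)
url = image.getDownloadURL({'region': polygon, 'scale': 30, 'format': 'GEO_TIFF'})
```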

Others have developed a different approach where everything is done on the GEE servers; you can look at the work by Luijendijk et al. 2018 at a global scale using yearly composites (sounds very similar to what you are proposing to do). You can process images directly on the cloud with the GEE API, but with more limited functionality and less control over the individual pixels of the image. Also, keep in mind that the GEE code is not open-source, so you can't inspect the source to know exactly what each function is doing.
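
For contrast, a fully server-side workflow in the style of those global studies might look something like this rough sketch (again just illustrative, not their actual code): a yearly composite and a simple water index are computed on the GEE servers and no pixels are ever downloaded, but you are limited to the operations the API exposes.

```python
import ee

ee.Initialize()

polygon = ee.Geometry.Polygon([[[151.29, -33.72], [151.33, -33.72],
                                [151.33, -33.75], [151.29, -33.75]]])

# yearly median composite, computed entirely on the GEE servers
composite = (ee.ImageCollection('LANDSAT/LC08/C02/T1_TOA')
             .filterBounds(polygon)
             .filterDate('2020-01-01', '2020-12-31')
             .median())

# green/NIR water index and a crude threshold, also server-side:
# no sub-pixel refinement, no per-image quality control
ndwi = composite.normalizedDifference(['B3', 'B5']).rename('NDWI')
water = ndwi.gt(0.0)
```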

I personally use loads of hard drives, as you mentioned, to generate the shoreline time-series over large spatial scales, see for example the CoastSat website. I like to keep a copy of the images in case I need to reprocess the datasets, but you could very well delete the images after extracting the shorelines to save disk space. From my experience, timewise, the bottleneck is the image downloads, as the extraction of the shorelines is very fast (as long as you break down the coast into small polygons, ~25-30 sqkm seems to be the optimum).
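
If disk space is the main concern, a minimal sketch of that per-site delete-after-processing pattern could look like the helper below (the `inputs` and `settings` dictionaries are the ones from the example notebook, and the helper itself is just illustrative):

```python
import os
import shutil

from coastsat import SDS_download, SDS_shoreline

def process_site(inputs, settings):
    """Download one site's imagery, extract the shorelines, then delete the rasters."""
    metadata = SDS_download.retrieve_images(inputs)                # writes the .tif files to disk
    output = SDS_shoreline.extract_shorelines(metadata, settings)  # shoreline coordinates + dates
    # keep only the lightweight shoreline output, then reclaim the disk space
    shutil.rmtree(os.path.join(inputs['filepath'], inputs['sitename']), ignore_errors=True)
    return output
```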

Good luck with your project, Kilian

dbuscombe-usgs commented 2 years ago

This is a nice discussion, and since I have thought about some of these issues, I would also like to chime in and add the following reasons why a local workflow generally makes sense:

  1. cloud computing is costly and may be a barrier to uptake (it is for me, for example; I have no institutionally provided access to cloud computing). That said, you could run CoastSat on a cloud provider, either from a terminal with an X-server for graphics or via JupyterHub, so that you are not downloading images to your personal machine
  2. cloud computing would make more sense if the processing routines in CoastSat were computationally demanding, but they are not particularly so. Download times would not necessarily be faster on a cloud provider unless you were working directly in GEE
  3. image classifiers are not perfect; if you wish to develop your own classifier, that is easiest as a local workflow because it may require iteration. Also, if new, better classifiers are developed in the future, you can simply point them at the imagery you have already downloaded
  4. it would be nice to simply download the shorelines and other results from the cloud computer, but when there are errors it is often instructive to visualize the images themselves. Typically, cloud computers are VMs with a finite life, so you would have to download everything eventually anyway if you wanted to archive your entire project

3enedix commented 2 years ago

Hi Kilian and Dan,

thanks for explaining your thoughts! I see a lot of good points (especially the reproducibility argument), have to think about others and learn more GEE. So far I had naively assumed that, since one can 'store' the image in a variable (with the Python API), it should be possible to use functions from other toolboxes to manipulate that variable. But apparently that's wrong... I will keep learning. And I agree that using another cloud server would not help, as it would still require downloading the images, only then to the server.
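
If I now understand it correctly, the catch is that an `ee.Image` held in a Python variable is only a lazy server-side reference, not an array of pixel values, so local toolboxes like scikit-image cannot operate on it until the data is explicitly transferred. A rough sketch of what that transfer looks like (with a placeholder asset ID):

```python
import ee

ee.Initialize()

# server-side reference only: no pixel values exist on the local machine
img = ee.Image('LANDSAT/LC08/C02/T1_TOA/LC08_089083_20200101')  # placeholder asset ID
ndwi = img.normalizedDifference(['B3', 'B5'])                   # still server-side

# pixels appear locally only after an explicit transfer, e.g. a small patch
region = ee.Geometry.Rectangle([151.29, -33.75, 151.33, -33.72])
patch = ndwi.sampleRectangle(region=region, defaultValue=0).getInfo()
values = patch['properties']['nd']  # nested lists a local library can finally work with
```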

Thank you!

neon-ninja commented 1 month ago

> timewise, the bottleneck is on the image downloads

Is it possible to download multiple images in parallel? Perhaps with https://tqdm.github.io/docs/contrib.concurrent/?

Edit to answer my own question - yes - as long as you parallelise by site
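
Roughly along these lines (the site polygons, dates and names below are placeholders; the `retrieve_images` call and the `inputs` keys follow the CoastSat example notebook):

```python
from tqdm.contrib.concurrent import thread_map

from coastsat import SDS_download

def make_inputs(sitename, polygon):
    # one inputs dict per site; dates and satellites here are placeholders
    return {'polygon': polygon, 'dates': ['2020-01-01', '2020-12-31'],
            'sat_list': ['L8', 'S2'], 'sitename': sitename, 'filepath': 'data'}

# placeholder ROIs: each site gets its own polygon and its own output folder
sites = [make_inputs('SITE_0001', [[[151.29, -33.70], [151.33, -33.70],
                                    [151.33, -33.73], [151.29, -33.73]]]),
         make_inputs('SITE_0002', [[[151.29, -33.74], [151.33, -33.74],
                                    [151.33, -33.77], [151.29, -33.77]]])]

# parallelise by site: each worker thread downloads a different site's imagery,
# so the per-site folders on disk never collide
all_metadata = thread_map(SDS_download.retrieve_images, sites, max_workers=4)
```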