kadyb opened 1 year ago
Nice topic!
we have terabytes of remotely acquired data,
I think a major challenge is that we have petabytes rather than terabytes of RS data, and that that requires very different tools, skills and resources to manage than what R is meant for.
is the R ecosystem ready for that and do we have the specialists?
I personally don't think of this in terms of "ready or not" (nor in terms of "recent leaps"), but several people, including me, are actively working in this direction, and some more coordination between them would probably be good for everyone. Another question is whether the RS community is ready for R, or more in general for doing statistical inference rather than ML.
Thanks for your comment!
Another question is whether the RS community is ready for R (...)
I think there has always been a strong connection between R and RS, especially in research (see the {raster} citations). However, in the past only a handful of satellite scenes were analyzed at a time, whereas nowadays, as you mentioned, we have petabytes of data to analyze, which is a challenge.
Heh, a question I have been asking myself for a while...
Basically, we have terabytes of remotely acquired data
@edzer is right, peta. And IMHO it is not about the data that is available, but rather about the knowledge of what's behind the data and how to utilize it. And then, how to process it. I would say that the tools are available, in one form or another; it's just a matter of choosing the proper one and building a stable workflow.
Do you also have the impression that there has been a technological leap recently, but the available teaching materials and workflows are outdated?
Yes, they are. They are scattered across many blogs, publications or just vignettes. Some of them are very superficial and only touch on the subject, some are a bit better. I had a dream :) of a publication which shows the power of open data in spatial science: what the outcomes might be, where to get the data and how to process it. If your goal is to spend the rest of your life working on such a multi-volume living edition, then I'm in.
Regards, Grzegorz
Here is @appelmar's great presentation and examples about this topic: https://appelmar.github.io/CONAE_2022/
Dear @kadyb, @edzer, @appelmar, @gsapijaszko and others,
@kadyb asked me to share my thoughts on the matter here. Apologies in advance for a long thread. In what follows, I will concentrate on topics related to big Earth observation (EO) data analysis in R, which by extension also implies dealing with small data sets.
(a) Access to big EO data collections: apart from GEE, most EO cloud providers support the STAC protocol. In R, supporting STAC has been well addressed by the rstac package, which is operational and reliable.
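For reference, a minimal sketch of what a STAC query with rstac might look like (the endpoint is the Microsoft Planetary Computer; collection name, bounding box and dates are placeholders):

```r
# Minimal sketch: querying Sentinel-2 items from a STAC API with rstac.
# Collection name, bbox and dates are placeholders.
library(rstac)

items <- stac("https://planetarycomputer.microsoft.com/api/stac/v1") |>
  stac_search(
    collections = "sentinel-2-l2a",
    bbox        = c(-55.2, -11.6, -54.9, -11.3),  # xmin, ymin, xmax, ymax (WGS84)
    datetime    = "2022-01-01/2022-12-31",
    limit       = 100
  ) |>
  post_request() |>
  items_sign(sign_fn = sign_planetary_computer())  # sign asset URLs for MPC

length(items$features)  # number of matching scenes
```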
(b) Creation of EO data cubes from cloud collections: this is an issue vastly underestimated by EO developers, especially those in Python. Part of the problem is the conflation between the abstract concept of data cubes and the concrete data structures of xarray. Python packages such as OpenDataCube and Xcube make this mistake. It is very easy for them to develop Jupyter notebooks using xarray to process small areas and provide nice examples. It is hard to use xarray to process big areas that do not fit into the memory of a VM. In the R world, thanks to @appelmar and @edzer, gdalcubes solves the problem, using efficient parallel processing to actually create data cubes on disk. I can hear some of you screaming: "is this data materialization necessary?", "can't we work only with the raw data?". Believe me, building the actual data cubes on disk is essential.
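A sketch of how this looks with gdalcubes, continuing from the rstac result above (band names, resolution, time step and the parallel setting are illustrative assumptions):

```r
# Sketch: turning STAC items (the `items` object from the rstac example above)
# into a regular, on-disk data cube with gdalcubes. Bands, resolution and the
# number of workers are illustrative.
library(gdalcubes)

gdalcubes_options(parallel = 8)            # worker processes; adjust to your VM

col <- stac_image_collection(
  items$features,
  asset_names = c("B04", "B08", "SCL")     # red, NIR, scene classification
)

view <- cube_view(
  srs = "EPSG:32721", extent = col,
  dx = 10, dy = 10, dt = "P16D",           # 10 m pixels, 16-day time steps
  aggregation = "median", resampling = "bilinear"
)

raster_cube(col, view) |>
  select_bands(c("B04", "B08")) |>
  apply_pixel("(B08-B04)/(B08+B04)", names = "NDVI") |>
  write_ncdf("ndvi_cube.nc")               # materialize the cube on disk
```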
(c) Satellite image time series analysis: despite my obvious conflict of interest, I will argue that sits is operational and reliable. It has been used for large data products in Brazil, such as the LUCC maps of Amazonia (4 million km2) and Cerrado (2 million km2) with 10-meter Sentinel data. The development of sits has only been possible because it relies on stars, sf, gdalcubes, rstac, torch, e1071, caret, terra, tmap, leafem, Rcpp and the tidyverse. Since there is no comparable basis in Python, arguably an end-to-end package such as sits would only be possible in R.
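For readers unfamiliar with sits, a hedged sketch of the end-to-end workflow (the collection, tile, dates and the `samples` tibble of labelled time series are placeholders, not the actual Amazonia/Cerrado setup):

```r
# Sketch of a sits workflow: build a cube from a cloud collection, regularize
# it, train a model and classify. `samples` (labelled time series) and all
# parameters are placeholders.
library(sits)

cube <- sits_cube(
  source     = "MPC",
  collection = "SENTINEL-2-L2A",
  tiles      = "20LKP",
  bands      = c("B02", "B03", "B04", "B08", "CLOUD"),
  start_date = "2022-01-01",
  end_date   = "2022-12-31"
)

reg_cube <- sits_regularize(cube, period = "P16D", res = 10,
                            output_dir = tempdir(), multicores = 4)

model <- sits_train(samples, ml_method = sits_rfor())        # random forest
probs <- sits_classify(reg_cube, model, output_dir = tempdir(),
                       multicores = 4, memsize = 16)
map   <- sits_label_classification(probs, output_dir = tempdir())
```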
(d) Single image analysis: remarkably, there is currently no package in R that supports traditional 2D image processing using data cubes. gdalcubes does a great job in providing image operations, but R also needs a package that provides ML/DL algorithms ranging from random forests to CNNs and transformers.
(e) Object-based image analysis: another area where R is lacking in effort. We need a package that would do OBIA in connection with EO data cubes. Even simple region-growing segmentation algorithms are missing.
(f) Deep learning algorithms: another area where R lags behind Python. For image time series, our team at INPE was fortunate to be supported by Charlotte Pelletier, who helped us convert current state-of-the-art PyTorch algorithms, such as her own tempCNN and the temporal attention encoders from Vivien Garnot, to R. However, these DL algorithms are designed for image time series (which are the focus of sits). We lack 2D DL-based methods in R.
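To make the connection to sits concrete: in that package these deep learning methods are drop-in `ml_method` arguments (sketch; `samples` is again a placeholder set of labelled time series):

```r
# Sketch: temporal deep learning models plug into sits_train() in the same
# way as classical ML methods. `samples` is a placeholder.
library(sits)

tcnn_model <- sits_train(samples, ml_method = sits_tempcnn())   # Pelletier's tempCNN
tae_model  <- sits_train(samples, ml_method = sits_lighttae())  # light temporal attention encoder
```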
(g) Converting DL algorithms from PyTorch to R: translating from PyTorch to the R torch package is relatively straightforward, provided the PyTorch code is clean (which is not always the case). We invested months of work to translate the transformer-based algorithms; the hard part was understanding the algorithm (transformers are a mess). With time and effort, we can adapt any EO algorithm developed in PyTorch to run in R. We have to accept this situation, since young coders love to work in PyTorch.
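As an illustration of how direct the mapping is, here is a small (invented) convolutional block written with the R torch package; the correspondence to PyTorch is one-to-one (nn.Module -> nn_module, nn.Conv1d -> nn_conv1d, F.relu -> nnf_relu):

```r
# Invented example: a 1D convolutional block for multi-band time series,
# written with the R torch package, mirroring what one would write in PyTorch.
library(torch)

conv_block <- nn_module(
  initialize = function(in_channels, out_channels, kernel_size = 3) {
    self$conv <- nn_conv1d(in_channels, out_channels, kernel_size, padding = 1)
    self$bn   <- nn_batch_norm1d(out_channels)
  },
  forward = function(x) {
    x <- self$conv(x)
    x <- self$bn(x)
    nnf_relu(x)
  }
)

block <- conv_block(in_channels = 12, out_channels = 64)
x <- torch_randn(8, 12, 23)   # batch of 8 series, 12 bands, 23 time steps
dim(block(x))                 # 8 64 23
```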
(h) Image data formats: here, there appears to be an ongoing battle between Zarr and COG. Most data providers, such as MPC and AWS, provide their 2D imagery as COG. However, for the multidimensional data used in oceanography and meteorology, Zarr is replacing netCDF. For R developers, the good news is that GDAL supports both Zarr and COG. The bad news is that their significant differences have relevant consequences for efficiency when processing large data sets.
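For completeness, both formats can already be read from R through GDAL; a sketch (the URL and the Zarr path are placeholders, not real datasets):

```r
# Sketch: reading COG and Zarr through GDAL from R. Paths are placeholders.
library(terra)
library(stars)

# COG over HTTP: GDAL's /vsicurl/ fetches only the requested blocks
r <- rast("/vsicurl/https://example.com/scene_B04.tif")

# Zarr store: GDAL's multidimensional API, exposed in stars via read_mdim()
z <- read_mdim('ZARR:"/data/ocean_temperature.zarr"')
```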
(i) Parallel processing using COG: the COG format is well-suited for parallel processing. Since images are stored in chunks, processing large data sets on virtual machines with many cores is (conceptually) simple: each VM core receives a chunk, processes it, and writes the result to disk; partial results are then joined to produce the final product. All of this happens behind the scenes. By contrast, Python users who use xarray have to do a considerable amount of work to connect their data to dask clusters. Current xarray-based EO data processing in Python is not user-friendly. I have yet to see really large-scale work (more than 1 million km2 at 10-meter resolution) done with xarray and dask.
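To make the chunk-based pattern concrete, here is a conceptual sketch in plain terra + parallel (this is not the sits implementation; the COG URL and the per-chunk computation are placeholders):

```r
# Conceptual sketch of chunked parallel processing of a COG. Each worker opens
# the file itself, reads only its block via GDAL range requests, processes it
# and writes a partial result; partial results are merged at the end.
library(terra)
library(parallel)

url    <- "/vsicurl/https://example.com/scene_B08.tif"   # placeholder COG
r_full <- rast(url)                                       # reads metadata only
e      <- ext(r_full)

# split the full extent into a 4 x 4 grid of chunk extents (stored as plain
# numbers so that no terra objects need to be shipped to the workers)
xs <- seq(xmin(e), xmax(e), length.out = 5)
ys <- seq(ymin(e), ymax(e), length.out = 5)
chunks <- list()
for (i in 1:4) for (j in 1:4)
  chunks[[length(chunks) + 1]] <- c(xs[i], xs[i + 1], ys[j], ys[j + 1])

process_chunk <- function(k) {
  r   <- rast(url)                             # each worker opens the COG itself
  out <- crop(r, ext(chunks[[k]])) / 10000     # placeholder per-chunk computation
  f   <- sprintf("chunk_%02d.tif", k)
  writeRaster(out, f, overwrite = TRUE)
  f
}

files  <- mclapply(seq_along(chunks), process_chunk, mc.cores = 4)
result <- merge(sprc(unlist(files)))           # mosaic the partial results
```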
(j) Active and self-supervised learning: this is another area where we need serious investment. The quality of training data is the defining factor in EO product accuracy. However, there is currently much more effort (as measured by published papers) in new algorithms than in better learning methods. I work directly with remote sensing experts who understand the landscape. They go to the field and assign labels to places. Then, they take an image of the area close to the date when they were in the field, and use the samples collected to classify the image. In the large-scale work we are doing in Brazil, it is hard for them to assign labels to multi-temporal data: bare soil areas in the dry season will be wetlands in the rainy season; there may be two or three crops per year; farmers may mix grain production with cattle raising. Further progress on big EO data analysis requires methods to support data labelling. The need for self-supervised learning is well recognised in the deep learning community. Please read what Yann LeCun (VP of Meta for AI) wrote in this blog post:
https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/
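In sits, building blocks for such an active-learning loop already exist; a sketch, assuming the uncertainty-sampling helpers behave roughly as below (names and arguments should be checked against the sits documentation; `probs` is a probability cube from a previous sits_classify() run):

```r
# Sketch of an active-learning step: compute per-pixel uncertainty from a
# probability cube and select the most uncertain locations as candidates for
# new field labels. `probs` is a placeholder probability cube.
library(sits)

uncert <- sits_uncertainty(probs, type = "entropy", output_dir = tempdir())

new_points <- sits_uncertainty_sampling(uncert, n = 100,
                                        min_uncert = 0.5,
                                        sampling_window = 10)
```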
(k) GEE vs. MPC: for me, there is no choice. GEE is frozen. In MPC, we can make real progress, developing and testing new methods.
To sum up, and to consider @kadyb's question:
Basically, we have terabytes of remotely acquired data, but is the R ecosystem ready for that and do we have the specialists? What are your thoughts?
I firmly believe that progress in big EO data analysis is driven by community efforts, not by independent developers producing the next algorithm. The R ecosystem is much more community-driven than Python's. Developing sits taught me a lot about sharing, reuse and trust in your fellow developers. We need to continue working together. The well-known proverb applies: "if you want to go fast, go alone; if you want to go far, go together".
Thanks for this thread. I have a question for @gilbertocamara: you mention that "The COG data format is well-suited for parallel processing. Since images are stored in chunks, processing large data sets in virtual machines with many cores is (conceptually) simple". Have you actually seen any of this being implemented? I have a colleague who is very good with DASK, so it would be very interesting to do some benchmarking of these two approaches.
Dear @derek-corcoran-barrios, we use the "sits" package for large-scale land use classification of Brazilian biomes such as the Amazon (4 million km2) and the Cerrado (2 million km2). For your information, the LUCC classification of the Amazon uses 4,000 Sentinel-2 images, each with 10-meter resolution and 10,000 x 10,000 pixels. The data cube has 12 bands per image and 23 temporal instances per year. The total data is 220 TB. 150,000 samples were selected to train the deep learning classification algorithm. These data sets do not fit in main memory, even in the case of large virtual machines. The "sits" software optimizes I/O access to large EO collections using COG files.
The comparison with Python (xarray/DASK) has to be considered with care. Xarray is primarily an in-memory data structure, and DASK would then be limited to parallelizing data in memory. For big areas such as the Brazilian biomes, the combination of xarray/DASK faces scalability challenges.
We can provide a script that uses data in Microsoft Planetary Computer to run a classification of a large area in Brazil and provide the training data so that you can try to replicate the result using xarray/DASK and we can compare the results.
Thanks @gilbertocamara, that would be amazing. We have a hybrid Python and R team for spatial analyses in our workplace, and we are currently debating the best packages/libraries and formats. It would be awesome if you could provide such a script.
Thanks
Since the first post on this topic, the situation has improved and more training materials and examples have appeared. Here is the list of interesting tutorials on processing large Earth Observation (EO) data in R:
STAC (SpatioTemporal Asset Catalogs):
STAC with GDALCUBES:
SITS (Satellite Image Time Series):
openEO:
rsi (Retrieving Satellite Imagery):
Very happy to see that first link here! FWIW, my understanding is that those tutorials are going to be incorporated into stacspec.org in a few months (this is attached to a redesign of the website, so no precise timeline), at which point that'll be the actual URL. Just wanted to drop crumbs for future readers in case that link stops working (and to say that, should anyone have feedback on the tutorials, I'd love to incorporate suggestions!)
For more resources on SITS and satellite image time series in general, please see my interview with Robin Cole: https://www.youtube.com/watch?v=0_wt_m6DoyI
@kadyb, the dtwSat package offers an additional resource for land use mapping, utilizing multi-dimensional (multi-band) satellite image time series: https://r-spatial.github.io/dtwSat.
I would like to raise this topic and hear your opinions about the status of remote sensing in R. Do you also have the impression that there has been a technological leap recently, but the available teaching materials and workflows are outdated? There are so many new things compared to the past that it can be overwhelming. Here I have listed some hot topics:
There is probably a lot more, but I am not aware of it.
I think the most popular book for remote sensing in R is the one by Aniruddha Ghosh and Robert Hijmans (https://rspatial.org/terra/rs/index.html), but it covers very basic topics and focuses on local rather than cloud computing. Online tutorials also focus on very simple topics and use the old {raster} package. Nevertheless, most of the things mentioned are covered on the r-spatial blog.

Basically, we have terabytes of remotely acquired data, but is the R ecosystem ready for that and do we have the specialists? What are your thoughts?