r-lidar / lidR

Airborne LiDAR data manipulation and visualisation for forestry applications
https://CRAN.R-project.org/package=lidR
GNU General Public License v3.0

Workaround for limited hard drive space: good or bad idea? #778

Closed: aeiche01 closed 2 months ago

aeiche01 commented 2 months ago

The LAScatalog approach is awesome, but it seems to work only if I can store all of the lidar data on my computer, cloud storage, etc. I'm running into an issue, however, where I want to analyze a large area (e.g., an entire U.S. state) and don't have the space to store all the laz files I'll need. I came up with a possible workaround, but I'm not sure whether it's a good idea. If it is, it might be something to include as an option in lidR.

Basically, the idea is to split the large ROI into smaller overlapping segments, much as a LAScatalog already chunks the las/laz files. Then we use an API to download the lidar files for one segment to disk, analyze that segment as a LAScatalog, and delete the lidar files that are not needed for the next segment (since the segments overlap, some lidar tiles will appear in multiple edge-sharing ROIs). We repeat this until we've covered the entire large ROI, at which point we can merge the outputs into what we need (see the merge sketch after the code).

Is this a good idea, or am I making some major mistake or bad assumption? I wrote up some code to demonstrate how this could work. I haven't run it yet, so there are probably bugs, but it covers what I was thinking. It uses dsmSearch, a package that grabs lidar data from the US National Map:

library(sf)

# Load the shapefile of the entire large ROI. The ROI covers an area where,
# if we downloaded all the lidar files at once, there would be too much data to store.
large_roi <- st_read("your_shapefile.shp")

# Create a grid over the ROI (adjust the cell size as needed; units follow the CRS, here meters)
grid <- st_make_grid(large_roi, cellsize = c(5000, 5000))

# Expand each grid cell by an overlap distance (e.g., 100 m, so adjacent cells
# overlap by 200 m) so we can still cover for any edge problems
grid_overlap <- st_buffer(grid, dist = 100)

# Split the shapefile into smaller, overlapping sections to work with
large_roi_split <- st_intersection(large_roi, grid_overlap)

library(dsmSearch)
library(lidR)

# Take the sections one at a time
for (i in 1:nrow(large_roi_split)) {
  roi <- large_roi_split[i, ]

  # Get the URLs of the lidar files that cover this ROI
  lidar_file_urls <- dsmSearch::lidar_search(st_bbox(roi))$downloadLazURL

  # Folder in which to save the .laz files
  folder_path <- "path/to/directory/"

  # Extract the file names from the URLs
  file_names <- basename(lidar_file_urls)

  # Check which files already exist in the directory
  existing_files <- list.files(folder_path)

  # Keep only the URLs of files that are not already on disk
  lidar_file_urls <- lidar_file_urls[!file_names %in% existing_files]

  # Download the missing files (mode = "wb" so the binary .laz files
  # are not corrupted on Windows)
  for (j in seq_along(lidar_file_urls)) {
    download.file(lidar_file_urls[j],
                  paste0(folder_path, basename(lidar_file_urls[j])),
                  mode = "wb")
  }

  # Start the LAScatalog process with everything now on disk
  lidar_file_names <- list.files(folder_path, full.names = TRUE)

  # Create the catalog (keep only ground points, since rasterize_terrain()
  # needs them; a first-return filter would drop most ground points under canopy)
  ctg <- readLAScatalog(lidar_file_names, filter = "-keep_class 2")

  # Create the .lax index files to speed up the analysis process
  catalog_laxindex(ctg)

  # Do whatever analysis we wanted to do. In this example, it's just a terrain rasterization.
  opt_chunk_size(ctg) <- 500
  opt_output_files(ctg) <- "new/output/folder/DTM_chunk_{XLEFT}_{YBOTTOM}"

  # Rasters are written to disk chunk by chunk
  rasterize_terrain(ctg, res = 1, algorithm = tin())

  # If this is not the last of the smaller ROI pieces
  if (i < nrow(large_roi_split)) { # compare against the number of ROIs, not the number of URLs
    # The next ROI may need some of the same files (because of the overlap)
    next_roi <- large_roi_split[i + 1, ]

    # Get the URLs of the lidar files that cover the next ROI
    next_lidar_file_urls <- dsmSearch::lidar_search(st_bbox(next_roi))$downloadLazURL

    # Extract the file names from the URLs
    next_file_names <- basename(next_lidar_file_urls)

    # Check which files currently exist in the directory
    existing_files <- list.files(folder_path)

    # Build the list of files to delete, keeping the ones the next ROI will reuse
    remove_files <- existing_files[!existing_files %in% next_file_names]

    # Delete the files we don't need anymore
    file.remove(paste0(folder_path, remove_files))

  } else { # this was the last of the smaller ROI pieces
    print("Done")
  }
}
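
For the final merge step mentioned above, something like this could work (a minimal sketch, assuming the DTM chunks were written as GeoTIFFs to new/output/folder and share the same CRS):

library(terra)
# Collect the DTM chunks written to disk by rasterize_terrain()
dtm_files <- list.files("new/output/folder", pattern = "\\.tif$", full.names = TRUE)
# Merge them into a single raster; merge() resolves the overlapping chunk edges
dtm_full <- merge(sprc(dtm_files))
writeRaster(dtm_full, "DTM_full.tif", overwrite = TRUE)

For very large mosaics, terra::vrt(dtm_files) builds a virtual mosaic instead, which avoids materializing the whole merged DTM as one file.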
Jean-Romain commented 2 months ago

The problem you are describing applies to any other software on the market; it is not specific to lidR.

Your idea is to create a kind of hierarchy: a meta-catalog associated with files you don't have on disk, downloading the files you need on the fly and then deleting those that are no longer needed. In that case, why do you need a meta-catalog? We could do the same with a regular catalog without adding a layer of complexity.

Also, lidR was never designed to process state-wide datasets; it was designed as an R&D toolbox and has limitations when it comes to processing very large datasets. You should have a look at lasR. It does not solve the problem you are describing here, but it handles large datasets much better.
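
For reference, the equivalent DTM task in lasR looks roughly like this. This is a sketch based on the pipeline pattern in the lasR documentation, not verified against the current release, so the exact function names and signatures may differ between versions:

library(lasR)
# Triangulate the ground points, then rasterize the TIN at 1 m resolution
tri <- triangulate(filter = keep_ground())
dtm <- rasterize(1, tri)
# Run the pipeline over a folder of .laz files without loading everything at once
exec(tri + dtm, on = "path/to/directory/")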

Last but not least, my software packages lidR and lasR are no longer supported by my university. While the software will remain free and open source, I am now self-employed to sustain their development, offering my services independently for training courses, consulting, and development. If you are able to sponsor this feature, it is something I could develop for your custom needs. For more information, please visit my website: https://www.r-lidar.com/.