Update io_readLAScatalog.R to handle bad .laz or .las files

jstrunk001 commented 2 months ago

This modification to io_readLAScatalog.R allows "catalog" to handle corrupted .laz and .laz files with bad GUID values. The bad records are simply eliminated from the catalog. This is handled by pushing file handling to a couple of separate function. I am not sure if any of this code is compatible with the current lidR coding standards, but I personally use something like this as a replacement for the catalog() function in the lidR package because I often run into corrupt laz files...

The first suggested function is .fn_progress_headers() and handles progress bar updates and wraps the process of reading headers in a try() statement. This way, if reading a header fails, the catalog() function can simply carry on.

    #wrapper for: read one file, and update progress bar
    #fails gracefully
    .fn_progress_headers=function(progress,pb,i,N,t0,...){

      #update progress bar
      if (progress && (Sys.time() - t0 > getOption("lidR.progress.delay"))) {
        utils::setTxtProgressBar(pb, i)
      }

      #run one file - fail gracefully
      try(.fn_one_header(...), silent=T)

    }

The second suggested function is basically just a functionized extraction of code that JR already has in the main script

    #read one file header
    .fn_one_header = function(x,phblab){

          header <- rlas:::lasheaderreader(x)
          header <- LASheader(header)
          PHB <- header@PHB
          names(PHB) <- phblab

          #if (use_wktcs(header))
          if (T)
            PHB[["CRS"]] <- wkt(header)
          else
            PHB[["CRS"]] <- epsg(header)

          # Compatibility with rlas 1.3.0
          if (!is.null(PHB[["Number.of.points.by.return"]])) {
            PHB[["Number.of.1st.return"]] <- PHB[["Number.of.points.by.return"]][1]
            PHB[["Number.of.2nd.return"]] <- PHB[["Number.of.points.by.return"]][2]
            PHB[["Number.of.3rd.return"]] <- PHB[["Number.of.points.by.return"]][3]
            PHB[["Number.of.4th.return"]] <- PHB[["Number.of.points.by.return"]][4]
            PHB[["Number.of.5th.return"]] <- PHB[["Number.of.points.by.return"]][5]
            PHB[["Number.of.points.by.return"]] <- NULL
            PHB[["Global.Encoding"]] <- NULL
          }

          PHB$filename = x
          return(PHB)
    }

There are some further revisions in the body of readLAScatalog to modify how the progress bar is initiated and to handle the return of bad las or laz files including to issue a warning about the failed files.

    header <- LASheader(rlas::read.lasheader(files[1]))
    crs    <- st_crs(header)
    phblab <- make.names(names(phb(header)))
    phblab[4] <- "GUID"

    #set up progress bar
    N = length(files)
    t0 <- Sys.time()
    pb <- NULL
    i <- 0
    if(progress) pb <- utils::txtProgressBar(min = 0, max = length(files), initial = 0, style = 3)

    #get all headers - catch errors
    headers_list <- mapply(.fn_progress_headers, files, i=1:N, MoreArgs = list( phblab=phblab,progress=progress
                                                                                ,t0=t0,pb=pb, N=N), SIMPLIFY =F)
    #identify errors
    headers_err = sapply(headers_list,class) == "try-error"
    #warn errors
    if(sum(headers_err)>0) warning(paste("there were errors in", sum (headers_err)," files:", paste(files[headers_err],collapse=" ")))
    #combine remaining headers
    headers <- data.table::rbindlist(headers_list[!headers_err])

Jean-Romain commented 2 months ago

Please explain what was the original error. How a GUID could be invalid? It could be zeroed but I see no scenario where a GUID could prevent reading a file. Please explain me how your code solve the problem. I see a lot of code reshaping but can't grasp where is the actual fix especially without understanding what was the issue.

There is no way this change could be correct

#if (use_wktcs(header))
if (T)
   PHB[["CRS"]] <- wkt(header)
else
   PHB[["CRS"]] <- epsg(header)

jstrunk001 commented 2 months ago

There is no way this change could be correct
#if (use_wktcs(header))
if (T)
   PHB[["CRS"]] <- wkt(header)
else
   PHB[["CRS"]] <- epsg(header)

You are correct, this is not part of the suggested change above! I meant to revert this back to the original code. I used this to debug when I didn't have the lidR package loaded.

jstrunk001 commented 2 months ago

Please explain what was the original error. How a GUID could be invalid? It could be zeroed but I see no scenario where a GUID could prevent reading a file. Please explain me how your code solve the problem. I see a lot of code reshaping but can't grasp where is the actual fix especially without understanding what was the issue.

Here is the error:

The issue is how the catalog function handles corrupt .laz or .las files. For example, I have laz files for the entire state of Oregon. There is a single laz file that cannot be read by rlas, "header <- rlas:::lasheaderreader(x)". This means that after 10 minutes of scanning headers, the lidR::catalog() function fails with an error. lidR::catalog() does not have an option to bypass failed laz reads.

To fix this error, I encapsulate the lidar header reading / parsing code into a separate function. I initially only wrapped the header reading code in a try statement: code: 'header <- try(rlas:::lasheaderreader(x))' , but I found that there were cases when the header read was successful, but somehow the other parts of the header parsing code failed. Therefore, I simply placed the entire chunk of header reading / parsing code into an external function. I then use mapply across all input files to read all of the individual laz/las files with your existing header reading / parsing code - but encapsulated in a function. If it is in a separate function, we can wrap the call in a try() statement.

.fn_one_header = function(x,phblab){

      header <- rlas:::lasheaderreader(x)
      header <- LASheader(header)
      PHB <- header@PHB
      names(PHB) <- phblab

      #if (use_wktcs(header))
      if (T)
        PHB[["CRS"]] <- wkt(header)
      else
        PHB[["CRS"]] <- epsg(header)

      # Compatibility with rlas 1.3.0
      if (!is.null(PHB[["Number.of.points.by.return"]])) {
        PHB[["Number.of.1st.return"]] <- PHB[["Number.of.points.by.return"]][1]
        PHB[["Number.of.2nd.return"]] <- PHB[["Number.of.points.by.return"]][2]
        PHB[["Number.of.3rd.return"]] <- PHB[["Number.of.points.by.return"]][3]
        PHB[["Number.of.4th.return"]] <- PHB[["Number.of.points.by.return"]][4]
        PHB[["Number.of.5th.return"]] <- PHB[["Number.of.points.by.return"]][5]
        PHB[["Number.of.points.by.return"]] <- NULL
        PHB[["Global.Encoding"]] <- NULL
      }

      PHB$filename = x
      return(PHB)
}

I made a separate function that wraps the header reading call in a try() statement. I handle the progress bar at the same time.

.fn_progress_headers=function(progress,pb,i,N,t0,...){

  #update progress bar
  if (progress && (Sys.time() - t0 > getOption("lidR.progress.delay"))) {
    utils::setTxtProgressBar(pb, i)
  }

  #run one file - fail gracefully
  try(.fn_one_header(...), silent=T)

}

Jean-Romain commented 2 months ago

I see. Can you please share the problematic file?

jstrunk001 commented 2 months ago

u please share the problematic file?

Do you have a place to drop this 2 gb file?

jstrunk001 commented 2 months ago

I see. Can you please share the problematic file?

although I think it is important to build in some error handling into the catalog function to have the capacity to skip bad files

Jean-Romain commented 2 months ago

Thank you for pointing out the issue. I initially missed the actual error because it was sent as a screenshot rather than copied text.

The error message could definitely be improved to be more specific and informative.
The issue you're reporting can be addressed in just a few lines of code—there's no need for a major code overhaul.
Regarding readLAScatalog, I’m not entirely convinced it should automatically skip invalid files. Since the file is corrupted, it should ideally be removed or repaired rather than skipped. However, I understand there are pros and cons to this approach. I'll try to reproduce the issue on my end and assess the best course of action.

Jean-Romain commented 2 months ago

Problem addressed in 39d1f19

jstrunk001 commented 2 months ago

Great!!

On Mon, Sep 9, 2024, 5:10 AM Jean-Romain @.***> wrote:

Problem addressed in 39d1f19 https://github.com/r-lidar/lidR/commit/39d1f196b22fa8ee56dcbe73dbd337102df54ffd

— Reply to this email directly, view it on GitHub https://github.com/r-lidar/lidR/pull/776#issuecomment-2337953112, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGIEGC4UB2UYOBBH7KX23ZTZVWF4LAVCNFSM6AAAAABNTBTTA6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZXHE2TGMJRGI . You are receiving this because you authored the thread.Message ID: @.***>

r-lidar / lidR

Update io_readLAScatalog.R to handle bad .laz or .las files #776