r-lidar / lasR

Fast and Pipable Airborne LiDAR Data Tools
https://r-lidar.github.io/lasR/
GNU General Public License v3.0

write_vpc() does not populate datetime #44

Closed wiesehahn closed 2 months ago

wiesehahn commented 2 months ago

It is populated as "datetime": "0-01-01T00:00:00Z" for all input files.

Instead I guess it should be populated from

las@header$`File Creation Day of Year`

and

las@header$`File Creation Year`

at least it seems this is what PDAL/QGIS does.

But actually I am wondering how these attributes are defined, in the specification it says

The year, expressed as a four digit number, in which the file was created.

In the cases I have seen so far, it is literally the date the file was created (after initial processing). However, I guess the date/time the data was gathered is more relevant in most cases. Do you know anything about the spec?

Jean-Romain commented 2 months ago

Absolutely, it should be populated with File Creation Day of Year and File Creation Year. These are the only values that can be used to fill this attribute.

I have already written the code to compute the date. However, in its current state, write_vpc is a special stage that does not function like the other stages (it is evaluated before any other stages outside the pipeline), in a place where we do not have access to these attributes of the header. This stage must be rewritten in depth to behave like other stages and to be executed in the pipeline. This would require a significant amount of work for very little added value (just a date that we are usually not even sure what it corresponds to).

Thus, it is a low priority on my to-do list.

wiesehahn commented 2 months ago

It is quite confusing to me that these attributes do not actually refer to the acquisition date, and it seems there is no such attribute in the header. I understand that it can differ for each point; however, for small tiles it will mostly be the same day.

Anyway, would it then make sense to give some option to define datetime manually if we cannot get it correctly from the pointcloud file?

Jean-Romain commented 2 months ago

Anyway, would it then make sense to give some option to define datetime manually if we cannot get it correctly from the pointcloud file?

It is actually even easier to query the date from the files in the current state of the code. So I did it. The datetime is now written.

wiesehahn commented 1 month ago

It is actually even easier to query the date from the files in the current state of the code. So I did it. The datetime is now written.

So right now datetime is queried from File Creation Day of Year and File Creation Year?

While this is also done in PDAL, I think it is not really correct; or at least, most people/use cases working with STAC collections will likely want to query by acquisition date and not by the date the data was processed (which, I learned, is what File Creation Day means).

Actually, the field created would fit better than datetime.

I guess it's not possible to query timestamps from individual points here, because then the entire file has to be read? If this were possible without drastically slowing down the creation process, it might be a good option to populate start_datetime and end_datetime (see https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#datetime-selection) with the min and max timestamps.
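To illustrate what I mean: per the STAC best practices, an item may set datetime to null as long as start_datetime and end_datetime are provided. So the VPC item properties could look something like this (a hand-written illustration, the dates are made up; not actual lasR output):

```json
{
  "properties": {
    "datetime": null,
    "start_datetime": "2023-07-21T09:12:00Z",
    "end_datetime": "2023-07-21T11:45:00Z"
  }
}
```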

Jean-Romain commented 1 month ago

So right now datetime is queried from File Creation Day of Year and File Creation Year?

Absolutely

While this is also done in PDAL, I think it is not really correct,

The PDAL team created the VPC format. PDAL does it this way and I'm following PDAL. This is the best I can do and I'm not going to interpret it differently than the team who created the format.

which I learned is what File Creation Day means).

In theory. In practice, nobody knows what this date is. For example, lidR, lasR and LAStools do not modify the date in the header when processing a file. So the date stays unchanged after processing and does not correspond to the creation of the file, but rather to the creation of the original file (if the data provider respected the specification).

I guess its not possible to query timestamps from individual points here, because then the entire file has to be read?

Absolutely, it would require hours to create a VPC (for large collections). Moreover, some file formats do not have gpstime (but they are rare nowadays).

it might be a good option to populate start_datetime and end_datetime

This is a very good idea. However, on my side, I'm not sure I know how to interpret the gpstime field properly. Let's look at a file:

File signature:           LASF 
File source ID:           0 
Global encoding:
 - GPS Time Type: GPS Week Time 
[...]
File creation d/y:        202/2023
[...]

> range(las$gpstime)
[1] 304949740.259605 304951756.222482

How am I supposed to transform [304949740.259605, 304951756.222482] into dates, knowing that these numbers correspond to GPS Week Time? I read about GPS time definitions; as far as I can tell, it is not straightforward.
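One thing I notice, though: a GPS week only contains 604800 seconds, so values like 304949740 cannot literally be seconds-of-week, and the Global Encoding flag may simply be wrong in this file. Here is a quick sketch in Python (the arithmetic is language-agnostic; the "adjusted" interpretation and the 18 s leap-second offset are my assumptions) of what the conversion would give if the values were in fact Adjusted Standard GPS Time:

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
SECONDS_PER_WEEK = 7 * 24 * 3600   # 604800
LEAP_SECONDS = 18                  # GPS-UTC offset since 2017 (assumption)

def gpstime_to_utc(t, adjusted=True):
    """Interpret a LAS gpstime value as (Adjusted) Standard GPS Time.

    Standard GPS Time counts seconds since the GPS epoch (1980-01-06);
    the LAS 'adjusted' variant subtracts 1e9 to preserve double precision.
    """
    seconds = t + 1e9 if adjusted else t
    return GPS_EPOCH + timedelta(seconds=seconds - LEAP_SECONDS)

t = 304949740.259605
print(t < SECONDS_PER_WEEK)   # False: cannot be a seconds-of-week value
print(gpstime_to_utc(t))      # lands in May 2021 if read as adjusted time
```

If that interpretation were right, this file would have been flown in 2021 and processed in 2023 (d/y 202/2023), which is exactly the kind of discrepancy discussed above. But that is speculation on my part.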

wiesehahn commented 5 days ago

I just came back to this, as retrieving the acquisition date from a LAS file is something I would often need. The problem is that there is no field in the LAS header for the data acquisition datetime. I don't know why this is, or who decided that file creation day and year should be populated with the processing date (it makes no sense to me). What we get from providers is some stupid shapefile with dates which are meant to show acquisition dates by region; however, merging these with the point cloud tiles is extra labour, and these files are often out of date or erroneous. So the only real option I see is to use the gpstime attribute of the point data records.

Converting gpstime to datetime is not trivial (or even possible) with LAS files in the LAS 1.0 - 1.2 specifications, where gpstime was populated with GPS seconds-of-week (the case you mentioned above). In that case you also need the acquisition week to convert to a datetime.

However, with newer specifications, LAS 1.3 onward (https://www.asprs.org/wp-content/uploads/2010/12/LAS_1_4_r13.pdf), standard GPS time (minus 1e9, i.e. Adjusted Standard GPS Time) should be used, in which case Global Encoding bit 0 is set. This can be converted. Although I am not sure about the time of day yet due to possible time zone issues, I think something like the following should work. Some info I came across is in https://groups.google.com/g/lastools/c/_9TxnjoghGM.

get_daterange = function(f){
  # is (adjusted) standard GPS time used? (Global Encoding bit 0)
  if (f@header$`Global Encoding`$`GPS Time Type`) {
    gps_epoch <- as.POSIXct("1980-01-06 00:00:00", tz = "UTC")
    # +1e9 undoes the "adjusted" offset; the GPS-UTC leap-second offset (18 s) is ignored here
    date_time_min <- gps_epoch + min(f@data$gpstime[f@data$gpstime > 0]) + 1e9
    date_time_max <- gps_epoch + max(f@data$gpstime) + 1e9
  } else {
    # GPS seconds-of-week: cannot be converted without knowing the acquisition week
    date_time_min <- date_time_max <- NA
  }
  return(c(date_time_min, date_time_max))
}
Jean-Romain commented 5 days ago

Everything you wrote is correct. Let me know if I can help. But remember what I told you by email: I'm no longer working at the university, and lidR and lasR are no longer supported by my university. In order to be sustainable, I'm now self-employed, selling services and consulting. If you want me to develop custom tools for you, this is the kind of service I'm offering. You can contact me at info@r-lidar.com (r-lidar.com is upcoming) or keep contacting me via GitHub.

wiesehahn commented 4 days ago

I didn't know you were no longer working there at all. I will see if there is some possibility to fund this project. Unfortunately, buying software would be much easier for us (a public institution) than donating/funding, I think.

Just to write down my thoughts here: I think it would be really nice to get the date range or some approximate datetime when the data was acquired, as the data in the header is more or less useless if the data was processed several months after acquisition. And especially with large and/or multitemporal data collections, the time might be essential for analysis. If we can't store this in the LAS header, adding this to the VPC would probably be the next best solution (appending the date to the filename is also done sometimes).

While my solution above works, it is quite time consuming, as the entire point data has to be read. Do you think this could be achieved a lot faster? I mean, calculating min/max via C++ will be faster, but I think this is not the bottleneck relative to the time it takes to read the file. For most use cases I guess it would be sufficient to get the day (without hh:mm:ss), for which we would just need one point in the ~99% of cases where all data within a tile is from the same campaign. But I guess it is not possible to read only part of a LAZ file, is it?

Jean-Romain commented 4 days ago

Unfortunately buying software would be much easier for us (public institution) than donating/funding I think.

No donation or funding. Just a regular contract between your institution and me as a company to get support of any kind (training courses, development of custom tools, consulting). If you want to talk about that later, reach me at info@r-lidar.com and I'll give you more details.

I think it would be really nice to get the date range or some approximate datetime when the data was acquired, as the data in the header is more or less useless if data was processed several months after acquisition.

100% agree. However what you are expecting is something not broadly useful for most people. It seems likes a feature useful for people dealing with large dataset in public institutions. It is a lot of work (mainly because dealing with datetime and gpstime conversion will be tedious) for limited usage.

If we can't store this in the LAS header, adding this to the VPC would probably be the next best solution

100% agree. We can do both, but to store that in the header we would need to rewrite the files (a long process).

While my solution above works, it is quite time consuming, as the entire point data has to be read. Do you think this could be achieved a lot faster? I mean, calculating min/max via C++ will be faster, but I think this is not the bottleneck relative to the time it takes to read the file.

You are 100% correct. Since the only way to find the dates is to read gpstime, we must read the files, and this is the bottleneck. But the process can be streamed without loading anything into memory, which is faster. And you only do it once. I would not be too concerned about the computation time.

For most use cases I guess it would be sufficient to get the day (without hh:mm:ss), for which we would just need one point in the ~99% of cases where all data within a tile is from the same campaign. But I guess it is not possible to read only part of a LAZ file, is it?

Yes, of course we could read only one point, or one point out of every 100. This would not be the default behavior because of the 1% of cases, but if you are confident that a single point is enough for your needs, this becomes an almost instantaneous operation.