laspulse assigns multiple pulses the same pulseID

nkruskamp commented 7 years ago

Thanks so much for developing lidR. It's been a great help processing data. I came across an issue when comparing my own measure of pulse density to the one produced by lidR.

My assumption is that each pulse cannot have more than one first return. Therefore, the number of pulses should equal the total number of first returns. Given that, it appears that laspulse will assign the same pulseID to multiple pulses when the GPS times are not sufficiently unique to identify the true pulse count. This leads to an underestimate of pulse count and incorrect calculation of pulse density.

Thank you for looking into this!

Jean-Romain commented 7 years ago

Hi, thank you for your bug report. Could you quantify the error? How many pulses detected for how many first returns? We already discussed the question of the use of GPS time in a former issue #32 and I'm petty confident on the method. So I'm wondering if the problem can come from the edge of the data where pulse can be badly detected or if it can come from the fact your GPS times are not enough accurate.

At first sight I think you're right and the second hypothesis seems more likely to be true. A former algorithm was based on the return numbers (see #32) but this method lead to wrong attributions on the edge of the data. As I said the method based on GPS time is more robust and it is strange you encountered errors. Would you be able to share a reproducible example with your data to investigate further?

nkruskamp commented 7 years ago

Thank you for the response.

Here is a test tile from the USGS: ftp://rockyftp.cr.usgs.gov/vdelivery/Datasets/Staged/Elevation/LPC/Projects/USGS_LPC_VA_Shenandoah_2013_LAS_2015/las/tiled/USGS_LPC_VA_Shenandoah_2013_DO_N16_4873_20_LAS_2015.zip

With the above example:

with overlap, I get a total of 11882723 first returns and 5963075 pulses with densities of 0.47 and 0.24 pulses/m^2 respectively.
with overlap removed, I get 8165625 first returns and 2391861 pulses with densities of 0.19 and .09 pulses/m^2 respectively.

I looked at the pulse numbers from the lidR summary, and also compared the first return numbers to a raw input data.table of the points using rLiDAR.

It does appear the GPS times are identical for first returns assigned identical pulseIDs. Since this is coming from the vendor, I don't know if I could work around identical GPS times.

Jean-Romain commented 7 years ago

Well your data are wrong in a way or another. The gpstime seems accurate but the relationship between gpstime, ReturnNumber and NumberOfReturns ìs definitively incorrect.

This is a short subset of your data. Look, line 1 the ReturnNumber is 1 and the NumberOfReturn is 1 as well. It makes sense. But at the line number 2, something seems wrong. The ReturnNumber is 2 so it means this is a second return. So the pulse has at least two returns. So the NumberOfReturn should be at least 2 on line 1. It isn't. Moreover the gps times are different (at 10⁻⁷)

          X       Y       Z  gpstime Intensity ReturnNumber NumberOfReturns Classification ScanAngle pulseID
1: 11471934 6836707 1531.60 80765230        52            1               1             18       -21 1718860
2: 11471931 6836708 1531.31 80765230        48            2               2             18       -21 1718861
3: 11471929 6836709 1530.88 80765230        38            1               1             18       -21 1718862
4: 11471926 6836710 1530.63 80765230        36            2               2             18       -21 1718863
5: 11471923 6836711 1530.36 80765230        22            2               2             18       -21 1718864
6: 11471920 6836712 1529.99 80765230        28            2               2             18       -21 1718865

Here how it looks like in a file where everything is correct :

           X       Y     Z  gpstime Intensity ReturnNumber NumberOfReturns Classification ScanAngle pulseID
33: 684990.7 5018001 18.61 483825.9        34            1               1              1         5      28
34: 684990.3 5018001 19.44 483825.9        25            1               1              1         5      29
35: 684989.9 5018002 18.35 483825.9        33            1               2              1         5      30
36: 684990.0 5018003 15.15 483825.9         7            2               2              1         5      30
37: 684989.5 5018003 18.46 483825.9        25            1               2              1         5      31
38: 684989.6 5018004 14.31 483825.9        12            2               2              1         5      31
39: 684989.1 5018004 18.60 483825.9        31            1               2              1         5      32
40: 684989.1 5018004 14.99 483825.9        13            2               2              1         5      32
41: 684988.7 5018005 16.98 483825.9        27            1               1              1         5      33
42: 684988.2 5018006 18.05 483825.9        44            1               1              1         5      34
43: 684987.8 5018007 17.46 483825.9        25            1               1              1         5      35
...
57: 684990.6 5017999 16.58 483825.9        35            1               3              1         5      46
58: 684990.7 5017999 11.53 483825.9         6            2               3              1         5      46
59: 684990.9 5018001  0.00 483825.9         2            3               3              2         5      46

So what is wrong? I don't have a clue. The dataset is ordered by gps time. If the gpstime is corrupted, then, everything becomes wrong as well. If the gpstime is not corrupted then it means the ReturnNumber and the NumberOfReturns are all corrupted and they don't have a meaning.

To investigate further you should try to figure out who provided the dataset (file was generated with GeoCue) and what kind of manipulation has been done on it. I suspect a decimation of the point cloud but I can't understand how it can affect the correctness of the gps time pattern.

nkruskamp commented 7 years ago

Thanks for looking at this data and catching the abnormality. I will try to track down the original vendor for this acquisition. The other datasets I used to examine the issue were from the same collection so not surprisingly, I saw the same issue repeated in those tiles. There is a way to repair the number of returns assuming the return number and gps time is correct. I will try to repair the file and see if that resolves the issue.

Thanks again. I really enjoyed your recent paper on pulse density and footprint size. It is already helping me refine my own metrics.

Jean-Romain commented 7 years ago

There is a way to repair the number of returns assuming the return number and gps time is correct.

I think it is a bad idea to assume that something is correct until you do not figure out how the point cloud has been processed.

Good luck with your data. Let me know if you figure out what is the problem. I'm interested.

nkruskamp commented 7 years ago

You're right - I have to figure out what is correct and what isn't first. I'm just hoping that repairing those values is the simplest solution.

nkruskamp commented 7 years ago

I called the vendor to get some help with this issue. They explained that the sensor used (Leica ALS 70) is a two channel system - Therefore two pulses can (and likely will) have identical GPS times. The channel is identified in the UserData of the LAS file. This is good to know from a data analysis stand-point, as it might require additional pre-processing before identifying pulses.

Jean-Romain commented 7 years ago

Hi, sorry for the delay I was in vacation.

I looked at your data and the UserData field contains 0 for each point. So I don't know how to help you. If I was able to retrieve the channel information I would be able to provide you a script designed for your specific case.

nkruskamp commented 7 years ago

Thanks again - I'm not sure why that information is missing for you. My file does have the UserData column with values of 1 or 2 for each point. Thank you for the offer to develop for this special case, I can work around this for now fairly easily.

Are you supporting the newer Point Data Record Formats?

It looks like scanner channel was added with Point Data Record Format 6 and up, so for 5 and below that information had to be put somewhere else (such as the UserData). It might remain up to the user to be aware that if the data were collected by a multichannel system and the record format is 5 or below, they should make sure the results are correct. The newer record formats would make this easier from the developer side to check that scanner channel data first.

Jean-Romain commented 7 years ago

The rlas package is the backend which enable to read and write las and laz file in lidR. It is basically a wrapper over the LASlib library. If LASlib support newer point data records formats, then I can do it as well. But I disabled some features for some file types I never encountered to ensure to do no create unexpected bugs. If you have some non supported files, just share a file and I will enable the support of the format.

But I'm wondering why you have non-zero values in UserData and I don't. I will investigate tomorrow.

r-lidar / lidR

laspulse assigns multiple pulses the same pulseID #62