potree / PotreeConverter

Create multi-res point clouds to use with potree
http://potree.org
BSD 2-Clause "Simplified" License

Conversion stalls - throughput: -nanMPs #495

Open · ntr-808 opened this issue 3 years ago

ntr-808 commented 3 years ago

Some LAZ files seem to stall the conversion process, with the converter reporting nanMPs and one CPU running at 100%. I will try to find a small dataset that replicates this behaviour for testing.

The scale and offset warning seems a bit suspicious...

Latest develop (da93ec2)

-> ~/ptcv/build/PotreeConverter snip.laz 
#threads: 8
#paths: 1
WARNING: scale/offset/bounding box were adjusted. new scale: 0.001, 0.006477730174199732, 0.001, new offset: 6.9529870296860443e-310, 6.9529870296860443e-310, -153.52515684747772

output attributes: 
name                              offset    size
================================================
position                               0      12
intensity                             12       2
return number                         14       1
number of returns                     15       1
classification                        16       1
scan angle rank                       17       1
user data                             18       1
point source id                       19       2
gps-time                              21       8
Ring                                  29       1
Range                                 30       4
================================================
                                              34
================================================
cubicAABB: {
        "min": [0.000000, 0.000000, -153.525157],
        "max": [6955409.812625, 6955409.812625, 6955256.287468],
        "size": [6955409.812625, 6955409.812625, 6955409.812625]
}
#points: 19'344'113
total file size: 185.9 MB
target directory: 'snip.laz_converted'
maxPointsPerChunk: 967205

=======================================
=== COUNTING                           
=======================================
tStartTaskAssembly: 0.000040s
[14%, 1s], [COUNTING: 41%, duration: 1s, throughput: 9MPs][RAM: 1.0GB (highest 1.0GB), CPU: 99%]
countPointsInCells: 1.699138s
finished counting in 2s
=======================================
createLUT: 0.025644s

=======================================
=== CREATING CHUNKS                    
=======================================
distributePoints0: 0.000105s
distributePoints1: 0.000110s
WARNING: scale/offset/bounding box were adjusted. new scale: 0.001, 0.006477730174199732, 0.001, new offset: 6.9529870296860443e-310, 6.9529870296860443e-310, -153.52515684747772
[33%, 2s], [DISTRIBUTING: 0%, duration: 0s, throughput: -nanMPs][RAM: 1.1GB (highest 1.1GB), CPU: 79%]
[49%, 3s], [DISTRIBUTING: 47%, duration: 1s, throughput: 7MPs][RAM: 1.6GB (highest 1.7GB), CPU: 100%]
finished creating chunks in 2s
=======================================

=======================================
=== INDEXING                           
=======================================
[67%, 4s], [INDEXING: 0%, duration: 0s, throughput: -nanMPs][RAM: 1.9GB (highest 1.9GB), CPU: 33%]
[67%, 5s], [INDEXING: 0%, duration: 0s, throughput: -nanMPs][RAM: 2.5GB (highest 2.5GB), CPU: 34%]
[67%, 6s], [INDEXING: 0%, duration: 0s, throughput: -nanMPs][RAM: 2.5GB (highest 2.5GB), CPU: 35%]
[67%, 7s], [INDEXING: 0%, duration: 0s, throughput: -nanMPs][RAM: 2.7GB (highest 2.7GB), CPU: 32%]
[67%, 8s], [INDEXING: 0%, duration: 0s, throughput: -nanMPs][RAM: 2.8GB (highest 2.8GB), CPU: 31%]
...
midnight-dev commented 3 years ago

Thanks for putting the output in a code block. Much easier to read. 👌

Yes, that is very suspicious. For the bounding box given, there is no reason to adjust the offset to such extreme precision. A value like 6.95e-310 (300+ digits when written out in full) is not what I'd consider a typical number.

Either the converter's internal state isn't incrementing the elapsed time, or it's somehow getting polluted with bad values. That's the reason you're seeing -nanMPs for the throughput: it's simple arithmetic, the number of points processed divided by the time spent on a specific stage. There could/should be a safeguard in place for that output. Right now it assumes a non-zero amount of time has passed before it prints the status message. Since the stage starts fast (possibly from tripping over the weird adjustment), there's a division by zero that leads to Not-a-Number results. You can see this during the chunking stage, where the first line has the NaN throughput (with "MPs" appended) while the next line shows a real number once a little time has passed.
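
Not the converter's actual code, but a minimal sketch of that arithmetic and of the kind of safeguard being suggested (the function names here are made up for illustration):

```cpp
#include <cstdint>
#include <iostream>

// Unguarded: when both counters are still zero at the first status print,
// 0.0 / 0.0 evaluates to NaN, which some platforms render as "-nan".
double throughputMPs(uint64_t pointsProcessed, double elapsedSeconds) {
    return (pointsProcessed / 1'000'000.0) / elapsedSeconds;
}

// Guarded: report 0 MPs until a measurable amount of time has passed.
double throughputMPsGuarded(uint64_t pointsProcessed, double elapsedSeconds) {
    if (elapsedSeconds <= 0.0) {
        return 0.0;
    }
    return (pointsProcessed / 1'000'000.0) / elapsedSeconds;
}

int main() {
    std::cout << throughputMPs(0, 0.0) << "MPs\n";        // prints nanMPs or -nanMPs
    std::cout << throughputMPsGuarded(0, 0.0) << "MPs\n"; // prints 0MPs
}
```

With a guard like that in place, the first status line of a stage would print 0MPs instead of -nanMPs.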

I have a few questions.

If there is no change in behavior with a small cropped version, please share the smallest copy via a cloud storage provider like Google Drive, Dropbox, or something similar. If this requires debugging, a few dozen to a couple hundred points would be better than millions of points (at least for me). Stepping through the converter logic to find problems is a slow process, even more so when there are many points.

ntr-808 commented 3 years ago

Thank you for the explanation.

In trying to create a minimal reproducible cloud, I have:

All 3 resulted in a successful conversion afterwards, so my hunch is that something in the header is throwing the converter off while it works through the cloud. We often use this exact technique of running a cloud unmodified through PDAL to rebuild the header when the converter encounters points outside the bounding box and throws an error (but that is another story...).
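
(For reference, the PDAL pass-through mentioned above can be as simple as a plain translate with no filters; the filenames below are placeholders. PDAL reads every point and writes a new file with a freshly computed header.)

```
pdal translate input.laz rebuilt.laz
```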

Unfortunately I cannot currently give you the entire cloud to debug with; I will try to get it cleared soon.

In the meantime, here is the PDAL metadata JSON of the file in question. Interestingly, it also reports min_x and min_y with improbable precision:

https://gist.github.com/ntr-808/9357e0764bdb2b28ea473d7a4fd9a4ca

Hutch07 commented 3 years ago

I also get a stall, with no error:


#paths: 30

output attributes:
name                              offset    size
================================================
position                               0      12
intensity                             12       2
return number                         14       1
number of returns                     15       1
classification                        16       1
scan angle rank                       17       1
user data                             18       1
point source id                       19       2
rgb                                   21       6
================================================
                                              27
================================================
cubicAABB: {
        "min": [1379762.717000, 14384893.002000, 14.307000],
        "max": [1466121.131000, 14471251.416000, 86372.721000],
        "size": [86358.414000, 86358.414000, 86358.414000]
}
#points: 72'536'319
total file size: 1.8 GB
target directory: 'D:\01_SSI_Projects\2019_Projects\19730_Amtrak\Test_Sliced_Traj\2nd_Traj\0\Computer_08\Computer_01\rail_converted_1'
maxPointsPerChunk: 3626815

=======================================
=== COUNTING
=======================================
tStartTaskAssembly: 0.003977s
countPointsInCells: 0.986806s
finished counting in 1s
=======================================
[33%, 1s], [COUNTING: 100%, duration: 1s, throughput: 74MPs][RAM: 0.0GB (highest 0.4GB), CPU: 80%]
createLUT: 0.024241s

=======================================
=== CREATING CHUNKS
=======================================
distributePoints0: 0.000297s
distributePoints1: 0.000348s
[45%, 2s], [DISTRIBUTING: 36%, duration: 1s, throughput: 27MPs][RAM: 0.6GB (highest 0.6GB), CPU: 96%]
[61%, 3s], [DISTRIBUTING: 84%, duration: 2s, throughput: 31MPs][RAM: 0.5GB (highest 0.6GB), CPU: 98%]
[67%, 4s], [DISTRIBUTING: 100%, duration: 2s, throughput: 33MPs][RAM: 0.3GB (highest 0.6GB), CPU: 9%]
finished creating chunks in 4s
=======================================

=======================================
=== INDEXING
=======================================
[67%, 5s], [INDEXING: 0%, duration: 0s, throughput: -na'n(i'nd)MPs][RAM: 1.0GB (highest 1.0GB), CPU: 84%]
[69%, 6s], [INDEXING: 6%, duration: 1s, throughput: 4MPs][RAM: 2.7GB (highest 2.8GB), CPU: 95%]
[69%, 7s], [INDEXING: 6%, duration: 1s, throughput: 4MPs][RAM: 2.5GB (highest 2.9GB), CPU: 97%]
[75%, 8s], [INDEXING: 24%, duration: 2s, throughput: 8MPs][RAM: 1.1GB (highest 2.9GB), CPU: 70%]
[96%, 9s], [INDEXING: 87%, duration: 3s, throughput: 18MPs][RAM: 0.4GB (highest 2.9GB), CPU: 23%]
[100%, 10s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 8%]
[100%, 11s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
[100%, 12s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.1GB (highest 2.9GB), CPU: 6%]
[100%, 13s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
[100%, 14s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
[100%, 15s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
[100%, 16s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
[100%, 17s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
[100%, 18s], [INDEXING: 100%, duration: 4s, throughput: 16MPs][RAM: 0.2GB (highest 2.9GB), CPU: 6%]
midnight-dev commented 3 years ago

When this stall occurs, what files do you see in the output directory? For instance, chunks, metadata.json, hierarchy.bin, etc.

Hutch07 commented 3 years ago

> When this stall occurs, what files do you see in the output directory? For instance, chunks, metadata.json, hierarchy.bin, etc.

There is a subdirectory, chunks, which contains the file metadata.json. There is also a large file, octree.bin, as well as the file tmpChunkRoots.bin.

I ran a different data set and had the same issue. I saw the no-indexing option and tried it; the program did finish, but only with a "chunks" directory containing r000000.bin and other files.

metadata.zip

m-schuetz commented 3 years ago

If it stalls at 100% indexing, then it might be because it takes a long, long time to finalize the last chunk. This can happen if there is one chunk that is much, much larger than all the other ones. That often happens if, for example, 90% of the data is concentrated in 1% of the volume, i.e., if the bounding box doesn't really represent the actual content well, which can happen if there are some outlier points.
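
To make that concrete, here is a small self-contained sketch on synthetic data (not PotreeConverter code): a single outlier inflates the cubic bounding box so that essentially all of the real points land in the same top-level octree cell, leaving one oversized chunk to finalize.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <map>
#include <random>
#include <vector>

int main() {
    // Synthetic cloud: 100k points inside a ~100 m cube, plus one far outlier.
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> d(0.0, 100.0);
    std::vector<std::array<double, 3>> points;
    for (int i = 0; i < 100000; i++) points.push_back({d(rng), d(rng), d(rng)});
    points.push_back({1e6, 1e6, 1e6}); // the outlier stretches the AABB

    // Cubic AABB from min/max, analogous to the cubicAABB in the logs above.
    std::array<double, 3> lo = points[0], hi = points[0];
    for (auto& p : points)
        for (int k = 0; k < 3; k++) {
            lo[k] = std::min(lo[k], p[k]);
            hi[k] = std::max(hi[k], p[k]);
        }
    double size = std::max({hi[0] - lo[0], hi[1] - lo[1], hi[2] - lo[2]});

    // Count how many points fall into each of the 8 top-level octree cells.
    std::map<int, size_t> cellCounts;
    for (auto& p : points) {
        int ix = std::min(1, int(2.0 * (p[0] - lo[0]) / size));
        int iy = std::min(1, int(2.0 * (p[1] - lo[1]) / size));
        int iz = std::min(1, int(2.0 * (p[2] - lo[2]) / size));
        cellCounts[ix * 4 + iy * 2 + iz]++;
    }
    for (auto& [cell, count] : cellCounts)
        std::printf("cell %d: %zu points\n", cell, count);
    // Output: cell 0 holds 100000 points, cell 7 holds the single outlier,
    // so nearly 100% of the indexing work ends up in one chunk.
}
```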

Hutch07 commented 3 years ago

I removed the last 4 files from the group of 24 and PotreeConverter completed. I thought there might be an outlier or two in the data, so I merged the 4 files, made sure they opened in CloudCompare, and looked at the report. Things look OK with the files. I tried to run PotreeConverter on the joined/merged las file and it still stalls. Is there any way to troubleshoot what it is stalling on? Rail_info.zip