opencitymodel / data-pipeline

Open City Model data pipeline
MIT License

verify CityGML files after they are created #26

Open agilliland opened 5 years ago

agilliland commented 5 years ago

When generating CityGML files, we should add a quick verification step that opens each newly written file and attempts to verify it before moving on to the next one, since files can sometimes end up corrupted.
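A minimal sketch of such a check, assuming that XML well-formedness is enough to catch a file that was truncated mid-write. It streams the file through the JDK's SAX parser, so large files are not loaded into memory. The class name `CityGmlFileCheck` and its methods are hypothetical and not part of the pipeline, and the check does not validate against the CityGML schema.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks well-formedness only, not schema validity.
final class CityGmlFileCheck {

    /** Returns true if the stream parses as well-formed XML from start to end. */
    static boolean isWellFormed(InputStream in) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores all content and rethrows fatal parse errors,
            // so a file cut off inside an element ends up in the catch block.
            factory.newSAXParser().parse(in, new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    /** Convenience overload for a file on disk; unreadable files count as invalid. */
    static boolean isWellFormed(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return isWellFormed(in);
        } catch (IOException e) {
            return false;
        }
    }
}
```

The generator could call `CityGmlFileCheck.isWellFormed(path)` right after closing each output file and fail the run if it returns false.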

clausnagel commented 5 years ago

Just curious, why are the files sometimes corrupted?

agilliland commented 5 years ago

We saw some corrupted files caused by the JVM failing with OutOfMemory exceptions. In those cases the writer stopped in the middle of a city object, so the CityGML file was no longer valid XML.

Unfortunately, the actual bug was that the orchestration scripts running the CityGML generation weren't properly detecting that the JVM process had failed, so they continued to push the corrupted files along the pipeline.

I've fixed the orchestration script so it properly detects errors and stops the pipeline at that point, but this issue is here as a reminder to add in even more verification.
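For illustration, detecting a failed generator step comes down to checking the exit code of the child process. The sketch below shows the idea in Java with `ProcessBuilder`. The actual orchestration scripts may well be written in another language, and `StepRunner` and its method names are hypothetical. It assumes the generator exits with a non-zero code on failure, which is what a JVM normally does when its main thread dies from an uncaught error.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical orchestration helper, not taken from the pipeline's scripts.
final class StepRunner {

    /** Runs the command, waits for it to finish, and returns its exit code. */
    static int exitCode(List<String> command) {
        try {
            // inheritIO() forwards the child's stdout/stderr to this process's log.
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + command, e);
        }
    }

    /** Throws if the step exits with a non-zero code, so later steps never run. */
    static void runOrAbort(List<String> command) {
        int code = exitCode(command);
        if (code != 0) {
            throw new IllegalStateException("step failed with exit code " + code + ": " + command);
        }
    }
}
```

Calling `runOrAbort` for each pipeline step means a crashed generator stops the run before its partial output is passed downstream.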

clausnagel commented 5 years ago

Thanks for explaining. I saw that you are using citygml4j and thought it might be causing the issue. In that case I might have been able to help solve it.

clausnagel commented 5 years ago

Checked your CitygmlBuilder and saw that you are creating up to 40k buildings in main memory which are then written to a file. 40k LOD1 buildings should not occupy much main memory though.

citygml4j supports writing chunks. You could create a building, send it to the file and remove it afterwards. This helps to keep the memory footprint low and works for both GML and CityJSON. However, in CityJSON, city objects are not self-contained due to global properties like the "vertices" array. So, even when writing buildings chunk-wise, some information must be kept in main memory until the CityJSON file is closed. Nevertheless, the memory footprint should still be substantially lower compared to creating all buildings in main memory.

Using chunk-wise writing would require only a few changes to your code.
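To make the chunk-wise pattern concrete, here is a rough sketch that deliberately avoids citygml4j's own API and uses only the JDK's StAX writer. Each building is written and flushed as soon as it is produced, so no list of buildings is held in memory. The `Building` record, the class name `ChunkedBuildingWriter`, and the simplified element names are all assumptions for illustration; the output is not schema-valid CityGML.

```java
import java.io.OutputStream;
import java.util.Iterator;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Illustration of the pattern only; not citygml4j and not schema-valid CityGML.
final class ChunkedBuildingWriter {

    /** Hypothetical, simplified building record. */
    static final class Building {
        final String id;
        final double height;

        Building(String id, double height) {
            this.id = id;
            this.height = height;
        }
    }

    /** Writes buildings one at a time and returns how many were written. */
    static int writeAll(Iterator<Building> buildings, OutputStream out) {
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("CityModel");
            int count = 0;
            while (buildings.hasNext()) {
                // Only the current building is referenced here; once written,
                // it can be garbage collected before the next one is created.
                Building b = buildings.next();
                w.writeStartElement("cityObjectMember");
                w.writeStartElement("Building");
                w.writeAttribute("id", b.id);
                w.writeStartElement("measuredHeight");
                w.writeCharacters(Double.toString(b.height));
                w.writeEndElement(); // measuredHeight
                w.writeEndElement(); // Building
                w.writeEndElement(); // cityObjectMember
                w.flush();           // push this chunk to the output stream
                count++;
            }
            w.writeEndElement();     // CityModel
            w.writeEndDocument();
            w.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("failed to write buildings", e);
        }
    }
}
```

Taking an `Iterator` rather than a `List` is the point of the design: the caller can generate buildings lazily from the source data instead of building all 40k up front.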

agilliland commented 5 years ago

Yeah, that would be really helpful! I've definitely seen the CityJSON creation take up much more memory than I had expected and given the way the vertices are managed that makes sense, but anything to keep the memory footprint down a bit more would be welcome.

clausnagel commented 5 years ago

I'm happy to work on an example of what it could look like.