marook / osm-read

an openstreetmap XML and PBF data parser for node.js and the browser
GNU Lesser General Public License v3.0

Why not store IDs as BigInt? #57

Open metabench opened 1 year ago

metabench commented 1 year ago

Once I obtain the records containing string ids (including the referenced nodes in the ways) I create new BigInt objects to replace their string representations.
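For illustration, the post-processing step described above might look roughly like the following. This is only a sketch: the helper name `wayIdsToBigInt` is hypothetical, and the record shape (a string `id` plus an array of string `nodeRefs`) is an assumption about what the parser hands back, not something confirmed in this thread.

```javascript
// Hypothetical helper: returns a copy of a way record with its string ids
// replaced by BigInt values. The field names `id` and `nodeRefs` are assumed.
function wayIdsToBigInt(way) {
  return {
    ...way,
    id: BigInt(way.id),
    nodeRefs: way.nodeRefs.map((ref) => BigInt(ref)),
  };
}

// Made-up sample record; the ids are illustrative, not real OSM data.
const sampleWay = {
  id: '4294967296123',
  nodeRefs: ['9007199254740993', '42'],
  tags: {},
};

const convertedWay = wayIdsToBigInt(sampleWay);
console.log(typeof convertedWay.id, convertedWay.nodeRefs.length);
```

Returning a copy rather than mutating the record keeps the original string ids available to any other consumer of the same object.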

Has any consideration been made of parsing them into BigInt values within osm-read?

Perhaps it would be a useful option to have if it were not to be done by default. Making it an option would avoid breaking changes for those who expect string values.

marook commented 1 year ago

I assume you create the BigInt by invoking it using the string id? For example: BigInt(id)

If this is the case, I'm not sure that adding this behavior as a feature flag to osm-read is worth the effort. People who need the id in a numeric representation can easily convert it themselves.

Are there any more benefits of parsing the id within osm-read which I have missed @metabench ?

metabench commented 1 year ago

The earlier an id is represented as a BigInt, the less time strings longer than 8 bytes need to be kept in memory. It's not a big efficiency difference.

Getting the data from osm-read in the most appropriate type is the largest advantage as far as I can tell. It would make programming against it easier and maybe a bit more performant.

metabench commented 1 year ago

There would likely be less processing to do between the data that's stored in the protobuf and having usable output if it were parsed as BigInt. I don't know whether or not there is anything in the osm-read codebase that would make it difficult to do, such as relying on a schema or dependency which already parses them into strings.

metabench commented 1 year ago

Looking at various TODOs such as https://github.com/marook/osm-read/blob/411aba24bc0e413d29d60e0249453c11ff1b8a52/lib/pbfParser.js#L335

There is no problem with integers of the size we get in OSM PBF files, such as for high node IDs. File positions beyond 2^32 are also fine.

"The Number.MAX_SAFE_INTEGER constant represents the maximum safe integer in JavaScript (2^53 – 1)." - MDN Web Docs.
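A quick way to check this in node.js or a browser console is sketched below. The node id and file offset are made-up illustrative values, not figures taken from a real PBF file.

```javascript
// The safe-integer limit for plain Numbers is 2^53 - 1.
const maxSafe = Number.MAX_SAFE_INTEGER;
const limitMatches = maxSafe === 2 ** 53 - 1;

// Illustrative values only: a large node id and a file offset past 2^32.
const exampleNodeId = 11000000000;
const exampleFileOffset = 2 ** 32 + 1024;

const nodeIdIsSafe = Number.isSafeInteger(exampleNodeId);
const offsetIsSafe = Number.isSafeInteger(exampleFileOffset);

// Past the limit, neighbouring integers collapse to the same Number.
const precisionLost = 2 ** 53 === 2 ** 53 + 1;

console.log(limitMatches, nodeIdIsSafe, offsetIsSafe, precisionLost);
```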

It's worth noting that the bits beyond the low 32 are lost when applying bitwise operations such as '>>>' to a Number.
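The truncation can be seen with a small value just above 2^32. The sketch below also shows two ways to avoid it: plain arithmetic on Numbers, or BigInt shifts. The variable names are only illustrative.

```javascript
// A value that needs more than 32 bits.
const wide = 2 ** 32 + 5; // 4294967301

// Bitwise operators first convert the Number to a 32-bit integer,
// so the high part is silently dropped.
const truncated = wide >>> 0;

// Arithmetic alternatives keep full precision up to 2^53 - 1.
const highPart = Math.floor(wide / 2 ** 32);
const lowPart = wide % 2 ** 32;

// BigInt shifts are not limited to 32 bits.
const highViaBigInt = BigInt(wide) >> 32n;

console.log(truncated, highPart, lowPart, highViaBigInt);
```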

When representing these numbers in a TypedArray, 64 bit integer types should be used (signed or unsigned will work, but I go for unsigned when I am only supporting unsigned numbers).
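As a rough sketch of the TypedArray point, the snippet below stores a few ids in a `BigUint64Array`. One detail to be aware of: the 64-bit typed arrays hold BigInt elements, so assigning a plain Number throws a `TypeError`. The id values are made up for illustration.

```javascript
// Each element occupies 8 bytes and is read and written as a BigInt.
const ids = new BigUint64Array(3);
ids[0] = 1n;
ids[1] = BigInt('9007199254740993'); // above Number.MAX_SAFE_INTEGER
ids[2] = BigInt(2 ** 32 + 5);        // converting a safe Number explicitly

// Assigning a Number directly is rejected rather than silently converted.
let numberAssignmentThrew = false;
try {
  ids[0] = 5;
} catch (err) {
  numberAssignmentThrew = err instanceof TypeError;
}

console.log(ids.BYTES_PER_ELEMENT, ids[1], numberAssignmentThrew);
```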