marook / osm-read

an openstreetmap XML and PBF data parser for node.js and the browser
GNU Lesser General Public License v3.0

Would exposing an async iterator API suit this library? #56

Open metabench opened 1 year ago

metabench commented 1 year ago

I have used the random access features in osm-read to build a different way of iterating through the records in an osm.pbf file. The problem I was having before was that reading the OSM data was faster than inserting it into SQLite, so the reading got ahead of the writing, and pause did not behave as I expected (I expected it to pause the output immediately). I decided to implement an asynchronous iterator interface, not within the osm-read codebase but on top of its API.

The code I use to iterate objects is as follows:

let c = 0;
let type_counts = {};
for await (const item of reader.objects) {
    //console.log(item);
    const {type} = item;
    // Count each object under its type, starting from zero on first sight.
    type_counts[type] = (type_counts[type] || 0) + 1;
    c++;
}
console.log('c', c);
console.log('type_counts', type_counts);

For guernsey-and-jersey I get the following output:

c 513686
type_counts { node: 461240, way: 51971, relation: 475 }

The code I have here is concise. It runs quickly when results are requested quickly, and as slowly as needed when processing each result takes more time.
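To illustrate the pacing described above, here is a minimal sketch that does not use osm-read at all. The `records` generator and `slowInsert` function are hypothetical stand-ins for a reader and a database write; the point is only that an async generator produces the next record when the consumer asks for it, not before.

```javascript
// Hypothetical stand-in for a reader: yields numbered records on demand.
// Each production is logged so the ordering against the consumer is visible.
async function* records(count, log) {
    for (let i = 0; i < count; i++) {
        log.push(`produced ${i}`);
        yield { type: 'node', id: i };
    }
}

// Deliberately slow consumer step, standing in for a database insert.
function slowInsert(item, log) {
    return new Promise((resolve) => {
        setTimeout(() => {
            log.push(`stored ${item.id}`);
            resolve();
        }, 5);
    });
}

// Consumes every record, waiting for each insert before requesting the next.
async function run(count) {
    const log = [];
    for await (const item of records(count, log)) {
        await slowInsert(item, log);
    }
    return log;
}
```

In the log returned by `run`, each "produced" entry is followed by its "stored" entry before the next record is produced, so the producer never gets ahead of the consumer.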

Is this a feature that's worth incorporating into the library?

I would like to coordinate with @marook regarding including this and possibly other features in osm-read.

marook commented 1 year ago

I think the parts of the API that currently do not behave well with asynchronous processing are mainly the node, way and relation callbacks. Right now osm-read does not interpret the return values of these callbacks. I agree that this is an issue when nodes, ways or relations cannot be processed in a blocking way.

My first idea would be to allow the node, way and relation callbacks to return either undefined or a promise. If a callback returns undefined, we should stick with the current behavior and continue with the next element. This will hopefully not break the API for existing users.

If the callbacks return a promise we could block until the promise resolves.
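A minimal sketch of that idea, written independently of osm-read's internals: `dispatchInOrder` is a hypothetical helper, not an existing function in the library. It calls the callback for each element and waits only when the return value looks like a promise.

```javascript
// Hypothetical dispatcher, not osm-read code.
// Calls `callback` for each element in order. A callback that returns
// undefined (or any non-thenable) does not delay the next element; a
// callback that returns a promise is awaited before moving on.
async function dispatchInOrder(elements, callback) {
    for (const element of elements) {
        const result = callback(element);
        if (result && typeof result.then === 'function') {
            await result;
        }
    }
}
```

Because the return value is checked rather than required, existing callbacks that return nothing would keep working unchanged under this scheme.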

@metabench Is this the solution you had in mind? Or what disadvantages does it bring compared with the solution you implemented in your private fork?

metabench commented 1 year ago

@marook I did not implement it in a private fork, but in a private codebase which uses osm-read as a dependency. I'd be willing to license that specific code under MIT or LGPL. The solution I have in mind does not use the existing callback interface, so no changes, breaking or otherwise, would need to be made to that part of the code.

My code provides an easy way to iterate through all objects, or just nodes, ways, or relations. I am new to writing async iterators, but they seem like the best way to control the pace at which results are provided for processing.
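Iterating over just one kind of object can be layered on top of an "all objects" iterator. The sketch below is an assumption about how that might look, not the code referred to above: `ofType` and `sampleObjects` are hypothetical names, and the sample data is made up.

```javascript
// Hypothetical helper: narrows any (async) iterable of OSM objects to the
// ones whose `type` field matches, preserving order and on-demand pacing.
async function* ofType(objects, type) {
    for await (const item of objects) {
        if (item.type === type) {
            yield item;
        }
    }
}

// Made-up sample source standing in for an "all objects" iterator.
async function* sampleObjects() {
    yield { type: 'node', id: 1 };
    yield { type: 'way', id: 2 };
    yield { type: 'node', id: 3 };
    yield { type: 'relation', id: 4 };
}

// Collects the ids of all objects of one type from the sample source.
async function idsOfType(type) {
    const ids = [];
    for await (const item of ofType(sampleObjects(), type)) {
        ids.push(item.id);
    }
    return ids;
}
```

With this shape, a consumer would write `for await (const way of ofType(source, 'way'))` and still control the pace, because the filter only pulls from the source when asked for the next match.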

metabench commented 1 year ago

@marook Any advice on what my next move should be?

I could post the iterator here in a large comment. The problem with my current code is that it assumes only 1 primitivegroup per block. I'm also in the habit of releasing code under the MIT license, which seems very compatible with the licence of this project.
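Lifting the one-group-per-block assumption mostly means adding an inner loop. The sketch below uses a simplified, hypothetical block shape (`{ groups: [[item, ...], ...] }`); the structure of blocks decoded by osm-read will differ, so this only shows the iteration pattern.

```javascript
// Assumed shape, for illustration only: each block is { groups: [...] },
// and each group is a plain array of already-decoded objects.
// Yields every object from every group of every block, in order.
async function* objectsInBlocks(blocks) {
    for await (const block of blocks) {
        for (const group of block.groups) {
            for (const item of group) {
                yield item;
            }
        }
    }
}

// Drains the iterator into an array of ids (helper for trying the sketch).
async function collectIds(blocks) {
    const ids = [];
    for await (const item of objectsInBlocks(blocks)) {
        ids.push(item.id);
    }
    return ids;
}
```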

Regarding blocking until a returned promise resolves - that sounds like it could be a good way to solve this sort of problem too. I suggest, though, that the two be treated as different interfaces with different syntax; from the looks of things, the async iterator allows more concise code than multiple callback functions. It happens to be the interface I have wanted to make, and it also looks as though it would be useful to this project.