ipfs / devgrants

The IPFS Grant platform connects funding organizations with builders and researchers in the IPFS community.

[TRACKING] Open Street Map #59

parkan opened 4 years ago

parkan commented 4 years ago

Tracking for https://github.com/okdistribute/devgrants/blob/25eb4682b305189121b8ae35472bf86a06bf59e5/targeted-grants/open-street-map-ipfs.md

@okdistribute do you have updates on this project?

ghost commented 4 years ago

We have the database working in the browser now on the wasm-async branch:

https://github.com/peermaps/eyros/blob/wasm-async/pkg/main.js

[screenshot: the database running in the browser]

There is still some work to do to make the file format more amenable to updates over ipfs: an appended-to file gets a new hash, so I am working on a file-based linked-list format for more efficiently transmitting map-data updates to an existing partial cache.
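
A minimal sketch of that linked-list idea (all names here are hypothetical, not the actual eyros format): each update batch becomes an immutable block that records the hash of the previous head, so a client with a partial cache only fetches the blocks written after the head it already knows:

```rust
use std::collections::HashMap;

// Hypothetical update block: prev links to the previous head's hash.
struct UpdateBlock {
    prev: Option<String>, // hash of the previous block; None for the first
    records: Vec<u8>,     // packed map records in this batch
}

// Walk back from the newest head until we hit a hash the client
// already has, then return the missing blocks oldest-first.
fn blocks_since<'a>(
    head: &str,
    known: &str,
    store: &'a HashMap<String, UpdateBlock>,
) -> Vec<&'a UpdateBlock> {
    let mut out = Vec::new();
    let mut cur = Some(head.to_string());
    while let Some(hash) = cur {
        if hash == known {
            break;
        }
        let block = &store[&hash]; // panics on a missing block; fine for a sketch
        out.push(block);
        cur = block.prev.clone();
    }
    out.reverse(); // oldest first, so the cache can apply them in order
    out
}
```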

I'm also working on a scheme to ensure that ingested blocks end up in sufficiently geographically compact pages. Right now the block scheme is sensitive to the locality of ingested features, but there is an opportunity during tree rebuilding and block creation to perform swaps so that the regions expressed by blocks are more compact and the tree is more balanced. This is important so that clients will receive pages local to the bounded region they are looking at.
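
A toy illustration of the swap idea (not the eyros implementation): accept a move of a feature between two pages when it shrinks the combined bounding-box area, which keeps each page geographically compact:

```rust
#[derive(Clone, Copy)]
struct BBox { minx: f64, miny: f64, maxx: f64, maxy: f64 }

impl BBox {
    fn area(&self) -> f64 { (self.maxx - self.minx) * (self.maxy - self.miny) }
    // Smallest box enclosing all boxes in the slice.
    fn union(a: &[BBox]) -> BBox {
        a.iter().fold(
            BBox { minx: f64::MAX, miny: f64::MAX, maxx: f64::MIN, maxy: f64::MIN },
            |u, b| BBox {
                minx: u.minx.min(b.minx), miny: u.miny.min(b.miny),
                maxx: u.maxx.max(b.maxx), maxy: u.maxy.max(b.maxy),
            },
        )
    }
}

// Cost of a pair of pages: total bounding-box area. Lower is more compact.
fn cost(p: &[BBox], q: &[BBox]) -> f64 {
    BBox::union(p).area() + BBox::union(q).area()
}

// Move feature i from page p to page q only if the move lowers the cost.
fn try_swap(p: &mut Vec<BBox>, q: &mut Vec<BBox>, i: usize) -> bool {
    let before = cost(p, q);
    let item = p.remove(i);
    q.push(item);
    if cost(p, q) < before {
        return true; // keep the move
    }
    q.pop();            // otherwise undo it
    p.insert(i, item);
    false
}
```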

okdistribute commented 4 years ago

Thanks @parkan and @substack for the update! We're currently working on Deliverable 1 "v1 of the peermaps/ingest library in Rust"

okdistribute commented 4 years ago

I just pushed up a first pass of georender-pack in Rust, which encodes pbf files into a smaller format, currently called peermaps/bufferschema.

The next step is to write some more tests to make sure it's compatible with the existing Node.js implementation, and then integrate it into peermaps/ingest, which will be the command-line tool for converting pbf files into an on-disk format. Then I'll look at the ipfs and filecoin APIs to see how best to push the files onto the network, recording usability notes along the way.
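
As a rough sketch of the pipeline shape (not the actual georender-pack or peermaps/ingest API), the widely used osmpbf crate can stream elements out of a pbf file; the packing step here is only a placeholder comment:

```rust
use osmpbf::{Element, ElementReader};

fn ingest(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let reader = ElementReader::from_path(path)?;
    let mut count: u64 = 0;
    reader.for_each(|element| match element {
        // In the real pipeline each feature would be denormalized and
        // packed into the compact georender format at this point.
        Element::Node(_) | Element::DenseNode(_) => count += 1,
        Element::Way(_) => count += 1,
        Element::Relation(_) => count += 1,
    })?;
    println!("read {} elements", count);
    Ok(())
}
```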

thibaultmol commented 3 years ago

Maybe get in contact with the OSM ops team; they might be able to help make this happen: https://twitter.com/OSM_Tech/status/1308465290263629825 They run the official osm tile server and are seeking help with hosting. IPFS hosting would be a great solution.

(ps: if the final solution ends up using ipfs clustering or something similar, please make it possible to configure an ipfs node to pin only a section of the data, so that people who want to contribute some bandwidth and storage but only want to dedicate x GB have that option)
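
For illustration only, one hypothetical way to get that today would be to publish one root CID per region and let contributors pin just the regions they choose; `ipfs pin add` is a real command, but the region names and CIDs below are made up:

```rust
use std::process::Command;

// Pin only the regions a contributor has opted into.
fn pin_regions(regions: &[(&str, &str)]) -> std::io::Result<()> {
    for &(name, cid) in regions {
        let status = Command::new("ipfs").args(["pin", "add", cid]).status()?;
        println!("pinned {} ({}): {}", name, cid, status.success());
    }
    Ok(())
}
```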

Firefishy commented 3 years ago

I am the person who put the @OSM_Tech tweet out. The tile CDN renders on demand; for zoom levels 0 to maybe 14 the cache hit ratio is high, but it decreases rapidly toward zoom 19, where the majority of requests are rendered on demand (~1% hit ratio). This tool is handy for working out the number of map tiles there are: https://tools.geofabrik.de/calc/#type=geofabrik_standard&bbox=-144.227563,-75.111889,233.16404,81.381733 — likely ~3 TB of storage required across 16.5 billion tiles.
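
To see why the hit ratio collapses at high zooms, note that a web-mercator tile pyramid quadruples at each zoom level, so levels 0 through N hold (4^(N+1) - 1) / 3 tiles in total. A minimal Rust sketch of the arithmetic:

```rust
// Cumulative tile count for zoom levels 0..=zoom.
fn tiles_through(zoom: u32) -> u128 {
    (0..=zoom).map(|z| 4u128.pow(z)).sum()
}

fn main() {
    for z in [14u32, 19] {
        println!("zoom 0..={}: {} tiles", z, tiles_through(z));
    }
    // zoom 0..=14: ~358 million tiles; zoom 0..=19: ~366 billion tiles,
    // far more than the ~16.5 billion the geofabrik calculator reports,
    // presumably because redundant ocean tiles are excluded there.
}
```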

okdistribute commented 3 years ago

Hi all! Thanks for chiming in with all this interesting information and links to existing conversations surrounding map tiles. Just want to give a status update --

We're doing preliminary testing on small metro areas and working on improving the performance of parsing and packing the full planet osm into the peermaps format. We're a bit behind schedule, and it's hard to say if we'll have a working demo of the fully parsed planet by the end of the year; the goal is to go from OSM dump to daily updates on the peer network, but that requires a significant lift of algorithmic work.

autonome commented 3 years ago

@okdistribute happy new year! Can you share where things are at?

okdistribute commented 3 years ago

Hi @autonome !

We have made good progress on the ingest cli tool, and the code can be found here. We are currently blocked on a full demo with planet.osm, as the eyros database needs some more bug fixes and tweaks from @substack.

However, I have enough to work with in the current version of the database to compile a report about feasibility and usability when building on both IPFS and Filecoin, just on a smaller metro area. I'll be working on that over the next two weeks (less than part-time) and then report back.

okdistribute commented 3 years ago

Hi @autonome , @substack is still working diligently on the database but it's almost ready for primetime. The peermaps ingest pipeline has processed 6.04 billion records in 84h39m, and counting.

Once this is complete the team will create some demos, put it on the network, and finish up the grant. Thanks so much for your patience during this as the team worked out the kinks with the underlying database.

autonome commented 3 years ago

6.04 billion 😱

whyrusleeping commented 3 years ago

@okdistribute is there a way I can pull a copy of the dataset? I have sufficient disk space and bandwidth.

ghost commented 3 years ago

It's not quite ready yet. I had to restart the ingest process a few times since it ran out of RAM and was too slow, but I think it's finally getting close to finishing in a reasonable amount of time. I will post the ipfs hash here as soon as we have the ingest data processed, and I'm also working on some changes to the db to improve query performance, which also involves the ingest process.

whyrusleeping commented 3 years ago

@substack oh cool, how are you doing the ingestion? I'm super curious

whyrusleeping commented 3 years ago

(reading through https://github.com/peermaps/ingest now)

ghost commented 2 years ago

We have the data processed and imported into ipfs now at /ipfs/QmVCYUK51Miz4jEjJxCq3bA6dfq5FXD6s2EYp6LjHQhGmh for the first (actually second) version. The query performance should be really good now, I tested over http and loading part of a city only takes ~1MB of transfer from a cold cache. I'm working on hooking up the web frontend and rendering stack next.

Apologies for the delays, it was fairly difficult to get the processing to finish in a reasonable amount of time on a single machine even with 60 GB of RAM. And then it took additional time to optimize the network transfer size for queries, but it's now down to about 1 MB from 250 MB initially.

ghost commented 2 years ago

I made a very unpolished demo running at https://ipfs.io/ipfs/QmSGA2eRYawmensn4DVz2WNt1zTwCG8TeWsxT3LhX4wxjr/ but walking the directory tree is very, very slow. With the default viewbox, it should only need to pull down ~40 files and a total of ~1.3 MB from the network, but I haven't yet gotten more than several files to resolve without timing out.

The peermaps output directory is seeded by a machine in a datacenter with very good network, cpu, and disk, so I'm not sure why things are so slow but perhaps 2.6 million separate files is infeasible to seed and announce onto the DHT from a single machine. There should also be at least one other seeder. Does anyone know what the limits to hosting this type of data are? When I fetch the files from that server over http, the map loads pretty quickly.

ghost commented 2 years ago

After some more research, I've identified a huge bottleneck: requests execute serially with eyros compiled to wasm. I expect fixing this will significantly speed up fetching, especially when there is significant latency.

ghost commented 2 years ago

performance is much better with this version: https://ipfs.io/ipfs/QmS24zmgDz2jFdakvd6aT6sRXSGRXWJaB62aPTbvmpguBB/

Requests now run in parallel, and requests no longer in the viewbox area are cancelled.
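
The demo itself is wasm/JS, but the same pattern can be sketched in Rust with tokio and reqwest (assumed here, not part of the project): spawn the fetches in parallel and abort any whose target leaves the viewbox:

```rust
use tokio::task::JoinHandle;

type Fetch = JoinHandle<Result<Vec<u8>, reqwest::Error>>;

async fn fetch_tile(url: String) -> Result<Vec<u8>, reqwest::Error> {
    Ok(reqwest::get(&url).await?.bytes().await?.to_vec())
}

// Spawn all fetches up front so they run concurrently.
fn spawn_fetches(urls: Vec<String>) -> Vec<Fetch> {
    urls.into_iter().map(|u| tokio::spawn(fetch_tile(u))).collect()
}

// Abort any request whose target is no longer inside the viewbox.
fn cancel_out_of_view(handles: &[Fetch], in_view: &[bool]) {
    for (h, keep) in handles.iter().zip(in_view) {
        if !*keep {
            h.abort();
        }
    }
}
```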

ghost commented 2 years ago

This is an updated version from October 15 that shouldn't crash on Chrome (which requires streaming wasm parsing):

https://ipfs.io/ipfs/Qmb4eNFHMoDpd5qxZsLuHxfTSwvzYnR6ReUtXqESEeD2aW/#bbox=-149.91,61.213,-149.89,61.223

We're working to fix and polish the rendering and get zoomed out versions to work with natural earth data and filtered osm geometry. I'll post a link when those features are ready.

autonome commented 2 years ago

@substack @okdistribute hi! the updates last quarter are exciting and inspiring. would be great to chat about how to wrap up this phase, and talk about follow-on work.

thibaultmol commented 2 years ago

Might be interesting to mention here: https://phabricator.wikimedia.org/T187601#7642399 Basically this guy has maintained a very popular osm tile server but didn't really get much compensation for it, so he's shutting it down unless someone takes it over. As a result, a lot of sites currently look like this: [screenshot of a map with missing tiles]

okdistribute commented 2 years ago

We're really happy to announce that we've completed the deliverables of this grant and have a working demo of peer-to-peer maps!

See peermaps/ingest which is a library for ingesting and diffing OpenStreetMap data, handling periodic changes. This tool converts OpenStreetMap pbf into the Peermaps format.

We also released a more 'all-in-one' commandline tool to diff .pbf files, convert to the Peermaps format, and pin both data formats on IPFS.

Here's the demo: https://ipfs.io/ipfs/Qmb4eNFHMoDpd5qxZsLuHxfTSwvzYnR6ReUtXqESEeD2aW/#bbox=-149.91,61.213,-149.89,61.223

And we have a blog post coming up!

ghost commented 2 years ago

There is now a second archive for processed natural earth data for zoomed-out views:

https://ipfs.io/ipfs/QmY1Ggv8EZT2973nNwjMB4rarUxiCgkAEADoiczWekpayq

https://ipfs.io/ipfs/QmTes6hYgCZzcP4oXjyvT48UwmzfqvWApXHs77sTnJV7fq

https://github.com/peermaps/data#natural-earth-vector-eyros-georender

Now I'm working on an additional view to bridge the gap from 10m natural earth to the full planet-osm data.

Tracking the addresses in https://github.com/peermaps/data and later in ipns too.

autonome commented 1 year ago

For folks interested in decentralized spatial...

A working group here that holds ~monthly talks: https://easierdata.org/dgwwg

Can sign up for dGWWG updates and talk announcements here: https://zc.vg/Dck0z

autonome commented 9 months ago

Dropping some comments here that were in a gdoc, so we don't lose them.

IPFS peermaps experience

The peermaps dataset is generated from planet-osm.pbf, a binary multi-GB file of OpenStreetMap data with inline compression. The OSM features are denormalized and converted into a format that is easier to render, and each record is written into a spatial database.

The spatial database is stored as a collection of small files that can link to each other. These files are very often less than 1 megabyte in size. The peermaps database is 100 GB split across 2.6 million files.

Running ipfs add on the directory structure finished in 4.8 hours, which is not very much time considering it took ~4 days to process the data from planet-osm.pbf. The resulting storage overhead was also modest, and a single machine appears entirely capable of seeding millions of hashes.

The biggest limitation has been the retrieval time for content. Using a local ipfs node and the ipfs companion browser extension with an empty cache, loading a new area of the map with this tool could take several minutes:

https://ipfs.io/ipfs/QmS24zmgDz2jFdakvd6aT6sRXSGRXWJaB62aPTbvmpguBB/

But once the cache is primed and the data is available locally, the results are very fast. The results are similar but slightly better using an ipfs http gateway: several minutes to fetch a new area. Each request has a huge amount of latency, and multiple requests are required to walk the tree structure used by the spatial index.

Running an http server on the vps that is seeding to ipfs, the map loads much more quickly, at a similar speed to other web maps.

Given these performance findings, it seems best for now to use ipfs for non-interactive bulk-loading purposes rather than realtime rendering. A web client could use a list of user-configurable http gateways that are known to have the entire peermaps dataset cached, and those gateways could use ipfs to fetch the initial content and seed the dataset.
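
A minimal sketch of that gateway-list idea, assuming reqwest's blocking client; the gateway URLs are examples, not a vetted list:

```rust
// Try each configured gateway in order; return the first success.
fn fetch_from_gateways(path: &str, gateways: &[&str]) -> Option<Vec<u8>> {
    for gw in gateways {
        let url = format!("{}{}", gw, path);
        match reqwest::blocking::get(&url) {
            Ok(resp) if resp.status().is_success() => {
                return resp.bytes().ok().map(|b| b.to_vec());
            }
            _ => continue, // gateway failed; try the next one
        }
    }
    None
}

// Usage (placeholder CID):
// fetch_from_gateways("/ipfs/<root-cid>/meta.json",
//     &["https://ipfs.io", "https://dweb.link"]);
```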

It may be possible to use libp2p to drive ipfs manually and fetch data more directly from peers that are known or highly likely to have files in the peermaps dataset. This manual strategy might have latency similar to fetching from a known list of http servers.

Trying to ingest data via the Filecoin API, we found installing Lotus was not easy. Because of the size of the dataset, we had to use the Lotus command-line tool, which needs to be built from source; it would be great if there were a prepackaged executable.