cgiovando closed this issue 9 years ago
I am only familiar with SQLite, none of the others.
Things to consider:
Cheers
Seems like a fairly simple use case. Nothing too exotic going on, so I would tend to steer clear of the exotic options and pick SQLite, a conventional SQL database.
Thanks Harry and Mark - I'm also not familiar with the "exotic" options and would agree on the more traditional SQL databases. It would be nice to have a comparison of the pros/cons in relation to our requirements from someone who has experience with both. I've heard nice things about NoSQL for scalability and disconnected editing, but I have no practical experience.
PostgreSQL + PostGIS are probably excellent options for the main OAM server, but we also don't need all the geospatial functionality they offer. We are basically just storing dataset bounding boxes (and maybe footprints) in GeoJSON, with no need for advanced spatial queries. Unless we decide to store imagery tiles in the database as well (instead of files/MBTiles), we are probably fine with something lighter and more portable such as SQLite.
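To make the "no spatial extension needed" point concrete, here is a minimal sketch of storing a dataset footprint as GeoJSON text in plain SQLite. The table and field names are illustrative, not an actual OAM schema:

```python
import json
import sqlite3

# Hypothetical "datasets" table; footprints are stored as GeoJSON text,
# so no spatial extension (SpatiaLite, PostGIS) is required.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE datasets (
        id INTEGER PRIMARY KEY,
        title TEXT,
        footprint TEXT  -- GeoJSON geometry serialized as plain text
    )
""")

footprint = {
    "type": "Polygon",
    "coordinates": [[[120.0, 10.0], [121.0, 10.0], [121.0, 11.0],
                     [120.0, 11.0], [120.0, 10.0]]],
}
conn.execute(
    "INSERT INTO datasets (title, footprint) VALUES (?, ?)",
    ("UAV survey tile", json.dumps(footprint)),
)

# A crude bounding-box lookup can be done in application code after
# decoding the JSON, which is enough for simple catalog browsing.
row = conn.execute("SELECT title, footprint FROM datasets").fetchone()
geom = json.loads(row[1])
lons = [p[0] for p in geom["coordinates"][0]]
bbox_west, bbox_east = min(lons), max(lons)
```

If we ever do need real spatial queries (intersections, nearest-neighbor), that would be the point to reconsider PostGIS.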
The other concern is in fact about portability, when we package everything into nix/win/mac/etc installers so that anyone going into a disconnected environment can easily start up a local OAM instance on a laptop, regardless of what operating system they are running.
Hmmm .. a disconnected environment, now that may raise a LOT of issues, just with the volume of data alone. Thinking about an OSM/HOT type of environment, would there be any need to update a disconnected server with new data based on their own activities in the field, i.e. will the distributed data change in any way? Possibly not, as it is base data used to generate OSM or overlay data.
Is offline storage a requirement for the system? If so, we need to think about the capability to move data selectively between servers, and the implications for network traffic. It may be better to consider moving the original data and then creating tiles on the local device as needed - a tradeoff between network traffic and the time to generate tiles at different zoom levels.
If the concept I proposed of Nodes being primarily tile caches holds, then it may be a simple matter of prepopulating the cache on the node prior to deployment.
Regarding SQLite, you need to consider any limitations on database size and performance. If the catalog gets quite large and busy, will it scale? I have only ever used it for local storage on client machines, so I have never had to consider this aspect.
Its locking scheme for updates vs. queries may also be worth a look for any potential issues on a busy catalog server.
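On the locking question: SQLite's default rollback journal blocks readers while a write is in progress, but enabling WAL (write-ahead logging) lets readers proceed concurrently with a single writer. A quick sketch of turning it on (the file path is illustrative):

```python
import os
import sqlite3
import tempfile

# WAL mode only applies to file-backed databases, not :memory:.
path = os.path.join(tempfile.mkdtemp(), "catalog.db")
conn = sqlite3.connect(path)

# Switch the journal mode; SQLite returns the mode actually in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Caveat: even in WAL mode there is still only ONE writer at a time;
# WAL does not turn SQLite into a multi-writer server database.
```

So WAL eases the read-while-write concern, but a genuinely busy multi-writer catalog would still point toward a server database.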
Another Thought on Databases is Multi-Catalog Server Synchronization ..
There are two approaches: build the push/pull functionality into the catalog application, or implement PostgreSQL/PostGIS and do the replication at the database level, with no extra software required.
Not sure if any of the other databases offer this option.
@MarkCupitt regarding the disconnected environment: it would not need to mirror all of the imagery of the entire project. It would need to be able to locally process and catalog items there. Imagine flying UAV imagery after a disaster and needing to process and use it. Being able to see what is available (mirroring the catalog) would be useful, though, so the user could choose to go and download the imagery relevant to them.
@wonderchook that is a good usage example. It came up as an issue after Ruby, when some drones were finally able to get permission and position to fly. The issue became how to get the imagery out of the area; in the end, hand-carrying it on a USB stick was the only way, because of the size of the files.
So as an example, if a nice and fast notebook were configured as an OAM Node and running in a disaster area on whatever local network was available, it could catalog and store imagery locally for people in the area and serve it for use in local mapping, etc. Secondly, if outbound connectivity was available (hand-carry or upload), that same imagery could be made available worldwide.
So an export/import capability via fixed media (USB stick, etc.) seems needed?
Sync/offline capabilities can be quite challenging, and I'd strongly suggest this is given a lot more careful consideration. Would it be worth splitting the conversation off to another issue?
WRT offline/sync: what's the most important bit to sync, the metadata or the imagery? And are these concerns actual database concerns, or more generic data-layer concerns?
On capturing new imagery or downloading it in the field, using something like BitTorrent Sync or a similar approach seems plausible - it supports resuming file transfers and checksums for small parts, and automatically gives you fast-ish replication if you have enough people seeding data. You do run into possible issues with provider-level restrictions on connectivity, which is a downside, but for the most part you don't need rich, well-structured data here, just binary transfer. Compared to HTTP or FTP it has a lot of benefits, especially if data connectivity can be flaky.
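The "checksums for small parts" idea can be illustrated in a few lines: split the payload into fixed-size pieces and hash each one, so a flaky transfer can verify and resume piece-by-piece instead of re-sending the whole file. This is a sketch of the concept, not BitTorrent's actual wire format:

```python
import hashlib

# Tiny piece size for illustration only; torrents typically use 256 KiB+.
PIECE_SIZE = 4

def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE):
    """Return one SHA-1 digest per piece, mirroring a torrent's piece list."""
    return [
        hashlib.sha1(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]

payload = b"imagery-bytes"
hashes = piece_hashes(payload)
# A receiver re-hashes each piece it holds and re-requests only the
# pieces whose digests do not match.
```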
That leaves you with the challenge of syncing the actual metadata about the raw imagery files or tiles from A to B, which is considerably easier - it seems like you'd have to sync little blobs of (title, dataset identifier, description, creation_date, multipolygon of extents), and not much more.
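To show how small such a blob is, here is a hypothetical metadata record along those lines (field names and values are made up for illustration, not a fixed OAM schema):

```python
import json

# Hypothetical catalog metadata record: title, dataset identifier,
# description, creation date, and a multipolygon of extents.
record = {
    "title": "Post-disaster UAV mosaic",
    "dataset_id": "uav-2014-12-07-001",
    "description": "Low-altitude RGB imagery over the affected coastline",
    "creation_date": "2014-12-07T00:00:00Z",
    "extents": {
        "type": "MultiPolygon",
        "coordinates": [[[[125.0, 11.0], [125.1, 11.0], [125.1, 11.1],
                          [125.0, 11.1], [125.0, 11.0]]]],
    },
}

# Serialized, this is a few hundred bytes - syncing the catalog is cheap
# compared to moving the imagery itself.
blob = json.dumps(record)
```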
It's only the metadata sync problem that becomes a database replication problem, and potentially just adding a CRUD REST API for that part makes the actual DB choice irrelevant. I.e. CKAN's datastore is just Elasticsearch, which out of the box provides CRUD for documents - a simple proxy to do authentication and validation of data is all you'd need.
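The point that a CRUD layer hides the storage choice can be sketched with a minimal in-memory store exposing the same four operations a REST proxy would; a real implementation would authenticate, validate, and forward to Elasticsearch, MongoDB, or SQLite, and callers would never see which. All names here are illustrative:

```python
# Hypothetical storage-agnostic CRUD surface for catalog metadata.
class CatalogStore:
    def __init__(self):
        self._docs = {}  # stand-in for any backing database

    def create(self, doc_id, doc):
        self._docs[doc_id] = dict(doc)

    def read(self, doc_id):
        return self._docs.get(doc_id)

    def update(self, doc_id, changes):
        self._docs[doc_id].update(changes)

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)

store = CatalogStore()
store.create("ds-1", {"title": "Survey A"})
store.update("ds-1", {"title": "Survey A (reprocessed)"})
```

Swapping the backing database then only changes the body of these four methods, not the sync protocol built on top of them.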
That, however, is a very different discussion from "is it suitable to use SQLite, or PostGIS, etc."
Hey all - based on our initial direction, we're going to start working with MongoDB. For production searching, Elasticsearch may be used to back the DB. This still needs to be tested out. Tracking the code here: https://github.com/hotosm/oam-catalog.
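For a sense of what a catalog document might look like in MongoDB, here is a guessed document shape (not the actual oam-catalog schema). MongoDB can index a GeoJSON field with a 2dsphere index for spatial queries; the bbox test is mimicked in plain Python here so the example stays self-contained:

```python
# Hypothetical catalog document; field names are illustrative only.
doc = {
    "_id": "ds-42",
    "title": "Coastal mosaic",
    "bbox": [125.0, 11.0, 125.1, 11.1],  # [west, south, east, north]
}

def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# A client searching a map viewport would send a query box like this;
# in MongoDB proper this would be a geo query against an indexed field.
query_bbox = [124.9, 10.9, 125.05, 11.05]
hit = bbox_intersects(doc["bbox"], query_bbox)
```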
The OAM catalog will store metadata for imagery datasets from the local instance and other OAM nodes. This metadata will be stored in a lightweight database solution, able to scale as needed by future versions of OAM. Key requirements are:
The catalog Web interface should be able to interact directly with the database without additional middleware or bindings. NoSQL-type databases may be more suitable in general, since we won't need advanced join or spatial query features.
Some existing projects and code that we should look into:
Let's discuss options using comments below.