jonrh / lambda-lovelace

🎓 MSc project, recommender system for tweets
https://jonrh.github.io/lambda-lovelace/
ISC License
4 stars 0 forks source link

Find suitable databases #29

Closed jonrh closed 8 years ago

jonrh commented 8 years ago

In this issue I note in a ad-hoc way things of relevance to selecting our database. It can be summarised as follows: I think our most suitable databases are RethinkDB or PostgreSQL (or some combination of those).

Requirements

We require a database for two main purposes:

What ever our selection will be there must be good support for writing our DB interfacing code in Python.

Potential document DB candidates:

Potential relational DB candidates:

This issue tracks notes on what data stores might suit us best.

http://www.stackoverkill.com/ranking/sql-nosql

By mentions on Stack Overflow MongoDB dwarfs the other NoSQL DBs. Next up comes Redis and Cassandra.

List of document databases on Wikipedia: https://en.wikipedia.org/wiki/Document-oriented_database#Implementations

Popular NoSQL databases by GitHub stars

  1. Redis: 18,590
  2. RethinkDB: 14,419
  3. MongoDB: 9,558
  4. PouchDB: 6,462
  5. Titan: 3,791
  6. FlockDB: 2,875
  7. NEO4J: 2,749
  8. CouchDB: 2,572

and more

Compose https://www.compose.io/

Public cloud service to host various databases:

Didn't dive deep into it. A share-nothing document database that seems to be designed for containerised environments (Docker).

https://en.wikipedia.org/wiki/CrateIO https://crate.io/

ElasticSearch

Originally I thought this was only a search engine but apparently it's a document database as well. The official website contains only ambiguous and general marketing material.

This video gave me a pretty good quick start picture of what ElasticSearch is and how you do basic queries: https://www.youtube.com/watch?v=60UsHHsKyN4

RethinkDB

A realtime document database that pushes changes to clients. Native driver languages: Ruby, Python, Java, and JavaScript. From what I've read I've only good impressions. Documentation is good, second most starred database on GitHub, after Redis.

Probably not what we are looking for. Redis is a fast in-memory key-value store, not a document database.

Relational Databases

On first glance we think that a relational database might be the best option to store the results from user evaluations as well as interaction observations.

Originally we aimed for a document DB to store the tweet cache (JSON). However on further investigation it seems relational databases have been adding support for JSON over the years. If the support is good enough it would probably be best to use one database instead of two.

PostgreSQL

PostgreSQL has had JSON support since version 9.2. In version 9.5 support was added to modify JSON in place.

JSON data type: https://www.postgresql.org/docs/9.5/static/datatype-json.html JSON functions: https://www.postgresql.org/docs/9.5/static/functions-json.html

Resources

Rob Conery - Document Storage Techniques with PostgreSQL and JSONB

Link: https://www.youtube.com/watch?v=rg_GiOZ5Owk

Summary: With recent addition of JSON support to PostgreSQL it has now acquired one of the main benefit of NoSQL databases, get started fast. He takes an example that the appeal of NoSQL is that you don't have to worry about the schema'a right away. To get going fast, just add a column or a table with JSONB (JSON Binary representation) and store the information you need. As time goes on and you need to access the data you slowly normalise the data into tables over time.

Is PostgreSQL Your Next JSON Database?

Link: https://www.compose.io/articles/is-postgresql-your-next-json-database/

Summary: PostgreSQL is nice if we are not changing the JSON a lot. If we are, a document database would be more suitable. According to this blog you can't update JSON fields in place, you have to have some external (Python) code to take it out, add it, then dump it all in again.

MySQL

Todo.

References

Document databases, Wikipedia: https://en.wikipedia.org/wiki/Document-oriented_database

jonrh commented 8 years ago

RethinkDB or PostgreSQL.