noisebridge / MediaBridge

2 stars 3 forks source link

Architecture: where are we storing our movies? #3

Closed audiodude closed 2 months ago

audiodude commented 3 months ago

We need some kind of data storage for the movies that we pull from the Netflix data set, like in the process described in #2

We could use a relation database:

  1. MySQL/MariaDB
  2. PostgreSQL

Or maybe a no-SQL database:

  1. MongoDB

And maybe some kind of vector database?

cocomittens commented 3 months ago

I'm studying for a system design interview right now I feel like I should be able to answer this, but idk.

Like I'd imagine storing the media data itself, it would be a large amount but probably wouldn't increase exponentially or anything (would be limited by the rate at which new things come out), so not sure if it would need to be highly scalable (which I believe NoSQL would be better for). However, there could theoretically be a large number of users or something else that might need to scale significantly at some point, though I'm not sure that's an immediate concern.

Depending on what media types there are, I'm not sure what the schema would look like. I would think a lot of things would be the same regardless (title, rating, creator(s), genre, etc). But with so many different types there may be different media specific attributes that a dynamic schema would be better suited for. So kinda leaning towards NoSQL for this, but unsure.

And then I'm not sure ACID properties are necessarily super important, like eventual consistency is fine. Performance is more important I would say. So probably don't need SQL for that. But there's probably a lot of complex queries/ relationships between the data that SQL might be beneficial for.

Overall, I'd lean towards NoSQL, although I like Postgres. But there could also be a scenario where SQL could be beneficial as well. And like ultimately both would work (unless we need a dynamic schema, or there is like some kind of relational thing we can't do with NoSQL), and maybe we just want to learn/ practice SQL? However this could change depending on what the vector database is used for. I'm only vaguely familiar with what they are, but don't know enough to evaluate tradeoffs etc. But from what I understand that could definitely be useful for this for sure, and plus just want to learn more about them, so definitely down to use one of those as well.

(Also, I realize I'm assuming that this would just be like one general Media table with like a type property of music, movies, books, etc. The dynamic schema thing probably wouldn't be necessary if we had separate tables for each media type. In that case I might slightly lean toward SQL)

audiodude commented 3 months ago

One thing we also discussed briefly in the meeting on Thursday was that we might have multiple databases:

  1. Our "media store" where we have entries for every movie, book, etc
  2. The "application database" where we store user profiles and information related to serving the frontend/website

I think for the media store, MongoDB makes a lot of sense. There is only one "table" in this database: the items/media. So there's nothing to JOIN against, no need for relational tables. It will be easier to stash and retrieve Python dictionaries to and from Mongo than writing SQL. And like you pointed out, having a flexible/dynamic schema could be very beneficial (no schema migrations).

cocomittens commented 2 months ago

Yea I totally agree, I think that makes sense. What would the vector database potentially be used for? Was that like an alternative option for the media store or application database, or something else? And then what would we use for the application one? I feel like SQL could maybe work for that, as the schema would probably be pretty fixed, and maybe just to use something different? Or maybe it would be easier to use the same thing for both of them.

audiodude commented 2 months ago

A vector database I believe is used for AI processes like Retrieval Augmented Generation (RAG). Not clear if we need one at this point, though I've also read that Mongo can act as one as well.