taa176 / web

Website. Welcome.
0 stars 0 forks source link

Storing the database #5

Closed zeumai closed 5 years ago

zeumai commented 5 years ago

Hey, guys. I've been looking into this, and it seems like the most GitHub-friendly approach to storing databases in just to store the schema (tables, relationships, stored queries, etc.) without the actual data. This is why the replace_db branch includes a text file ('db_script.sql') rather than a database directory; the idea is that the database should be rebuilt from this script when you run server.js. Obviously this is provisional, but I think it's the best approach until we actually put the site on the web. Let me know if you disagree.

Just to keep you apprised of my progress: I've built a database with two tables, one for users (username, email, hashed password, plus I'll want to add salt back in, it sounds like), and one for chat messages (sender, recipient, sent time, message body). I've done a bit of testing (adding and deleting users and messages) just to make sure that the indexes and relationships are working as I'd expect them to. I haven't used MySQL before, so I don't want to make too many assumptions. The db_script.sql file is a text dump (made using MySQL Workbench) of this database's schema as a series of SQL statements.

Right now I'm trying to figure out how to get a mysql connection to execute a .sql file. I'm not sure exactly how to proceed. I'm sure there's another package I could add that would just handle it for me. 'execsql' looks promising. On the other hand, I could read the db_script.sql file and run the 'query' function on each statement individually. I'll need 'fs' for that, I think. However we decided to handle it, I think I'll need to add additional package dependencies on top of mysql. How okay are you two with that?

taa176 commented 5 years ago

It's fine with me if you need another package - go for it.

I have a few questions about the DB. So, I think what you say about having the schema/text file and rebuilding it sounds like the right thing to do. I'm wondering about a few other things, though.

a) It seems like it should only be rebuilt once. I.e. we check if it exists, and if not then we rebuild. So that would mean it would exist on disk (hard drive) when the server isn't running. Is that what you were thinking?

b) Since accessing hard drive data takes time, should we load in some/all of the database into RAM at the beginning? Did you have any thoughts on when the loading would take place/how often?

zeumai commented 5 years ago

On point a), I actually have the script set up to delete an existing database of the same name, if one exists. It would be straightforward to set up the opposite behavior (i.e. cancel the creation of the new database if an earlier version exists), but I'm not sure we'd want to; otherwise, it seems to me that we'd have no way of editing the database and seeing our changes reflected the next time we run the code. On the other hand, once we reach a point where you think the database is doing everything that you need it to do, I think it makes sense to stop editing it altogether and let our disk versions persist. What do you think about that?

On b) -- what data is and isn't in temporary memory is going to be handled by the storage engine, in this case InnoDB. We do have control over the choice of engine, and there are a couple engines designed for this purpose: [https://dev.mysql.com/doc/refman/8.0/en/memory-storage-engine.html]. That said, I think InnoDB is the way to go. The way a storage engine handles RAM vs. disk storage is closely tied to its ability to execute transactions and ensure data integrity. Especially once we have the site online, I think the best way to reduce the impact of a crash is to use an engine like InnoDB that commits changes to disk once per transaction.

My main concern when it comes to performance would be the message table, where we'd want to save every message once it's sent. The high number of concurrent users could slow querying down. I really don't think we'll have a big problem with that, though; WordPress blogs, for instance, typically use an engine that's actually less performant for writes than InnoDB.

zeumai commented 5 years ago

Not that WordPress is maybe the highest standard in the world.

taa176 commented 5 years ago

Okay, yep a) seems good to me.

How much of the DB engine do you control? So, for the message table, could we save the messages in batches? This is completely unimportant for now, just wondering how flexible the database engines are in general.

zeumai commented 5 years ago

Yeah, we can save messages in batches. I think the best way to do that would be to use client-side tables to stage the data. My understanding is that database engines are pretty inflexible by design, but I could be wrong. It looks like there are some ways to optimize InnoDB for particular common tasks, so we have at least some level of control.

k-hendricks commented 5 years ago

Are you proposing saving all the chat messages forever? And, if so, can you move them to a different database every once in a while to save on lookup time?

On Sun, Sep 15, 2019, 09:52 zeumai notifications@github.com wrote:

Yeah, we can save messages in batches. I think the best way to do that would be to use client-side tables to stage the data. My understanding is that database engines are pretty inflexible by design, but I could be wrong. It looks like there are some ways to optimize InnoDB for particular common tasks, so we have at least some level of control.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/taa176/web/issues/5?email_source=notifications&email_token=AEKJW6K33JUW44AQ272NEGDQJZR4XA5CNFSM4IWYZ5S2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6XUR5A#issuecomment-531581172, or mute the thread https://github.com/notifications/unsubscribe-auth/AEKJW6JJ2BSQFCL2UVIKHOTQJZR4XANCNFSM4IWYZ5SQ .

zeumai commented 5 years ago

If we do save them forever, we should definitely archive older messages in another table. That's very doable. Would you prefer not to save all the messages, though?

k-hendricks commented 5 years ago

I don't have an opinion either way; I thought that was what you were saying.

On Sun, Sep 15, 2019, 11:12 zeumai notifications@github.com wrote:

If we do save them forever, we should definitely archive older messages in another table. That's very doable. Would you prefer not to save all the messages, though?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/taa176/web/issues/5?email_source=notifications&email_token=AEKJW6PWD4SMAWDANP6KM3LQJZ3HNA5CNFSM4IWYZ5S2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6XWAMQ#issuecomment-531587122, or mute the thread https://github.com/notifications/unsubscribe-auth/AEKJW6NN4UC6EKREQZPZ4RTQJZ3HNANCNFSM4IWYZ5SQ .

zeumai commented 5 years ago

It was what I had in mind. I don't think it's a bad idea, but maybe it's unnecessary or even gratuitous?

zeumai commented 5 years ago

Since the site is ultimately supposed to help with tutoring, I think having a permanent record of chats makes sense.

k-hendricks commented 5 years ago

No reason to just throw out data when its so easy it keep it

On Sun, Sep 15, 2019, 11:18 zeumai notifications@github.com wrote:

It was what I had in mind. I don't think it's a bad idea, but maybe it's unnecessary or even gratuitous?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/taa176/web/issues/5?email_source=notifications&email_token=AEKJW6JHSFJZLTSKPXCLQXLQJZ4ARA5CNFSM4IWYZ5S2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6XWEMY#issuecomment-531587635, or mute the thread https://github.com/notifications/unsubscribe-auth/AEKJW6O3PQAEPKSPR4OKESDQJZ4ARANCNFSM4IWYZ5SQ .

zeumai commented 5 years ago

Agreed.

zeumai commented 5 years ago

I'm closing this, but Tyler: I ended up going with your initial suggestion after all. Just rebuilding the database once (and then only as needed after that) seems like the best way to keep the rebuilding code isolated from the rest of the project, which is ideal since we'll be getting rid of it anyway.