Migrate Existing Project Notes To DB

cycomachead commented 6 years ago

I think it makes sense to sync existing notes into the DB. Right now, when you open your list of projects, the app has to read and parse all of your project files.

Loading the project list seems quite a bit faster without needing to read all the notes. (It takes about 1/10 the time on my project list, though the speedup you actually see in Snap! is not quite 10x.)

This requires keeping 2 things in sync, which is a bit annoying, but I think it's worth it, given that the current situation involves reading (and eventually writing) the XML files for updating notes.

I think there's 3 changes:

- Sync all existing project notes.
- When saving a project, we extract the notes in addition to the thumbnail.
  - IMO, we could parse the XML when passed w/o a notes parameter, but we shouldn't do this until necessary.
- We update the DB and the XML file when updating notes from the front end.
- Reading project metadata no longer needs to load files.
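
A minimal sketch of the save-time fallback mentioned above, in Python rather than the server's own language. It assumes the notes live in a `<notes>` element directly under the project's root element; the function name `notes_for_save` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def notes_for_save(project_xml: str, notes: Optional[str] = None) -> str:
    """Return the notes string to store in the DB when a project is saved."""
    if notes is not None:
        # The client sent the notes explicitly, so the XML is not parsed.
        return notes
    # Fallback: parse the XML only when the notes parameter is missing.
    root = ET.fromstring(project_xml)
    element = root.find("notes")  # assumed location of the notes element
    if element is None or element.text is None:
        # Known-empty notes are stored as '', never as NULL.
        return ""
    return element.text
```

Returning `''` instead of nothing keeps "we looked and the notes are empty" distinct from "we never looked".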

bromagosa commented 5 years ago

We can't just

Sync all existing project notes.

by running a script that iterates over 2,000,000 projects and parses them.

Ideas?

cycomachead commented 5 years ago

I see no reason why we can’t.

It’s a background task and light on resources. And since it’s long-running, at the point where you are about to update the DB you add a check to make sure the project hasn’t been touched in the past hour.

Plus it’s only the projects pre-migration. Since the migration, every project save from Snap! has sent the notes over, so they should have already been in sync since then. (I think I meant to edit the description to clarify that.)
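
Roughly, the "hasn't been touched in the past hour" check could be folded into the update itself. This is only a sketch in Python against an in-memory SQLite table; the `id` and `lastupdated` columns and the function name are assumptions for illustration, not the real schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def sync_notes_if_untouched(conn, project_id, notes, now,
                            quiet_period=timedelta(hours=1)):
    """Write notes only if they are still NULL and the project has not been
    saved within quiet_period. Returns True when a row was updated."""
    cutoff = (now - quiet_period).isoformat()
    cur = conn.execute(
        "UPDATE projects SET notes = ? "
        "WHERE id = ? AND notes IS NULL AND lastupdated < ?",
        (notes, project_id, cutoff),
    )
    conn.commit()
    return cur.rowcount == 1


# Tiny demo table: project 1 is old, project 2 was saved 5 minutes ago.
# Timestamps are UTC ISO strings, so string comparison matches time order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, notes TEXT, lastupdated TEXT)"
)
now = datetime(2019, 1, 13, 12, 0, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO projects VALUES (?, NULL, ?)",
    [
        (1, (now - timedelta(days=30)).isoformat()),
        (2, (now - timedelta(minutes=5)).isoformat()),
    ],
)
```

Putting the guard in the `WHERE` clause means the check and the write happen in one statement, so a save that lands between "read the file" and "update the row" is not overwritten.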

-- Michael Ball (by email, in reply to Bernat Romagosa's comment of Jan 12, 2019)

bromagosa commented 5 years ago

It doesn't matter if they're in sync or not. You can't know unless you read the project from disk, so you'll still have to iterate over the 2M projects...

If you write the script, just remember to garbage collect every time you load a big project (>5 MB).
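
As a sketch of that advice, here is the shape of such a loop in Python (in Lua the corresponding call would be `collectgarbage()`). The 5 MB threshold comes from the comment above; the function names and the `handle_xml` callback are hypothetical.

```python
import gc
import os

BIG_PROJECT_BYTES = 5 * 1024 * 1024  # the ">5 MB" threshold mentioned above


def read_project(path):
    """Read one project file and report whether it counts as big."""
    size = os.path.getsize(path)
    with open(path, encoding="utf-8") as handle:
        return handle.read(), size > BIG_PROJECT_BYTES


def process_projects(paths, handle_xml):
    """Run handle_xml on every project file, forcing a collection after each
    big one. Returns the number of forced collections."""
    forced = 0
    for path in paths:
        xml, is_big = read_project(path)
        handle_xml(xml)
        del xml  # drop our reference before collecting
        if is_big:
            gc.collect()
            forced += 1
    return forced
```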

cycomachead commented 5 years ago

But we do know that all “recent” projects are in sync because it’s impossible to update the notes in the XML without going through Snap! (excluding any other API use, which hasn’t happened yet).

Also, I have to check, but I think this should be easy to do in batches because you can find unprocessed projects by looking for NULL project notes (as opposed to ‘’).
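
A rough sketch of what the batching could look like, written in Python against SQLite for brevity. It assumes a numeric `id` column and a hypothetical `load_notes(project_id)` callback that reads the notes from the project file.

```python
import sqlite3


def migrate_in_batches(conn, load_notes, batch_size=1000):
    """Fill in NULL notes one batch at a time. Returns how many projects
    were processed."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM projects WHERE notes IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        for (project_id,) in rows:
            # Store '' when nothing is found, so the row leaves the NULL set
            # and the loop is guaranteed to terminate.
            notes = load_notes(project_id) or ""
            conn.execute(
                "UPDATE projects SET notes = ? WHERE id = ? AND notes IS NULL",
                (notes, project_id),
            )
        conn.commit()
        total += len(rows)
```

Because processed rows stop matching `notes IS NULL`, the script can be stopped and restarted at any point without tracking progress separately.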

jmoenig commented 5 years ago

I know nothing about databases anymore, but back in the day we would make an index. One problem was that indices were blindingly fast but also took up lots of resources. At times our indices would take up more disk space than our "regular" data, but it was worth it for the stuff we were doing. I'm not sure whether it's the same here.

Another approach would be to embrace pagination and combine that with a timeout threshold. As long as we have at least one persistent key for every project (which we now do have, right?), we can just find the first - say - 10 matches and then offer the user the option to look for "more". The timeout takes care of it if it's either taking too long or we're through searching.
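
To make the pagination-plus-timeout idea concrete, here is a small Python sketch. It assumes projects arrive as `(key, project)` pairs ordered by a persistent key; the function name and parameters are invented for illustration.

```python
import time


def search_page(projects, predicate, after_key=0, page_size=10,
                timeout_s=2.0, clock=time.monotonic):
    """Scan projects in key order, starting after after_key.

    Returns (matches, last_key_scanned, exhausted). When exhausted is False,
    the caller can offer "more" and resume from last_key_scanned.
    """
    deadline = clock() + timeout_s
    found = []
    last_key = after_key
    for key, project in projects:
        if key <= after_key:
            continue  # already covered by an earlier page
        if len(found) >= page_size or clock() >= deadline:
            return found, last_key, False  # page full or out of time
        last_key = key
        if predicate(project):
            found.append(project)
    return found, last_key, True  # scanned everything
```

In a real query the `key <= after_key` skip would be a `WHERE` condition on the key instead of a linear scan.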

cycomachead commented 5 years ago

Indexes do take up some space, but not that much -- at least not with Postgres. (Which, random fun fact, was also developed at Berkeley initially!) In this case, we do have an index for projects on username and getting the data from the database itself is quite quick.

The issue with the project notes is that they exist in 2 places, both the XML and in the database. I'm confident we can make it such that we no longer need to read the XML to get the notes.

The way I see it is this:

- When saving a project from Snap!, we send the notes and the XML to the server. (This is already done.) In theory, if the notes are not present, we could extract them from the XML... but that's an enhancement that only matters when we have other API clients.
- When updating notes on the Social site, we can update both the XML and the DB. If updating the file in real time becomes an issue, we can update only the DB.
  - Then when you download a project, or open it in Snap!, we just update the notes from the DB.

It's worth noting that because you can edit notes in Snap! and on the social site, it will be possible to update notes in 2 places and create sync issues. We can resolve this later, either simply by checking updated timestamps or by having Snap! check for notes before save. Either way, it's a minor thing.
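
The timestamp option could be as simple as "newest copy wins". A Python sketch, with a made-up `NotesVersion` record standing in for however the two copies and their update times are actually stored:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NotesVersion:
    text: str
    updated_at: datetime


def resolve_notes(db: NotesVersion, xml: NotesVersion) -> NotesVersion:
    """Pick whichever copy was updated most recently; ties favor the DB."""
    return xml if xml.updated_at > db.updated_at else db
```

Favoring the DB on ties is an arbitrary choice here; it just matches the idea that the DB is the copy we read from by default.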

Re: indexes, space, and efficiency. Indexes on numbers (i.e. normal IDs) are generally more performant than indexes on strings (usernames). If we really need to eke out more performance we could change the foreign key from a username to a user_id. (The downside is that this requires an extra join, which needs to look up the user_id, so we're talking minor differences until we get to many, many more projects and users.)

Also, the speed of the database can be greatly improved by adding additional hardware and tuning resources. We haven't done much testing or tweaking here, but that's a very viable option. Plus, if we get support from SAP then using a separate server for a DB will help make that performance a bit more consistent.

cycomachead commented 5 years ago

So, I ran some numbers, and this is good. We can very easily do this in batches because we do distinguish between null project notes and empty strings!

snapcloud=> SELECT COUNT(*) FROM projects WHERE notes is NULL;
  count  
---------
 1470248
(1 row)

snapcloud=> SELECT COUNT(*) FROM projects WHERE notes = '';
 count  
--------
 886012
(1 row)

snapcloud=> SELECT COUNT(*) FROM projects WHERE notes <> '' and notes is NOT NULL;
 count 
-------
 20169
(1 row)

snapcloud=> SELECT COUNT(*) FROM projects;
  count  
---------
 2376430
(1 row)

(Those numbers do add up to an off-by-one error, but I think that's because someone just saved a new project. 😁)

What this also means is that going forward, we could always use the null-vs-empty-string convention and choose to only read files dynamically if the values are null.

cycomachead commented 5 years ago

OH! So, I think @bromagosa actually did most of what we need to do!! 😀 -- and something I was going to suggest as an alternative.

The updatingnotes=true option reads from the disk AND when it finds notes it also updates the database! So, for each user, there should only be 1 slow request, then the rest should be fast.

There are two problems with this:

The condition for checking if notes need to be generated is slightly wrong: if (project.notes == nil or project.notes == '') then. We should remove the second check. Notes in the DB as '' represent a project whose notes we know are empty.
The second issue -- the XML function for parsing notes is returning nil when it should return an empty string. I haven't yet found exactly why, but once I figure that out, I'll make a fix for both.
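
For clarity, the intended behavior after both fixes would look something like this. It's a Python stand-in for the Lua check, with hypothetical `parse_notes_from_disk` and `save_notes` callbacks, and a plain dict in place of the project record.

```python
def ensure_notes(project, parse_notes_from_disk, save_notes):
    """Return the project's notes, reading the file only when the DB value is
    missing (None, i.e. NULL). An empty string means 'known to be empty'."""
    if project["notes"] is None:
        # A parser result of None is normalized to '' so that the slow path
        # runs at most once per project.
        notes = parse_notes_from_disk(project) or ""
        save_notes(project, notes)
        project["notes"] = notes
    return project["notes"]
```

With the old `== ''` check in place, every project with empty notes would hit the disk on every request; checking only for the missing value avoids that.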

After that, I think we can leave the lazy notes generation on for a while, but eventually we could migrate old projects after waiting a bit -- it should be an interesting test to see how quickly that count of 1.4 million projects drops.

cycomachead commented 5 years ago

I was looking at this today and we're down to only about 1.38 million projects that still have null notes. It seems like most of these probably belong to users who were students and are unlikely to log in again, but all recently created projects (well, since the beginning of this year) are working as designed.

cycomachead commented 6 months ago

Tidying issues and closing this. :)

https://snap-analytics.cs10.org/question/193-usernames-count-projects-with-no-synced-notes There is a small handful of projects which were not migrated, but they belong to users who have deleted their accounts.

snap-cloud / snapCloud

Migrate Existing Project Notes To DB #125