Closed by cycomachead 6 months ago
We can't just

> Sync all existing project notes.

by running a script that iterates over 2,000,000 projects and parses them.
Ideas?
I see no reason why we can’t.
It’s a background task and low on resources. And since it’s long-running, at the point where you’re about to update the DB you add a check just to make sure an old project hasn’t been touched in the past hour.
Plus it’s only the projects pre-migration. Since the migration, every project save from Snap! has sent the notes over, so they should have already been in sync since then. (I think I meant to edit the description to clarify that.)
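A minimal sketch of that guard as a single statement, assuming the `projects` table has an `updated` timestamp column (the column name and the id are guesses for illustration):

```sql
-- Backfill notes for one project, but only if it hasn't been
-- touched in the past hour and hasn't already been synced.
-- 'updated' and the literal values here are assumptions.
UPDATE projects
   SET notes = 'notes text parsed from the XML'
 WHERE id = 12345
   AND updated < now() - interval '1 hour'
   AND notes IS NULL;
```

If the row count comes back 0, the project was either touched recently or already synced, and the script can just skip it.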
It doesn't matter if they're in sync or not. You can't know unless you read the project from disk, so you'll still have to iterate over the 2M projects...
If you write the script, just remember to garbage collect every time you load a big project (>5 MB).
But we do know that all “recent” projects are in sync, because it’s impossible to update the notes in the XML without going through Snap! (excluding any other API use, which hasn’t happened yet).
Also I have to check but I think this should be easy to do in batches because you can check unprocessed projects for NULL project notes (compared to ‘’).
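A hedged sketch of what one batch could look like under that scheme, where `NULL` notes mean "never synced" and `''` means "synced and known to be empty" (column names other than `notes` are assumptions):

```sql
-- Grab the next batch of unprocessed projects.
-- NULL = never synced; '' = synced and genuinely empty.
SELECT id, projectname, username
  FROM projects
 WHERE notes IS NULL
 ORDER BY id
 LIMIT 1000;
```

Each batch gets parsed, its notes written back, and the `WHERE notes IS NULL` filter naturally shrinks the remaining work.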
I know nothing about databases anymore, but back in the day we would make an index. One problem was that indices were blindingly fast but also took up lots of resources. At times our indices would take up more disk space than our "regular" data, but it was worth it for the stuff we were doing. I'm not sure whether it's the same here. Another approach would be to embrace pagination and combine that with a timeout threshold. As long as we have at least one persistent key for every project (which we now do have, right?), we can just find the first - say - 10 matches and then offer the user the option to look for "more". The timeout takes care of the case where it's either taking too long or we're through searching.
Indexes do take up some space, but not that much -- at least not with Postgres. (Which, random fun fact, was also initially developed at Berkeley!) In this case, we do have an index for projects on username, and getting the data from the database itself is quite quick.
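For anyone curious, Postgres can report how much space the table and an index actually take; something like the following (the index name here is made up, so substitute the real one from `\di`):

```sql
-- Compare table size vs. index size.
-- 'projects_username_idx' is a hypothetical index name.
SELECT pg_size_pretty(pg_relation_size('projects')) AS table_size,
       pg_size_pretty(pg_relation_size('projects_username_idx')) AS index_size;
```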
The issue with the project notes is that they exist in 2 places: the XML and the database. I'm confident we can make it such that we no longer need to read the XML to get the notes.
The way I see it is this:

- When saving a project from Snap!, we send the notes and the XML to the server. (This is already done.) In theory, if the notes are not present we could extract them from the XML, but that's an enhancement that only matters once we have other API clients.
- When updating notes on the social site, we can update both the XML and the DB. If updating the file in real time becomes an issue, we can update only the DB.
It's worth noting that, because you can edit notes both in Snap! and on the social site, it will be possible to update notes in 2 places and create sync issues. We can resolve this later (simply by checking updated timestamps) or by having Snap! check for notes before saving. Either way, it's a minor thing.
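If we ever do the timestamp check, a last-write-wins update could look roughly like this (the `updated` column, id, and timestamp literal are all hypothetical):

```sql
-- Apply the incoming notes only if our copy hasn't changed since
-- the client last fetched it; 0 rows updated signals a conflict.
UPDATE projects
   SET notes = 'new notes text',
       updated = now()
 WHERE id = 12345
   AND updated <= '2019-01-12 23:53:00';
```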
Re: indexes, space, and efficiency. Indexes on numbers (i.e. normal IDs) are generally more performant than indexes on strings (usernames). If we really need to eke out more performance we could change the foreign key from a `username` to a `user_id`. (The downside is that this requires an extra join to look up the `user_id`, so we're talking minor differences until we get to many many more projects and users.)
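Concretely, the two lookup shapes would be something like this (sketched, assuming a `users` table with `id` and `username` columns):

```sql
-- Today: string index on projects.username, no join needed.
SELECT * FROM projects WHERE username = 'alice';

-- With a numeric FK: faster integer comparisons, but an extra
-- join to translate the username into a user_id first.
SELECT p.*
  FROM projects p
  JOIN users u ON u.id = p.user_id
 WHERE u.username = 'alice';
```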
Also, the speed of the database can be greatly improved by adding additional hardware and tuning resources. We haven't done much testing or tweaking here, but that's a very viable option. Plus, if we get support from SAP then using a separate server for a DB will help make that performance a bit more consistent.
So, I ran some numbers, and this is good. We can very easily do this in batches because we do distinguish between null project notes and empty strings!
```
snapcloud=> SELECT COUNT(*) FROM projects WHERE notes is NULL;
  count
---------
 1470248
(1 row)

snapcloud=> SELECT COUNT(*) FROM projects WHERE notes = '';
 count
--------
 886012
(1 row)

snapcloud=> SELECT COUNT(*) FROM projects WHERE notes <> '' and notes is NOT NULL;
 count
-------
 20169
(1 row)

snapcloud=> SELECT COUNT(*) FROM projects;
  count
---------
 2376430
(1 row)
```
(Those numbers do add up to an off-by-one error, but I think that's because someone just saved a new project. 😁)
What this also means is that going forward, we could always use the null-vs-empty-string convention and choose to only read files dynamically if the values are null.
OH! So, I think @bromagosa actually did most of what we need to do!! 😀 -- and something I was going to suggest as an alternative.
The `updatingnotes=true` option reads from the disk AND, when it finds notes, it also updates the database! So, for each user there should be only 1 slow request, and then the rest should be fast.
There are two problems with this:

1. The check is `if (project.notes == nil or project.notes == '') then`. We should remove the second check: notes stored in the DB as `''` represent a project whose notes we know are empty, so there's no need to re-read the XML for those.
2. Something is returning `nil` when it should return an empty string. I haven't yet found exactly why, but once I figure that out, I'll make a fix for both.

After that, I think we can leave the lazy notes generation on for a while, but eventually we could migrate the old projects after waiting a bit -- it should be an interesting test to see how quickly that 1.4 million count drops.
I was looking at this today and we're down to only about 1.38 million projects that still have null notes. It seems like most of these probably belong to users who were students and are unlikely to log in again, but all recently created projects (well, since the beginning of this year) are working as designed.
Tidying issues and closing this. :)
https://snap-analytics.cs10.org/question/193-usernames-count-projects-with-no-synced-notes There are a small handful of projects which were not migrated, but they belong to users who have deleted their accounts.
I think it makes sense to sync existing notes in the DB. Right now, when you go to open your list of projects, the app has to read and parse all your projects.
Loading the project list is quite a bit faster without needing to read all the notes. (About 1/10 the time on my project list, though not a full 10x speedup in Snap! itself.)
This requires keeping 2 things in sync, which is a bit annoying, but I think it's worth it, given that the current situation involves reading (and eventually writing) the XML files to update notes.
I think there are 3 changes: