mitmedialab / gobo

💭 Gobo: Your social media. Your rules.
https://gobo.social
MIT License

re-architect content fetching #31

Closed: rahulbot closed this issue 6 years ago

rahulbot commented 6 years ago

Content fetching is too slow the way we do it now (#29).
Perhaps something like this could work:

This needs some thoughtful design and testing to see if it would work.

rajaramsrinivas commented 6 years ago

One other solution I can think of is a delayed write to the db. We are currently doing a db.commit() for every post, and I think that is way too many hits on the db. We could write the entire batch of posts to the db in one shot, since Postgres tends to perform better with one large operation than with many small ones.
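A minimal sketch of the difference, assuming a Flask-SQLAlchemy `db` handle and a `Post` model (the loop and variable names here are illustrative, not the actual Gobo code):

```python
# Current pattern: one commit (and one round trip plus fsync) per post.
for data in fetched_posts:
    db.session.add(Post(**data))
    db.session.commit()

# Delayed write: stage the whole batch, then commit once per batch.
for data in fetched_posts:
    db.session.add(Post(**data))
db.session.commit()
```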

rajaramsrinivas commented 6 years ago

I turned on query logging in Postgres to see whether Flask (via SQLAlchemy) performs any optimizations if we simply skip the db.commit(). It turns out it doesn't.
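(If it helps while testing, an app-side alternative is to turn on SQLAlchemy's statement echo so every emitted query shows up in the Flask logs; this is a standard Flask-SQLAlchemy setting, not anything Gobo-specific.)

```python
# App-side alternative to Postgres query logging: log every SQL statement
# SQLAlchemy emits. Assumes the usual Flask-SQLAlchemy `app` config pattern.
app.config["SQLALCHEMY_ECHO"] = True
```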

The Pastebin links below contain the Postgres query logs for the inserts and updates. We seem to be hitting the db individually for every single insert and update: at least two inserts per post (one for the post and one for the association) and four updates per post. This, in my opinion, can severely affect performance and limit Gobo's ability to scale.

https://pastebin.com/EAnsWZfa https://pastebin.com/njvhSS2z

The updates are the easiest to fix. We need to maintain them as dicts, batch them up, and write them to the db once per batch.
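One possible shape for that, assuming a `Post` model keyed by `id` and filter results stored in ordinary columns (the helper names are made up):

```python
# Accumulate per-post results as dicts keyed by primary key, then push them
# to Postgres in one batch instead of one UPDATE per filter per post.
pending = {}  # post id -> {"id": ..., column: value, ...}

def queue_update(post_id, **computed):
    pending.setdefault(post_id, {"id": post_id}).update(computed)

def flush_updates():
    if pending:
        # bulk_update_mappings issues batched UPDATEs grouped by column set
        db.session.bulk_update_mappings(Post, list(pending.values()))
        db.session.commit()
        pending.clear()
```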

The inserts might be a little more difficult. We could perhaps write to a temp file and use COPY FROM. I am trying to identify time gaps in the logs to confirm my hypothesis.
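A rough sketch of the COPY idea using psycopg2's copy_from with an in-memory buffer instead of a temp file (the table and column names are assumptions, and real post content would need tab/newline escaping, or copy_expert with CSV):

```python
import io

def copy_insert_posts(engine, rows):
    # Build the COPY text format: tab-separated columns, one row per line.
    buf = io.StringIO()
    for source, content in rows:
        buf.write(f"{source}\t{content}\n")
    buf.seek(0)

    # Drop down to the raw psycopg2 connection underneath SQLAlchemy.
    raw = engine.raw_connection()
    try:
        with raw.cursor() as cur:
            cur.copy_from(buf, "posts", columns=("source", "content"))
        raw.commit()
    finally:
        raw.close()
```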

rahulbot commented 6 years ago

Agreed - fixing the updates seems easy. I think you're suggesting that each filter called in analyze_post should just return its computed result, and then analyze_post can make one update call.
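Roughly this shape, if I'm reading the suggestion right (the filter and column names below are placeholders, not Gobo's actual filters):

```python
def analyze_post(post_id):
    post = Post.query.get(post_id)
    results = {}
    for run_filter in (filter_a, filter_b, filter_c):  # each returns {column: value}
        results.update(run_filter(post))
    # One UPDATE for all computed columns instead of one per filter.
    Post.query.filter_by(id=post_id).update(results)
    db.session.commit()
```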

Why are the inserts harder? All _add_post does is perform an insert and then queue up a task for each filter to run on the post. It is only called from the for loops for Facebook-sourced and Twitter-sourced content. Could we just change those two to chunk the posts and then have _add_post accept a batch of 20 or so to insert more efficiently? More importantly, if we did, would that actually reduce the DB load?
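Something like the following, just to make the question concrete (the chunk size, column names, and the enqueue_filter_tasks helper are assumptions, not the current code):

```python
def chunked(items, size=20):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def _add_posts(user, batch):
    posts = [Post(user_id=user.id, content=c) for c in batch]
    db.session.add_all(posts)
    db.session.commit()              # one transaction per chunk, not per post
    for post in posts:
        enqueue_filter_tasks(post)   # same per-post task fan-out as today

for batch in chunked(fetched_posts):
    _add_posts(current_user, batch)
```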

rajaramsrinivas commented 6 years ago

I said the inserts could be harder because the query says "insert x returning post id". I'm not sure if we're using the returned post id anywhere; it could just be how SQLAlchemy works and we don't use it at all. I will go through the code and confirm. If we don't use it, batching the inserts should be straightforward too.
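For context on the RETURNING: when the ORM inserts a row whose primary key is generated by Postgres, SQLAlchemy appends RETURNING so the new id is available on the object right after the flush. If nothing reads that id back, a batch path can skip it. A sketch (the do_something_with consumer and the posts list are hypothetical):

```python
# Per-post path: RETURNING exists so post.id is usable immediately after flush.
post = Post(content=data)
db.session.add(post)
db.session.flush()                    # emits INSERT ... RETURNING posts.id
do_something_with(post.id)            # hypothetical consumer of the id

# Batch path, if the ids are never read back (posts = list of Post objects):
db.session.bulk_save_objects(posts)   # skips fetching primary keys by default
db.session.commit()
```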