Closed: rahulbot closed this 6 years ago
One other solution I can think of is a delayed write to the db. We currently do a db.commit() for every post, which I think is far too many hits on the db. We could write the entire batch of posts to the db in one shot, since Postgres tends to perform better with one large operation than with many small ones.
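A minimal sketch of the batched-commit idea. This uses stdlib sqlite3 just so it runs anywhere; the `posts` table and its columns are made up, but the same pattern (one multi-row insert, one commit) applies with psycopg2 or SQLAlchemy against Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT)")

# Hypothetical batch of posts pulled from a source feed
batch = [("post %d" % i,) for i in range(100)]

# One executemany and one commit, instead of a commit per post
conn.executemany("INSERT INTO posts (content) VALUES (?)", batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(count)  # → 100
```

The win is that the db does one round of WAL/index work per batch rather than per row.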
I turned on query logging in Postgres to see whether Flask performs any optimizations if we simply skip the db.commit(). It turns out it doesn't.
The Pastebin links below contain the insert and update Postgres query logs. We are hitting the db individually for every single insert and update: at least two inserts per post (one for the post and one for the association) and four updates per post. In my opinion this can severely affect performance and limit Gobo's ability to scale.
https://pastebin.com/EAnsWZfa https://pastebin.com/njvhSS2z
The updates are the easiest to fix: maintain them as dicts, batch them up, and write them to the db once per batch.
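The dict-accumulation idea might look roughly like this (a sketch with stdlib sqlite3 for portability; the `toxicity` and `gender` columns are hypothetical stand-ins for whatever the filters compute):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, toxicity REAL, gender REAL)")
conn.executemany("INSERT INTO posts (id, toxicity, gender) VALUES (?, 0, 0)",
                 [(i,) for i in range(1, 6)])

# Each filter returns its result instead of writing it; accumulate per post
pending = {}  # post_id -> {column: value}
for post_id in range(1, 6):
    pending.setdefault(post_id, {})["toxicity"] = float(post_id)  # fake filter output
    pending.setdefault(post_id, {})["gender"] = 0.5               # fake filter output

# Flush the whole batch with one executemany and one commit
rows = [(vals["toxicity"], vals["gender"], pid) for pid, vals in pending.items()]
conn.executemany("UPDATE posts SET toxicity = ?, gender = ? WHERE id = ?", rows)
conn.commit()

check = conn.execute("SELECT toxicity FROM posts WHERE id = 3").fetchone()[0]
print(check)  # → 3.0
```

With SQLAlchemy the same list of dicts could likely be handed to a single bulk update call instead of the raw executemany.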
The inserts might be a little more difficult. We could perhaps write to a temp file and use COPY FROM. I am trying to identify time gaps in the logs to confirm my hypothesis.
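For the COPY FROM idea, the runnable part to sketch here is building the tab-separated buffer; the actual load would go through psycopg2's `cursor.copy_from`, which isn't runnable without a Postgres connection. The `posts` columns below are hypothetical:

```python
import io

# Hypothetical posts to load; with psycopg2 this buffer would feed
# cursor.copy_from(buf, "posts", columns=("source", "content"))
posts = [("twitter", "hello"), ("facebook", "world")]

# COPY's default text format: tab-separated columns, newline-terminated rows
buf = io.StringIO()
for source, content in posts:
    buf.write("%s\t%s\n" % (source, content))
buf.seek(0)

first_line = buf.readline()
print(repr(first_line))  # → 'twitter\thello\n'
```

Note that real post content would need escaping of tabs/newlines before going into the buffer; COPY is strict about its text format.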
Agreed - fixing the updates seems easy. I think you're suggesting that each filter called in analyze_post should just return the computed result, and then analyze_post can do one update call.
Why are the inserts harder? All _add_post does is insert the post and then queue up a task for each filter to run on it. It is only called from for loops for Facebook-sourced and Twitter-sourced content. Could we just change those two loops to chunk the posts and then have _add_post accept a batch of 20 or so to insert more efficiently? More importantly, if we did, would that actually reduce the DB load?
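The chunking half of that suggestion is mechanical; something like this (the helper name and batch size of 20 are made up):

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical: each chunk would become one batched _add_post call
posts = list(range(95))
batches = list(chunked(posts, 20))
print([len(b) for b in batches])  # → [20, 20, 20, 20, 15]
```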
I said the inserts could be harder because the logged query is an INSERT ... RETURNING the post id, and I'm not sure whether we use that returned post id anywhere. It may just be an artifact of how SQLAlchemy works and we don't use it at all. I will go through the code and confirm; if we don't, batching the inserts should be straightforward too.
It is too slow the way we do it now (#29).
Perhaps something like this could work:
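A minimal sketch, again with stdlib sqlite3 for portability: insert the whole batch in one statement, then recover the generated ids by the source's own post id instead of relying on a per-row RETURNING. The `external_id` column and the `tw-*` keys are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, external_id TEXT, content TEXT)")

# Hypothetical batch keyed by the source's own post id
batch = [("tw-1", "a"), ("tw-2", "b"), ("tw-3", "c")]

# One multi-row insert, one commit...
conn.executemany("INSERT INTO posts (external_id, content) VALUES (?, ?)", batch)
conn.commit()

# ...then one query to map external ids back to our generated ids,
# replacing the per-row "INSERT ... RETURNING id" round trips
ids = dict(conn.execute(
    "SELECT external_id, id FROM posts WHERE external_id IN (?, ?, ?)",
    [eid for eid, _ in batch]))
print(ids)  # → {'tw-1': 1, 'tw-2': 2, 'tw-3': 3}
```

The filter tasks could then be queued from the `ids` mapping after the batch lands.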
This needs some thoughtful design and testing to see if it would work.