Closed: myieye closed this 1 month ago
This image tells an interesting story. The results are sorted by the sum of time spent on each type of request in the last 7 days.
Put another way, this is a pie chart of the sum of time spent.
Basically, hgweb is spending over 50% of its time responding to capabilities requests. It spends around 90% of its time responding to capabilities and our commit log requests. We can fix this.
I'm fairly certain we can just cache this across all projects. If we really wanted to, we could hard-code the response, but I think it makes more sense to just cache it for 24 hours.
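A minimal sketch of what that could look like, assuming a thin layer in front of hgweb (the function names here are placeholders, not the project's actual code):

```python
import time

# Cache the hgweb capabilities response for 24 hours. Since the response is
# the same across projects, a single cached value is enough.
CACHE_TTL_SECONDS = 24 * 60 * 60

_cached_response = None
_cached_at = 0.0

def get_capabilities(fetch_from_hgweb):
    """Return the capabilities response, refreshing it at most once per day.

    `fetch_from_hgweb` stands in for whatever currently forwards the
    request to hgweb; it is a hypothetical callable, not real project code.
    """
    global _cached_response, _cached_at
    now = time.time()
    if _cached_response is None or now - _cached_at > CACHE_TTL_SECONDS:
        _cached_response = fetch_from_hgweb()
        _cached_at = now
    return _cached_response
```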
Right now we're using the JSON API that hgweb provides because it's convenient, but I think we should shift that over to the command server, which may perform better. That would require some testing. We can use `hg log -T json` to get JSON output from the CLI, so we don't even need to parse the `hg log` output manually, which is a huge win.
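For example, something like this could pull changesets without going through the hgweb JSON API (a sketch that shells out to the plain CLI rather than the command server, but the JSON shape is the same; the repo path is made up):

```python
import json
import subprocess

def recent_changesets(repo_path: str, limit: int = 20):
    """Return recent changesets for a repo as parsed JSON via `hg log -T json`."""
    result = subprocess.run(
        ["hg", "log", "-T", "json", "-l", str(limit)],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    # `-T json` emits a JSON array of changesets, so no manual parsing
    # of the default `hg log` text format is needed.
    return json.loads(result.stdout)

# Hypothetical usage:
# for cs in recent_changesets("/var/hg/repos/some-project"):
#     print(cs["node"], cs["desc"])
```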
But if the capabilities request is slow because hgweb is refreshing its list of repos, then caching that request will just result in the next request being slow instead. I think capabilities is slow because it happens to be the first request in a send/receive.
Maybe that theory needs to be tested, but I think we ultimately need to solve the refresh performance problem.
Perhaps it's worth mentioning this again: we could maintain an exhaustive list of repos instead of using a wildcard in our hgweb paths config.
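For context, the difference in the hgweb config would look roughly like this (paths are made up):

```ini
# Wildcard style (roughly what we have now): hgweb discovers repos by
# scanning the directory behind the glob, which is presumably what the
# repo-list refresh is about.
[paths]
/ = /var/hg/repos/*

# Exhaustive-list alternative (shown commented out, since only one set of
# entries would exist in practice): every repo is listed explicitly, so no
# directory scan is needed, but the list has to be kept in sync as projects
# are created and deleted.
# project-one = /var/hg/repos/project-one
# project-two = /var/hg/repos/project-two
```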
I love the idea of using our command runner to get changesets for the UI.
Our performance is fine again. This happened after LTOps did some maintenance on the cluster. It could be that there was a node having an issue, or something else; it's not clear. The good news is that it's not a code issue on our end, but that's also the bad news. Sunday the 7th was when the maintenance happened.
😲 😲 😲 😲 😲 😲 😲 😲
Wow, just look at that long mysterious story of performance sorrows that unexpectedly ended with cheers of joy:
@hahn-kev Does anyone at TechOps know that we think they caused and fixed this for us?
Yeah, I've talked to Greg about it. They don't really know, which is pretty discouraging. I'm hoping we can get some more data to track what's going on, maybe make something to measure FS performance regularly.
This could be one of those "we rebooted and things are better now" situations.
Chris
We recently deployed a new version and the hg pod ended up on a different node; previously it was on the same node that the volume was mounted on. We also saw a small performance regression.
I think we can consider this issue fixed
Ideas we want to pursue:
Other ideas mentioned: