sillsdev / languageforge-lexbox

Lexbox, SIL linguistic data hub
MIT License

Research options for improving hg[web] performance #693

Closed myieye closed 1 month ago

myieye commented 7 months ago

Ideas we want to pursue:

Other ideas mentioned:

hahn-kev commented 7 months ago

This image tells an interesting story. The results are sorted by the sum of time spent on each type of request in the last 7 days. [image]

Put another way, this is a pie chart of the sum of time spent. [image: Picture1]

Basically, hgweb is spending over 50% of its time responding to capabilities requests. It spends around 90% of its time responding to capabilities and our commit log requests. We can fix this.

capabilities

I'm fairly certain we can just cache this across all projects. If we really wanted, we could just hard-code the response, but I think it makes more sense to cache it for 24 hours.
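
As a rough illustration of the caching idea, here is a minimal sketch of a time-based cache that holds one shared value and refreshes it after a fixed lifetime. The class name `TtlCache` and the `fetch` callable are hypothetical, not anything from the Lexbox code base; the fetch function would be whatever actually asks hgweb for its capabilities.

```python
import time

_MISSING = object()  # sentinel so that a cached None is still a valid value


class TtlCache:
    """Cache a single value and refresh it once it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._fetch = fetch          # hypothetical: callable that performs the real request
        self._ttl = ttl_seconds      # default lifetime of 24 hours, as suggested above
        self._clock = clock          # injectable clock, mainly to make testing easy
        self._value = _MISSING
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is _MISSING or now >= self._expires_at:
            # First call, or the cached value has expired: fetch again.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Because the value is shared across all projects, only one request per lifetime reaches hgweb. This sketch is not thread-safe; a real service would need a lock or the framework's own cache.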

hg log

Right now we're using the JSON API that hgweb provides because it's convenient, but I think we should shift that over to the command server, which may perform better. That would require some testing. We can use hg log -T json to get JSON output from the CLI, so we don't even need to parse the hg log output manually, which is a huge win.
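
To make the `hg log -T json` idea concrete, here is a small sketch that shells out to the CLI and parses the result. It uses a plain subprocess rather than the command server, and the function names are made up for illustration. It assumes each JSON entry carries `node`, `user`, `desc` and a `date` pair of `[unix_timestamp, tz_offset]`, which should be checked against the hg version in use.

```python
import json
import subprocess
from datetime import datetime, timezone


def build_log_command(repo_path, limit=20):
    # -R selects the repository, -T json requests JSON output,
    # -l limits the number of changesets returned.
    return ["hg", "log", "-R", repo_path, "-T", "json", "-l", str(limit)]


def parse_log_json(raw):
    """Turn the JSON array printed by `hg log -T json` into simple dicts."""
    changesets = []
    for entry in json.loads(raw):
        # Assumption: date is [unix_timestamp, tz_offset_seconds].
        timestamp, _tz_offset = entry["date"]
        changesets.append({
            "node": entry["node"],
            "user": entry["user"],
            "desc": entry["desc"],
            "date": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        })
    return changesets


def get_changesets(repo_path, limit=20):
    # check=True raises CalledProcessError if hg exits with an error.
    result = subprocess.run(
        build_log_command(repo_path, limit),
        capture_output=True, text=True, check=True,
    )
    return parse_log_json(result.stdout)
```

Keeping the parsing separate from the process call means the same parser could be reused if the output later comes from the command server instead.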

myieye commented 7 months ago

But if the capabilities request is slow because hgweb is refreshing its list of repos, then caching that request will just result in the next request being slow. I think capabilities is slow because it happens to be the first request in a send/receive.

Maybe that theory needs to be tested. But, I think we ultimately need to solve the refresh performance problem.

Perhaps it's worth mentioning this again: we could maintain an exhaustive list of repos instead of using a wildcard in our hgweb paths config.
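
One way to picture the exhaustive-list option: instead of a wildcard entry such as `/ = /some/root/*` in the hgweb `[paths]` section, a small script could write out one `name = path` line per repo. The sketch below is only an assumption about how that could be generated; the function name and directory layout are hypothetical, and the file would need regenerating whenever a project is created or deleted.

```python
from pathlib import Path


def build_paths_section(repo_root):
    """Return an hgweb [paths] section that lists every repo explicitly."""
    root = Path(repo_root)
    lines = ["[paths]"]
    for child in sorted(root.iterdir()):
        # Only directories containing a .hg folder are treated as repos.
        if (child / ".hg").is_dir():
            lines.append(f"{child.name} = {child}")
    return "\n".join(lines) + "\n"
```

Whether this actually avoids the refresh cost would still need measuring against the wildcard configuration.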

I love the idea of using our command runner to get change sets for the UI.

hahn-kev commented 7 months ago

Our performance is fine again. This happened after LTOps did some maintenance on the cluster. It could be that a node was having an issue, or something else; it's not clear. The good news is that it's not a code issue on our end, but that's also the bad news. The maintenance happened on Sunday the 7th. [image]

myieye commented 7 months ago

😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲 😲

myieye commented 7 months ago

Wow, just look at that long mysterious story of performance sorrows that unexpectedly ended with cheers of joy:

[image]

@hahn-kev Does anyone at TechOps know that we think they caused and fixed this for us?

hahn-kev commented 7 months ago

Yeah, I've talked to Greg about it. They don't really know, which is pretty discouraging. I'm hoping we can get some more data to track what's going on. Maybe make something to measure FS performance regularly.
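
A regular FS measurement could start as simply as timing a small write and read on the volume in question. The following is a minimal sketch under that assumption; `probe_fs` is a hypothetical name, and the numbers it returns would still need to be shipped to whatever metrics system is in use.

```python
import os
import time
import uuid


def probe_fs(directory, size_bytes=1024 * 1024):
    """Time one write (including fsync) and one read of a temporary file."""
    path = os.path.join(directory, f".fs-probe-{uuid.uuid4().hex}")
    payload = os.urandom(size_bytes)
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the underlying storage
        write_seconds = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()
        read_seconds = time.perf_counter() - start
    finally:
        # Always clean up the probe file, even if the probe failed.
        if os.path.exists(path):
            os.remove(path)
    return {
        "write_seconds": write_seconds,
        "read_seconds": read_seconds,
        "bytes": len(data),
    }
```

The read is likely to be served from the page cache right after the write, so the write timing is the more telling number here. Run on a schedule, even a crude probe like this would show whether latency shifts after maintenance or when a pod moves to another node.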

megahirt commented 7 months ago

This could be one of those "we rebooted and things are better now" situations.

Chris

hahn-kev commented 6 months ago

We recently deployed a new version and the hg pod ended up on a different node. Previously, it was on the same node that the volume was mounted on. We also saw a small performance regression. [image]

hahn-kev commented 1 month ago

I think we can consider this issue fixed