scragg0x / realms-wiki

Git based wiki inspired by Gollum
http://realms.io
GNU General Public License v2.0

Add cache timeout of 5s, now works with offline git commits #192

Closed dereks closed 7 years ago

dereks commented 7 years ago

To me, the primary benefit of a git-and-file-based wiki is that one can also use plain git + text editor to maintain the wiki (in addition to using the beautiful browser interface).

Realms is currently incompatible with this type of usage, because it keeps its cached pages forever (unless the user makes a change using the Realms GUI). It will ignore a new git commit that was not made via the Realms GUI, because it finds its old cached HTML and thinks that old HTML is still valid.

This commit adds a 5 second cache timeout, so offline git commits show up within at most 5 s (when the Realms cache expires). The hard-coded magic number of 5 s limits cache refreshes to at most 12 per minute per page, which still provides good caching for busy servers, but now allows us SSH+vi users to easily create and maintain pages.

gazpachoking commented 7 years ago

My plan (#155) was to make a module to set up an external git repo that can be synced with, and automatically invalidate cache when it pulls changes from remote. Unfortunately I haven't quite found the motivation to actually make it yet. Does this change also invalidate the history cache? That takes a very long time to read from the git repo depending on how much history you have in there.

dereks commented 7 years ago

Does this change also invalidate the history cache?

Yes, all cached items which use the global Python object 'cache' (which is a Flask 'Cache' object) now expire after 5s.
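The expiry semantics can be sketched in isolation like this (illustrative Python only; Realms actually gets this behavior from the timeout on its Flask cache object, and the class and method names here are mine):

```python
import time

class TTLCache:
    """Toy sketch of the 5 s expiry behavior: entries become invisible
    after `timeout` seconds, forcing a re-render from the git repo."""

    def __init__(self, timeout=5):
        self.timeout = timeout
        self._store = {}  # key -> (value, stored_at)

    def set(self, key, value, now=None):
        # `now` is injectable for testing; defaults to the wall clock.
        self._store[key] = (value, time.time() if now is None else now)

    def get(self, key, now=None):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if (time.time() if now is None else now) - stored_at >= self.timeout:
            del self._store[key]  # expired: treat as a cache miss
            return None
        return value
```

With a 5 s timeout, an offline `git commit` is therefore visible as soon as the stale entry ages out, without any explicit invalidation.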

That takes a very long time to read from the git repo depending on how much history you have in there.

It shouldn't.

I have used the "git log" command with large git histories (like the Android AOSP source, Linux kernel source, long-running software projects, etc.) and it's always very fast. I don't see any reason it should be slow in Realms-wiki.

If you can provide a public git repo of Markdown files, with a large history, I will try to repro the "slow" history page you describe. If it takes "a very long time", then that is a bug that should be fixed in Realms-wiki... I would be happy to help with that, as I am putting Realms-wiki into production use here.

scragg0x commented 7 years ago

A busy site would likely put a reverse proxy cache such as Varnish or nginx, or a CDN like Cloudflare, in front of the wiki. Also, to invalidate the server-side cache more precisely, a post-commit git hook can be used to execute a Realms CLI command. I've never done the latter before, though.

gazpachoking commented 7 years ago

Some of the history for our site (http://flexget.com) can take up to 30 seconds to load on our server (here's the git repo for it: https://github.com/Flexget/wiki). It isn't so bad now that the actual items are loaded in the background, as at least the page comes up quickly. Formerly, every page load triggered finding the last commit which affected that page, which meant every non-cached page load could take that long to come up (luckily that isn't a problem anymore). This isn't unique to Realms; you can see they had the exact same problem over at the Gollum wiki: https://github.com/gollum/gollum/issues/1078

gazpachoking commented 7 years ago

Also to invalidate server-side cache better a post-commit git hook can be used to execute a realms cli cmd. I've never done the latter before though.

I think something like this would be nicer than invalidating the cache every 5 seconds by default. If we ever make an integrated plugin to sync with a remote repo the invalidation setup would be greatly eased.

gazpachoking commented 7 years ago

Or, could we make this just apply to the wiki pages and not the history cache? The history cache is built so that the old bits can stay cached and will get updated with only newer items when needed.

dereks commented 7 years ago

Or, could we make this just apply to the wiki pages and not the history cache?

No, I think that would defeat the purpose, which is to make Realms-wiki compatible with out-of-band editing of the wiki's .md files with pure git. An out-of-date history would be displayed to Realms users, but not command-line git users, if the history cache did not expire.

Some of the history for our site (http://flexget.com) can take up to 30 seconds to load on our server.

That is a straight-up bug. This is something that can be fixed. Maximum page load time should be 1 to 5 seconds. On my ~10-year-old AMD Athlon(tm) 64 X2 Dual Core, with Realms-wiki running through gunicorn and SSL, most of my lightweight test pages take less than 0.5 s to load.

I see your site is public; would you be willing to publish your server-side git database to github? I'd like to run timing tests. First using Realms, then command-line git, to see where the slowness is.

Perhaps the usage pattern of git requires using git gc or git prune, or perhaps there is a parser recursion someplace, or perhaps SQLite (or whatever your database server) is the culprit. Or perhaps I will be unable to reproduce (could the problem depend on the memory or disk I/O limits of your server?).

As an aside, there are several types of backend storage supported by Flask's Cache (from their docs):

Maybe trying a different cache backend is the correct solution... anyway, I'd like to fix the bug. 30 seconds is too long.

This isn't unique to realms, you can see they had the exact same problem over at gollum wiki

The bug you referenced had a comment proposing two possible causes:

In Gollum, both of those things are done by Ruby libraries that Realms-wiki does not use. (The "Poor caching" mentioned there was a reference to needing multiple backend types, like Flask has). I would be surprised if the root cause of your 30 s page time is the same.

I agree that a 5 s cache timeout is a simplistic solution. A git hook would be better. It would be best if the git hook could invalidate just the affected pages' cache entries, rather than flushing the whole thing.

gazpachoking commented 7 years ago

No, I think that would defeat the purpose, which is to make Realms-wiki compatible with out-of-band editing of the wiki's .md files with pure git. An out-of-date history would be displayed to Realms users, but not command-line git users, if the history cache did not expire.

The way the history cache works (or at least the way I intended) is that it compares the current HEAD to the cached head from the history cache, and generates any missing pieces between HEAD and the last cached item. I think that should still work even when we don't invalidate the history cache.
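That incremental idea can be sketched roughly like this (function and key names are mine, not the actual Realms cache code):

```python
def update_history_cache(cached, new_head, walk_between):
    """Gap-filling history cache sketch.

    cached: {'head': sha, 'items': [...]} or None (nothing cached yet).
    walk_between(new_head, old_head): yields history items, newest first,
    for commits after old_head up to and including new_head.
    """
    if cached is None:
        # Cold cache: generate the full history once.
        items = list(walk_between(new_head, None))
        return {'head': new_head, 'items': items}
    if cached['head'] == new_head:
        return cached  # HEAD unchanged: nothing new to generate
    # Only walk the gap between the new HEAD and the cached head,
    # then prepend it to the already-generated (still valid) items.
    fresh = list(walk_between(new_head, cached['head']))
    return {'head': new_head, 'items': fresh + cached['items']}
```

The point being that the expensive full-repo walk happens at most once; later updates only pay for the commits that arrived since the cached head.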

I see your site is public; would you be willing to publish your server-side git database to github? I'd like to run timing tests. First using Realms, then command-line git, to see where the slowness is.

Already on github here https://github.com/Flexget/wiki . Would be awesome if you can speed this up. I couldn't get it to work fast, which is why I implemented the somewhat complicated history cache capable of filling in the gaps and only loading a page at a time. Would be great to remove all that if we can make it work faster.

In Gollum, both of those things are done by Ruby libraries that Realms-wiki does not use. (The "Poor caching" mentioned there was a reference to needing multiple backend types, like Flask has). I would be surprised if the root cause of your 30 s page time is the same.

https://github.com/gollum/gollum/issues/1078#issuecomment-238097174 Here you can see they traced the slowness to querying the underlying git repo for the commit history of a given file. They decided on the same solutions I had in Realms, which was to eliminate the call to determine the latest commit to a file when viewing a page, and only do that with AJAX in other places, due to the inherent slowness. I had assumed this very similar problem was because of the way git works: you have to load a lot of the git blobs to do this tracing of history for a file, which manifested in both the Python and Ruby git libs. I could be wrong though; once again, it would be great if you can figure out something to speed it up.

Perhaps the usage pattern of git requires using git gc or git prune, or perhaps there is a parser recursion someplace, or perhaps SQLite (or whatever your database server) is the culprit. Or perhaps I will be unable to reproduce (could the problem depend on the memory or disk I/O limits of your server?).

There isn't a database involved; logins are via OAuth, and I'm pretty sure that's the only thing Realms uses the db for. The slowness could certainly be due to disk limits of the server, since doing that trace requires loading a bunch of the git blobs off the disk, but I think for that reason it's good to keep the already-generated history cached (if we can't figure out a significant speedup).

As an aside, there are several types of backend storage supported by Flask's Cache (from their docs):

flexget.com is using redis currently, but everything is nice and quick when it's cached already. The problem is just when it has to generate the history cache.

gazpachoking commented 7 years ago

To expand on what I mean by git history inherently requiring loading many blobs from the disk, what it must do to get the history for any given wiki file in the repo is: Read every commit blob from the disk in the entire history of the git repo all the way from HEAD back to the initial commit. For every one of those commits, it also has to load the associated tree blob from disk and scan to see if that commit affected the file in question. When the git repo gets a lot of commits, and a lot of files, that's quite a bit of stuff to do.
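The walk described above can be modeled with a toy sketch (plain dicts standing in for real git objects; this is not dulwich code):

```python
def file_history(commits, head, path):
    """Return shas of commits that changed `path`, newest first.

    commits: {sha: {'parent': sha_or_None, 'tree': {path: blob_id}}}.
    Git keeps no per-file log, so we must visit *every* commit from HEAD
    back to the root and diff the file's blob id against the parent's --
    this is why cost scales with total repo history, not the file's.
    """
    history = []
    sha = head
    while sha is not None:
        commit = commits[sha]
        parent = commit['parent']
        blob = commit['tree'].get(path)
        parent_blob = commits[parent]['tree'].get(path) if parent else None
        if blob != parent_blob:  # this commit touched the file
            history.append(sha)
        sha = parent
    return history
```

Real git does the same conceptual walk, but with heavy pruning and flagging optimizations, which is where the later discussion about C-git's speed comes in.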

dereks commented 7 years ago

Thanks for the references. I will take a look within the next couple of days and try to repro the issue.

Here you can see they traced the slowness to querying the underlying git repo for the commit history of a given file.

I disagree that it was traced to the underlying git repo. That comment specifically mentions the slowness is in a Ruby function called page.last_version, and furthermore, it says the slowness is caused by their Ruby code instead of raw git:

This is indeed extra slow when using the rugged_adapter, as I think @joaquinvanschoren reported. That is because the rugged_adapter implements the relevant logic (in Gollum::Git#versions_for_path) in ruby, whereas grit ends up calling C-git, which is significantly faster.

(Emphasis mine.) So my money is still on a bug in the Realms Python, and not in the git database itself. Time for science. I'll report back to this thread.

dereks commented 7 years ago

@gazpachoking, Could you post the full paths to 2 or 3 of your slowest history pages on Flexget?

Please list exact pages where you see the symptom. I will diagnose those pages in particular. The slower the better. (Please post here publicly so others can also reproduce and/or diagnose this issue.)

Thanks!

gazpachoking commented 7 years ago

The slowest history load times will be for any pages that only have enough commits to fit on one page of the history view. The time should be roughly the same for any such page, as it has to trace the entire history of the repo in one go to generate the history items for that file. It's a bit hard to link to them, because as soon as one of those history pages is visited it gets cached and the problem disappears for that page. If you go to the index page and pick a random page, you are pretty likely to come across one; we have a few heavily edited pages, but a bunch of them probably only have a few commits and will show the issue.

gazpachoking commented 7 years ago

Hard to link them a bit because as soon as one of those history pages is visited it gets cached and the problem disappears for that page.

Hmm, that was true last time I tested, but it seems maybe something isn't quite working as planned with the cache even on our deployed version. Here is one such page, https://flexget.com/_history/PluginError Who knows if it will already be cached when you try.

EDIT: A couple more such pages: https://flexget.com/_history/TerminalTable https://flexget.com/_history/Searches/divxatope Also, using the flat index makes it easier (and more likely) that you'll get to such a rarely edited page, as our deeply nested ones tend to be less active: https://flexget.com/_index?flat=True

gazpachoking commented 7 years ago

I guess I'm actually not even super fussed about history load time anymore, now that the items are loaded with AJAX. History isn't needed that commonly anyway. If we can speed it up, great; otherwise I'm down for whatever everyone thinks is best. If the cache stays shortened to 5 seconds, though, we should probably cut out some of the complicated history gap-filling cache I have in there now, as it won't be very useful at 5 s intervals and just adds complexity.

gazpachoking commented 7 years ago

I disagree that it was traced to the underlying git repo. That comment specifically mentions the slowness is in a Ruby function called page.last_version, and furthermore, it says the slowness is caused by their Ruby code instead of raw git:

That Ruby function had to trace the history commits in the git repo just like our history page does. And our git library (dulwich) is pure Python (EDIT: I forgot about the C extensions in dulwich; just double-checked, and they are enabled on that server), which is why I think it's slow for the same reason their pure Ruby git library was slow at that operation.

EDIT: To elaborate, in a git repo with many commits, calling their ruby function page.last_version on a worst case page that was only edited in the initial git commit of the repo would cause the same history trace all the way down to the root commit just like a full history of the page would.

dereks commented 7 years ago

I am able to reproduce the issue. The Realms AJAX call to get history for the home page (home.md) takes about 5 seconds. This is a ridiculously long time, as home.md only has 311 commits.

In contrast, I can get the entire history of home.md with command-line git in less than a quarter-second:

ubuntu@cst-wiki:~/Flexget/wiki$ time git log home.md | wc
   1865    4688   47213

real    0m0.245s
user    0m0.240s
sys 0m0.004s

My conclusion is that the bug is not in the git database, it is somewhere in Realms or Dulwich.

That ruby function had to trace the history commits in the git repo just like our history page does.

I do not understand why this is necessary. Surely there is an equivalent of git log filename.md under Dulwich? I've never used Dulwich before, so I'll dig into it more later today.

Thanks again.

gazpachoking commented 7 years ago

I am able to reproduce the issue. The Realms AJAX call to get history for the home page (home.md) takes about 5 seconds. This is a ridiculously long time, as home.md only has 311 commits.

It doesn't matter how many commits the actual file has, just how many the whole repo has. The way git works under the hood, you cannot trace back the history of commits for just one file; you must trace from the head commit back, checking whether each commit touched the file. Now that I think about it, though, there might be an easy optimization to be added: I'm not sure if it stops searching at the commit where the file was added. It might search all the way back to the root commit no matter what. I could be wrong, but it deserves some investigation.

In contrast, I can get the entire history of home.md with command-line git in less than a quarter-second:

My assumption was that was just a much faster C implementation.

dereks commented 7 years ago

(cont'd)

First, a note: Realms is walking the commit tree, searching for relevant commits. Newer versions of Dulwich have a very similar method called dulwich.porcelain.log(), which does the same thing (but it prints it to STDOUT instead of returning a list of Python commit objects).

Eliminating Realms entirely and just calling dulwich.porcelain.log() is also very slow. The following line of Python code takes several seconds to execute:

dulwich.porcelain.log("../../Flexget/wiki/", paths=iter(["home.md"]))

It seems that walking the Dulwich commit tree in Python is just plain slow. We are not the first ones to be burned by this. Read this code comment I found in another Python project, at https://github.com/jonashaag/klaus/blob/master/klaus/repo.py:

    def history(self, commit, path=None, max_commits=None, skip=0):
        """Return a list of all commits that affected `path`, starting at branch
        or commit `commit`. `skip` can be used for pagination, `max_commits`
        to limit the number of commits returned.

        Similar to `git log [branch/commit] [--skip skip] [-n max_commits]`.
        """
        # XXX The pure-Python/dulwich code is very slow compared to `git log`
        #     at the time of this writing (mid-2012).
        #     For instance, `git log .tx` in the Django root directory takes
        #     about 0.15s on my machine whereas the history() method needs 5s.
        #     Therefore we use `git log` here until dulwich gets faster.
        #     For the pure-Python implementation, see the 'purepy-hist' branch.

        cmd = ['git', 'log', '--format=%H']
        if skip:
            cmd.append('--skip=%d' % skip)
        if max_commits:
            cmd.append('--max-count=%d' % max_commits)
        cmd.append(commit.id)
        if path:
            cmd.extend(['--', path])

        output = check_output(cmd, cwd=os.path.abspath(self.path))
        sha1_sums = output.strip().split(b'\n')
        return [self[sha1] for sha1 in sha1_sums]

Personally, I like his solution of simply shelling out to "git log". Very few lines of code.

And I found this comment at https://pypi.python.org/pypi/gitmit/0.1 .

Originally I was shelling out to git log for every file. This was slow! I then moved to dulwich (python implementation of git) which was faster, but still slow. Finally I implemented it with pygit2 (libgit2 c bindings in python) and decided to make it a separate library.

So it seems that the Dulwich commit object hierarchy is to blame.

I will look at the git C code. I don't know if it is faster simply because of C, or if they have a smarter way to find relevant entries in the tree. But right now it looks like this Realms bug fix will boil down to one of these:

  1. Find a new, faster way to walk the Dulwich commit tree finding relevant entries, or
  2. Use the C code, either with pygit2, or by shelling out to git log ... .

It seems like fixing this in Dulwich would be generally useful, so I will take a stab at that first. But I don't hold high hopes of finding a magic walker algorithm in git's C code.

Assuming that fails, what are your thoughts on libgit2 vs. git log ...?

dereks commented 7 years ago

(cont'd)

I did a deep dive into Dulwich, and also read through much of the command-line git C code. I used the Python speed test script for Dulwich posted here, which includes profiling (it's an attachment to comment #4):

https://bugs.launchpad.net/dulwich/+bug/535118

As a simple example, I found that this one line of code, which provides the entire Flexget commit history, takes about 1 second on my old hardware:

dulwich.porcelain.log("../../Flexget/wiki/")

However, to get the history for only home.md, it takes over 20 seconds! Note the additional argument paths:

dulwich.porcelain.log("../../Flexget/wiki/", paths=iter(["home.md"]))

The big jump in slowness takes place in Dulwich's walk.py, in method Walker._should_return(), which is called on each commit to decide if it should be filtered out or not (depending on whether the commit shows a change to one of the files listed in paths[]).

This test is done very inefficiently. Walker() walks over every commit, getting all of the deltas between it and its parent (by recursively walking over all associated git Tree objects and comparing file revision numbers). It then only returns the commit if its delta matches one of the supplied paths. This means (basically) that every commit gets its entire Tree object hierarchy walked over twice: once to check against its parent, and once again to check against its child. (Most of the Tree objects get walked multiple times, as they are used by multiple commits.)

There is no caching of the results of walking the Trees. There is no early filtering of git Trees that have a "previously seen" revision, nor pruning of sub-Trees that don't contain the paths we care about. In contrast, the command-line git C code uses optimizations like pruning while walking, and stopping when the requested path is unreachable. (Command-line git has bit flags like "seen" and "uninteresting" that get flagged on objects as it walks the tree.)

But the real reason for Dulwich's slowness is its .pack file implementation, in pack.py and _pack.c. It does a stat(), lseek(), and read() every time it goes to grab an object (like a git Tree) out of the pack file. I used the strace utility to see this -- 104 K system calls to find the log for home.md:

(.venv) ubuntu@cst-wiki:~/realms-wiki/.venv$ strace python ./speed_test.py ../../Flexget/wiki 2>&1 | grep -E 'read|stat|seek' | wc
 104284  566792 9620847

Command-line git, in contrast, does a grand total of 87 of those calls to get the same data:

(.venv) ubuntu@cst-wiki:~/Flexget/wiki$ strace git log home.md 2>&1 | grep -E 'read|stat|seek' | wc
     87     563    5872

No wonder it's 100x faster.

I will take another look at Dulwich's pack.py next. It might be possible to get a quick speed improvement by putting the whole pack file into RAM first to avoid all those system calls (it's about 15 MB, not bad by modern standards). Or maybe it could just do some simple caching. After that, there may still be some optimizations for class Walker(), but pack.py should definitely be fixed first.

This will probably not be a quick fix. So, my recommendation is to just shell out to command-line git log ... for now, like the example code above.

gazpachoking commented 7 years ago

@dereks Not sure what the best way forward is then. Maybe @scragg0x could weigh in.

dereks commented 7 years ago

I tried modifying Dulwich to use a memory-mapped (mmap) file for the pack file. I also hacked around a bug where it would stat the pack file every time it tried to read from it.

That successfully reduced the number of read/stat/seek system calls from 104 K to just 688, but it only sped up the search by about 1 second (out of 20). So, all those syscalls are not the major offender here (but they don't help, either).
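For reference, the shape of the mmap change was roughly this (a simplified sketch with assumed names, not the actual patch I made to Dulwich):

```python
import mmap

class MmapPackReader:
    """Map the pack file into memory once, so each object fetch is a
    slice of the map instead of a fresh stat()/lseek()/read() round trip."""

    def __init__(self, path):
        self._f = open(path, 'rb')
        # One mmap for the life of the reader; the kernel pages it in lazily.
        self._map = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)

    def read_at(self, offset, length):
        # A pure in-memory slice: no per-read system calls.
        return self._map[offset:offset + length]

    def close(self):
        self._map.close()
        self._f.close()
```

As noted above, this eliminates the syscall storm but barely moves the total time, which points the finger at the tree-walking logic rather than I/O.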

I looked closer at the profiling data, and the way Dulwich walks the tree does about 4 million calls to <string>:8(__new__), and also about 4 million calls to {built-in method __new__ of type object at 0x906bc0}. There are also 2 million calls to {dulwich._diff_tree._is_tree} -- why? -- and about 1 million calls to each of {method 'extend' of 'list' objects} and {method 'pop' of 'list' objects}.

So, my revised conclusion is that the slowness is mostly due to the repeatedly-recursive nature in which the git Tree objects are walked in Dulwich. I do not think there's a simple bug fix. IMHO, the files diff_tree.py and walk.py need to be redesigned from scratch.

I'm still in favor of shelling out to command-line "git log ...", but I haven't looked at libgit2 yet. I would be willing to look at libgit2 if that is a preferred solution, but I'm running out of time for this bug.

@scragg0x , thoughts?

scragg0x commented 7 years ago

I will merge a PR that uses git log for now.

dereks commented 7 years ago

I looked at libgit2, which is written in C, but has Python bindings.

Sadly, it has the same basic issue as Dulwich, even though it's written in C. Here is one of several posts about it, titled, "Why would it take nearly 10 seconds just to query one file's last commit log":

https://github.com/libgit2/libgit2/issues/3507

In that post, the user wound up doing the same research I've done here. He also found that command-line git was 100x faster than libgit2, because it dynamically flagged commit objects with optimization flags like SEEN, UNINTERESTING, TREESAME, CHILD_SHOWN, etc. It then uses these flags to early-out from the tree walk whenever possible.

Official response from libgit2: "We have not implemented git log functionality." All of the libgit2 bug entries about git log were collected into this one summary bug report, which is still open:

https://github.com/libgit2/libgit2/issues/3041

Thus, with libgit2 disqualified, my final recommendation is to shell out to command-line git for Realms-wiki page histories. (I'll implement it if @scragg0x agrees and would accept the pull request.)
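For concreteness, such a shell-out could look something like this (a hedged sketch; the function name and returned fields are my own choices, not the eventual PR):

```python
import subprocess

def git_file_history(repo_path, path, max_count=None):
    """Return history entries for `path`, newest first, by delegating the
    commit-tree walk to the C implementation behind `git log`.

    Each entry is {'sha', 'author', 'date', 'message'}; fields are joined
    with the ASCII unit separator (0x1f) so subjects with spaces survive.
    """
    cmd = ['git', 'log', '--format=%H%x1f%an%x1f%aI%x1f%s']
    if max_count:
        cmd.append('--max-count=%d' % max_count)
    cmd.extend(['--', path])
    out = subprocess.check_output(cmd, cwd=repo_path)
    entries = []
    for line in out.decode('utf-8').splitlines():
        sha, author, date, subject = line.split('\x1f')
        entries.append({'sha': sha, 'author': author,
                        'date': date, 'message': subject})
    return entries
```

Like the klaus code quoted earlier, this keeps the Python side trivial and lets C-git's pruning optimizations do the heavy lifting.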

dereks commented 7 years ago

Ha, beat me by 6 minutes. I'll get it to you within the next couple of days. Thanks for your time!