universal-ctags / ctags

A maintained ctags implementation
https://ctags.io
GNU General Public License v2.0

RFC: Daemon mode #423

Open vhda opened 9 years ago

vhda commented 9 years ago

I've been toying around with this idea in my head: one could run ctags in a daemon mode. This could prove especially useful in development environments, where the file currently being edited changes constantly, but we don't really want to re-run ctags on the complete repository. In daemon mode, ctags could quickly replace the in-memory tag list for that file with a new one.

I see two different complexity levels for this feature, being possible to develop the second on the top of the first:

  1. A simple Unix (or similar) socket that receives a file name; ctags parses that file, updates the tag list in memory, and writes a new tags file.
  2. Use D-Bus instead of the socket, adding the capability to also query the current tag database.
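Option 1 could be prototyped outside of ctags before touching its code. A minimal sketch, assuming a POSIX shell, ctags' `--append` option, and stdin carrying one file name per line (imagine stdin wired to a Unix socket by a tool like ncat); the `update_from_stdin` helper and the `CTAGS` override variable are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sketch of option 1: read file names, one per line, from
# stdin (imagine stdin connected to a Unix socket) and refresh the tags
# file for each. CTAGS is overridable purely for illustration/testing.
CTAGS=${CTAGS:-ctags}

update_from_stdin() {
    while IFS= read -r file; do
        [ -f "$file" ] || continue               # ignore unknown names
        "$CTAGS" --append=yes -f tags "$file"    # NB: stale tags for $file remain
    done
}

# e.g.: printf 'src/foo.c\n' | update_from_stdin
```

Note this only appends; removing the file's stale tags first is the harder part.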

At this point I just want to hear your opinions about this idea, as I don't even know whether I would be capable of implementing such a feature!

masatake commented 9 years ago

(I don't know the details, but I guess geany's tagmanager may do the same thing, though it doesn't have an IPC interface. Compared with our ctags, geany's has well-abstracted I/O. For daemon mode, or anything else interesting you want to do with ctags, that I/O subsystem is needed.)

Even using a simple Unix socket, we can implement a query interface. So I think we can leave the type of IPC out of this discussion. With that, my understanding is that 1 and 2 can be rewritten as:

1'. a way of updating the tags file incrementally
2'. a query interface

At least the following parts are needed:

A. an in-memory database (data structure)
B. a routine loading a tags file into that data structure
C. a routine putting a tagEntryInfo into the in-memory database

As I wrote in #382, the functions in xcmd.c can be used for part B.

As I wrote at the top of this comment, geany's tagmanager/ctags has an abstracted output interface: https://github.com/geany/geany/blob/master/tagmanager/ctags/parse.h#L130 This is related to C.

I have never investigated A. Internally, cork has an array of tagEntryInfo. You can use it for testing/verifying your ideas. I guess we can learn many things from the tagmanager core.

Though I don't want to, before entering this exciting area I have to ask you about releasing. Do we/you want to release a version? I don't want to, because once we release, we have to think about (self-)compatibility more seriously. While we can ignore compatibility, we can enjoy random hacking more :-P But I think releasing is a kind of duty.

vhda commented 8 years ago

I think that, at least as a first approach, it would be acceptable for ctags to parse the selected directory recursively before starting the daemon. That way we don't need the capability to load an existing tags file.

@masatake since you know ctags internals better than I do, how hard would it be to delete all tags corresponding to a specific file using the currently implemented data structure? Would it be too slow (or slower than reparsing a complete workspace)?

masatake commented 8 years ago

@vhda, my understanding is that you are talking about an in-memory database (let's call it mdb here). There is no mdb in current ctags.

It is not possible to implement this based on cork as it is now. Please look at https://github.com/universal-ctags/ctags/blob/master/docs/output-tag-stream.svg This figure shows the output-related code of current ctags. When the output of a parser is corked, an entry passed to makeTagEntry goes into a cork queue via the queueTagEntry function. When uncorked, the entries in the cork queue are flushed to a file.

The cork queue is part of the TagFile global variable. If ctags could have a cork queue per input file, you would get what you want:

S1. When starting as a daemon, ctags parses the files and stores the results in queues, one queue per input file.
S2. When ctags gets a query from a client, it scans the queues for the answer.
S3. When ctags detects that an input file was updated, it clears the queue associated with that file, parses the file again, and stores the new tags in the queue.

The cork API was introduced to make handling scope information easier, so its public API is limited. However, you can extend it for your experiment.

For the purpose of S2 (querying), cork queues may be slow: a query takes O(n * m), where n is the number of tags in an input file and m is the number of input files.

A completely different approach is to use SQLite. Look at the figure again: there are 3 backends (writeEtagsEntry, writeXrefEntry, and writeCtagsEntry) behind writeTagEntry. You could add one more, writeSqliteEntry.
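From outside ctags, the same idea can be tried without touching the code: index an existing tags file into SQLite with the sqlite3 CLI. A rough sketch (the `tags2sqlite` name and the table layout are made up; only the first three tab-separated fields are kept, which is lossy but enough for name lookups):

```shell
#!/bin/sh
# Hypothetical sketch of the SQLite idea from outside ctags: index an
# existing tags file into a database with the sqlite3 CLI. The table and
# column names are invented; only the first three tab-separated fields
# (name, file, excmd) are kept.
tags2sqlite() {
    tagfile=$1 db=$2
    grep -v '^!_TAG_' "$tagfile" | cut -f1-3 > "$db.tsv"
    sqlite3 "$db" <<EOF
DROP TABLE IF EXISTS tags;
CREATE TABLE tags (name TEXT, file TEXT, excmd TEXT);
CREATE INDEX tags_name ON tags (name);
.mode tabs
.import $db.tsv tags
EOF
    rm -f "$db.tsv"
}

# e.g.: tags2sqlite tags tags.db
#       sqlite3 tags.db "SELECT file, excmd FROM tags WHERE name = 'main';"
```

A writeSqliteEntry backend inside ctags could skip the intermediate TSV entirely.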

cweagans commented 8 years ago

+1 for in-memory sqlite database. That's pretty efficient for this kind of thing.

vhda commented 8 years ago

What do you guys think about the feature itself? Does it make sense? Or would we just be over-complicating ctags?

b4n commented 8 years ago

@vhda I myself don't see the point.

For me, tagging files is not a daemon job [1].

Also, what is the use case? An IDE spawning the daemon and asking it to update files? If that's it, you first need non-file-backed parsing, because an IDE will be interested in real-time changes. Also, IMO for that matter it's better to make ctags a library the IDE uses: it's more flexible, and probably simpler for everyone (unless you also make a client library for users, which just makes it more complex).

For an IDE, not only do you want tags, you want to be able to make something out of them, so it's likely you have some kind of indexing phase on your side (if you want more than simple lookup). E.g. in Geany we keep several optimized lists for various needs (one for each file, a combined one, and one containing only type-like tags). As we also build a per-file tree of the generated tags (based on scope), you could imagine we'd want to cache this too. We don't, and just have several optimizations to update it, but it could maybe make sense to cache this as well.

My point is that IMO for an IDE what you want is in-memory representation of the tags, and probably in a custom format to meet your own needs. I don't think a daemon mode would achieve this.

Sure, it'd be better than spawning ctags on each update or each file, but I don't think it really is the solution. Again, IMO "just" make CTags a library with a pluggable makeTagEntry() or the like (a bit like Geany's tagmanager currently) so applications can run the parsers as they wish, get the tags, and then do whatever they see fit with them.

Also, before designing this, IMO one should look at what the tools using or embedding CTags do and need. I'm thinking of Geany, Anjuta, possibly Vim and Emacs, etc.


[1] at least not in what "daemon" generally means in the UNIX world. If it's really wanted, I guess stdin-based communication would make more sense, as in: a tool spawns ctags and communicates with it to control it -- this tool and this tool only; it's not a system service or anything.

masatake commented 8 years ago

...@b4n wrote everything. Whether we implement the daemon mode or not, what we need is the same: abstraction of the I/O routines.

cweagans commented 8 years ago

Personally, I'd prefer a ctags daemon over a tags file for vim editing. The projects I work on are quite large, and getting the data I need from a tags file is very slow.

vhda commented 8 years ago

In the case of Vim, which is the editor I use, I don't think it would be that easy, because you would need to modify the code base to link it to libuctags, or use some kind of Python bindings. Even for other editors you would still need to write C code to implement such support. Let me say, I completely support the library idea, but I feel that its adoption may be slow.

On the other hand, tags file parsing is already widely supported in most editors! But if we need to update the tags for one file, then all files need to be reprocessed, which might be slow if we're working on a big project. And this was the basis for this idea: provide a faster way of updating the tags file.

Using stdin-based communication would be enough, but in (standard) Vim it's not so easy to start a background process. Neovim is working on that, but it's still not stable enough for a typical user, IMO. And I don't know how easy this is in other editors.

This was the basis for the daemon idea, which would run only for the user (not at system level), preferably as a single process per project area. It would give plugin developers (in languages like Lua, VimL, Python, etc.) a simple way of starting the ctags "daemon" and communicating with it via sockets, asking it to update the tags for a single file. It would also have the advantage that multiple editor processes in the same project could share a single daemon.

And I'm really looking for a very simple implementation, with the least amount of changes to the ctags code base, because I agree that in the long run having all editors link to libuctags is the best solution. But such a daemon mode with a simple "update file X" command would probably fill in missing functionality that can be used almost immediately by all editors supporting tags files.

masatake commented 8 years ago

If you had a mergetags command, you could implement what you want in a shell script:

  1. ctags makes a tags file for each source file initially.
  2. When one of the source files is updated, a client lets ctags know the name of the file; ctags then just updates the associated tags file.
  3. mergetags gathers the per-file tags files into one tags file and sends it to the client.

The netcat (nc) and flock commands may be needed.
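That mergetags step might itself be a small script. A rough sketch, assuming every input tags file carries its own pseudo-tag (!_TAG_...) header and C-locale sorting, as ctags produces (the `mergetags` name is made up):

```shell
#!/bin/sh
# Hypothetical "mergetags" sketch: combine several tags files into one,
# keeping a single copy of the pseudo-tag (!_TAG_...) header lines.
mergetags() {
    out=$1; shift
    {
        grep -h  '^!_TAG_' "$@" | sort -u   # one copy of the pseudo-tags
        grep -hv '^!_TAG_' "$@" | sort      # then all the real tags, sorted
    } > "$out"
}

# e.g.: mergetags tags tags.d/*.tags
```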

I'm not sure whether your question is about the project policy/roadmap or an implementation approach.

About the project: the question is meaningless in some respects, because even if we don't implement the daemon mode, we will write the building blocks needed for it, like libuctags and mergetags. These things are needed by other applications anyway. The only difference is when they get implemented.

b4n commented 8 years ago

@vhda OK, so I see that what you want to achieve is something that could be a drop-in replacement for the current ctags usage? If that's it, then OK, why not, if some people would benefit from it.

But then I'd suggest the same as @masatake: wouldn't a mergetags command be enough? You actually wouldn't even really need a specific tool; you could just do something like this:

#!/bin/sh

export LANG=C

set -e -x

# remove the existing tags for the file
escaped1="$(echo "$1" | sed 's%/%\\/%g')"
sed -i -r "/[^\t]*\t${escaped1}\t/d" tags

# add new at the end
./ctags -o- "$1" >> tags

# and sort the output (optional)
tmp=$(tempfile --prefix="tags.")
sed '/^!_TAG_/!q' tags > "$tmp" # extract the pseudo-tags
sed '/^!_TAG_/d' tags | sort -u >> "$tmp" # and sort the rest
mv -f "$tmp" tags

vhda commented 8 years ago

I just made a small test with the script. Ignoring the sort, it takes 6.84s to run; 7.22s with sort. It takes ctags 0.16s to process the file I randomly chose. A complete parse of my files takes 10.33s.

Conclusion: there's not much gain between parsing all files and running the script.

b4n commented 8 years ago

@vhda this is totally naive and might be optimized a little by reducing actual writes. At first glance it should be possible to remove one write by combining the removal of the existing tags and the addition of the new ones:

( sed -r "/[^\t]*\t${escaped1}\t/d" < tags && ./ctags -o- "$1" ) > tags.new && mv -f tags.new tags  # writing straight to tags would truncate it before sed reads it

It'd be interesting to see what actually takes the most time here -- but I guess if it's so slow, it's probably sed removing the lines.

BTW, my escaping of the argument is incomplete; it should rather be

escaped1="$(echo "$1" | sed 's%[][/.*$^\\]%\\&%g')"

or similar, to also escape regex control characters. Another solution that is probably good enough and might be faster would be something like grep -vF "\t$1\t" (with literal tabs); that would require testing.
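Since tags entries are tab-delimited, literal tabs can be embedded with printf so that grep -F needs no escaping at all. A small sketch (`remove_file_tags` is a made-up name):

```shell
# Match the input file name as a whole tab-delimited field; grep -F takes
# the pattern verbatim, so no regex escaping is needed at all.
tab=$(printf '\t')

remove_file_tags() {
    grep -vF "${tab}$1${tab}" tags
}

# e.g.: remove_file_tags src/main.c > tags.new
```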

b4n commented 8 years ago

@vhda what about this

#!/bin/sh

export LANG=C

set -e -x

sort=true
tmp=$(tempfile --prefix="tags.")

(
    # remove the existing tags for the file
    grep -vF "$(printf '\t%s\t' "$1")" tags   # field separators are literal tabs
    # and add new at the end
    ./ctags -o- "$1"
) > "$tmp"

# and sort the output (optional)
if "$sort"; then
    (
        sed -n '/^!_TAG_/!q;p' "$tmp" # extract the pseudo-tags
        sed '/^!_TAG_/d' "$tmp" | sort -u # and sort the rest
    ) > tags
    rm -f "$tmp"
else
    mv -f "$tmp" tags
fi

It also fixes a bug in the sort part where the first tag after the pseudo tags was duplicated (stupid mistake).

b4n commented 8 years ago

Anyway, I guess that the "daemon mode" you suggest could probably be implemented as a layer separate from CTags, as the only definite overhead would be spawning ctags. The rest (reading the original tags file and combining CTags' output) should not really be any slower done outside of ctags.

I'm not saying a shell script is as fast, but a C program reading CTags' stdout should be.

b4n commented 8 years ago

Personally, I'd prefer a ctags daemon over a tags file for vim editing. The projects I work on are quite large, and getting the data I need from a tags file is very slow.

IIUC, what @vhda wants to introduce would still output a tags file; it would just avoid having to re-parse every file to create the new version of the tags file.

If it's the tags file itself that is slow, Vim would need to get the input from somewhere else, and I guess that would mean non-trivial changes, wouldn't it? If not, OK: a process you can query might be faster than reading the tags from a file, given it has a clever representation of the tags for what it does.

b4n commented 8 years ago

Ignoring the sort, it takes 6.84s to run. Takes 7.22s with sort. It takes ctags 0.16s to process the file I randomly chose. The complete parsing of my files takes 10.33s.

BTW, not saying this is great, but it's still about 30% faster :)

vhda commented 8 years ago

I'll need to test that update tomorrow at work, where I have the 75MB tags file.

And yes, having a second layer in ctags to read the tags file and append/add the updated tags should not be any faster than your script. The advantage of having ctags run as a daemon would be exactly that: there would be no need to read the tags again, because they are still in memory.

But do you think that having ctags parse a single file and write it out would take approximately the same time as the script? I'm assuming it would not, but I might be wrong... Probably we can quickly modify ctags to timestamp the tags file dump so that we can measure how long it takes. I don't know where that is done, but I can try to find out :)

b4n commented 8 years ago

I just did quick testing of my new script with a 61MB tag file (the content of my /usr/include), and here's some results:

(all measured with the very precise benchmark time -f 'time: %e mem: %M' run a few times ;)

An interesting point is that reading files is obviously largely IO-bound, so reading all the source files will be very slow if they aren't in cache (and the underlying media is not so fast, say not an SSD). So one reason re-parsing only the modified file and updating the tags file is a good idea is that the tags file is likely to be in cache (it either was recently accessed or will be -- otherwise, why build it), while the whole tree to parse might not be (nobody will be working on all the files behind 540k+ tags). Same for the single file to re-parse: it's both a single file and likely already in cache.

Also, not that it matters much, but the overall resident memory is a lot smaller when just updating one file manually (I'm not sure what eats all this in CTags, so I don't know what comes into play).

[1] probably doesn't matter so much for one single file though

[edit] Using a tmpfs to store the temporary file brings the merge down to ~0.8s, so IO is an important bottleneck even for the temp file. [edit 2] Putting all output files on a tmpfs gives 7.45s for hot ctags -R and 0.73s for the merge.

b4n commented 8 years ago

[…] The advantage of having ctags running as a daemon would be exactly that: there was no need to read the tags again, because they are still in memory.

But do you think that having ctags parse a single file and writing out would take approximately the same time as the script? I'm assuming that it's not, but I might be wrong...

Nah, I guess it'd be faster. Though this would require actual measurements: the tags file is most likely to be cached, so what might take actual time is the write, I don't know.

Probably we can quickly modify ctags to timestamp the tags file dump, so that we can measure how long it takes. I don't know where that is done, but I can try to find out :)

You can also probably profile ctags to see this.

vhda commented 8 years ago

An interesting point is that obviously reading files is largely IO bound, so reading all the source files will be very slow if they aren't in cache (and given the underlying media is not so fast, maybe not SSD). So one reason why re-parsing only the modified file and updating the tag file is a good idea is because the tag file is likely to be in cache (either was recently accessed or will be -- otherwise, why build it, right), while the whole tree to parse might not (nobody will be working on all the files worth 540k+ tags). Same for the single file to re-parse, it's both one single file and likely to be already in cache.

There are other scenarios: at work I'm accessing an NFS-mounted filesystem, and I'm not sure tmpfs is available in the server images. I'm also not sure how NFS caching works and how fsync calls influence its performance.

But I will try to run some tests tomorrow, if I find some free time ;)

Probably we can quickly modify ctags to timestamp the tags file dump, so that we can measure how long it takes. I don't know where that is done, but I can try to find out :)

You can also probably profile ctags to see this.

I need to learn how to do that. But I think I saw a reference to timestamps in ctags' main code, so such time measurement functionality may already be available. I need to confirm.

b4n commented 8 years ago

Hum, I got an idea to make it faster, and it's working :) Try this:

#!/bin/sh

export LANG=C

set -e -x

[ $# -eq 2 ] # safety check on the args

sort=true
tagfile="$1"
input="$2"
tmp=$(tempfile -d "$(dirname "$tagfile")" --prefix="$(basename "$tagfile")")

(
    # remove the existing tags for the file
    grep -vF "$(printf '\t%s\t' "$input")" "$tagfile"   # field separators are literal tabs
    # and add new at the end
    ./ctags -o- "$input"
) | (
    if "$sort"; then
        sed -n '/^!_TAG_/!q;p' "$tagfile" # extract the pseudo-tags
        sed '/^!_TAG_/d' /dev/stdin | sort -u # and sort the rest
    else
        cat
    fi
) > "$tmp"

mv -f "$tmp" "$tagfile"

It's the same as before, but the temporary file is created in the tags file's directory and the sorting is done on the stream.

This makes things a lot faster if /tmp is on a separate FS, as the final move will now just be a rename, and it also makes sorting a lot faster as it's done on the stream directly. This however shouldn't change much if not sorting and /tmp is on the same FS as the tags file.

With this, I get 0.66s with sorting enabled, and 0.16s without, all this on a real FS -- no more tmpfs.

masatake commented 8 years ago

I'm surprised that even mergetags itself can be written in a shell script.

Taking an $input argument is useful. But there is another approach, using timestamps like make does. It would be slower, but you wouldn't have to tell ctags which file should be updated.
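The timestamp approach could be sketched with find -newer against the tags file itself (`update_stale` and the `CTAGS` override variable are made-up names; note it only appends, so stale tags from previous versions of the files remain and would still need a merge/filter step):

```shell
#!/bin/sh
# Hypothetical make-style update: re-parse only source files that are
# newer than the tags file, in append mode. Stale entries from previous
# versions of those files are NOT removed here.
CTAGS=${CTAGS:-ctags}

update_stale() {
    if [ -f tags ]; then
        find . -name '*.[ch]' -newer tags -exec "$CTAGS" --append=yes {} +
    else
        "$CTAGS" -R .          # first run: full parse
    fi
}
```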

We can implement these tools on top of the ctags command. The question is whether we should provide such upper-layer tools as part of this project. I think yes. How about calling it utags? utags could take subcommands like git:

$ utags maketags --updating foo.c bar.c -o tags
$ utags update --timestamp
$ utags query --language=make --kind=t --sort
$ utags daemon --listen-port=9000

Each subcommand is well separated, so you can try out your ideas safely.

vhda commented 8 years ago

@b4n your new algorithm takes only 3.14s with sort and 1.61s without. This is indeed an improvement :) Apparently grep is much faster than sed, which is something I didn't know.

b4n commented 8 years ago

@vhda

Apparently grep is much faster than sed, which is something I didn't know.

Did you try only the grep change, or the whole last version? (which has many other improvements)

I'm not sure grep is faster than sed per se; we'd have to check with only that difference. But one thing that may make it true here is that the grep command uses fixed strings (-F), not regexes, which is most likely faster. The grep command (especially designed to avoid regexes) is less correct than the sed one I originally had -- but I think it's totally good enough.

@masatake

I'm surprised that even mergetags itself can be written in a shell script.

Well, the tags format is pretty well structured, and it's all text, so UNIX tools can mostly work on it :)

Taking $input argument is useful. But there is another approach using time stamps like make. It becomes slow but you don't have to tell ctags which file should be updated.

That's interesting. Could be done with an actual Makefile I guess, that should totally be doable.

We can implement these tools based on ctags command. The question is whether we should provide such upper layer tools as part of this project. I think yes. How about call it utags? utags may take sub command like git:

Why not, if there is a use case/demand for this.

b4n commented 8 years ago

I made a Gist with a slightly improved version of the script (adding argument handling): https://gist.github.com/b4n/b4465983d81a75c99f16

vhda commented 8 years ago

Perfect! :) Just one thing: http://stackoverflow.com/questions/18716658/whats-the-difference-between-tempfile-and-mktemp

b4n commented 8 years ago

Good point. I always forget which one is the preferred one :S Fixed.

maurizi commented 8 years ago

You may be interested in checking out this small Rust tool I made recently: https://github.com/maurizi/retag.rs

It watches a directory for file changes (using inotify on Linux and polling on other systems) and then does basically what the merge script above does (removes tags for the changed file, then generates new tags in append mode).

It uses a temporary file when generating tags, so that the current tag file can still be used while tags are rebuilding, and if you use ZSH there is a ZSH plugin to start/stop it automatically when you enter a Git project directory.

mawww commented 8 years ago

As an additional data point: I use the readtags command for Kakoune's ctags support, with a 239 MB tags file at work. Accessing a tag by name takes between 50 and 100ms, so I doubt a daemon for tag access is worth it.

For tag name queries where a binary search is not applicable (non-prefix-based completion, for example), I use a separate tagnames file generated from tags, containing only the tag names. It is 19MB, so grepping it with arbitrary expressions is fast enough for interactive purposes.
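That tagnames file is easy to derive from a sorted tags file; a tiny sketch (`make_tagnames` is a made-up name):

```shell
# Derive a names-only "tagnames" file from a (sorted) tags file; uniq
# collapses the duplicate names produced by adjacent entries.
make_tagnames() {
    grep -v '^!_TAG_' "$1" | cut -f1 | uniq > tagnames
}

# e.g.: make_tagnames tags && grep -i 'entry' tagnames
```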

It seems to me the two problems we have are generating new tags and merging them into an existing tags file. Both should be fixable without a daemon.

vhda commented 8 years ago

Please take #80 also into consideration.

dtikhonov commented 8 years ago

crontab -e solves this problem for me. I don't think we should teach ctags how to daemonize.

dtikhonov commented 8 years ago

If we want to make ctags faster, the simplest thing to do is to rewrite iFileGetLine to read the file in 4 KB chunks instead of character by character. (Each character may be tested three or four times!) That would double or triple the speed.

vhda commented 8 years ago

@dtikhonov when using a cron job, if ctags opens the tags file at the start of a recursive run and that run takes 10s, isn't there a chance you'll read an incomplete tags file in the meantime? Regarding iFileGetLine, how hard would that change be to implement? Apparently I have the slowest NFS disk in the whole wide world, so I can easily test it here ;)

dtikhonov commented 8 years ago

@vhda, it may not be that simple -- and it will require some testing. The current code has been working for a long time, so changes there are risky. I'd leave it until after 6.0 comes out.