feature: clean up stale tags when appending (`-a`)

aktau commented 7 years ago

As referenced in #1420. Ideally both features would be implemented, for an enormous speed-up and an increase in ease of use (less duplicated code all over the place). The feature request (FR):

Filter out tags belonging to files that no longer exist when updating a tags file (in append mode).
Filter out (stale) tags belonging to the file that is currently being updated.
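
The first request could be approximated outside ctags along these lines. This is only a sketch: `drop_missing_files` is a made-up helper name, and it assumes the usual tab-separated tags layout (tag name, input file, address) with paths that resolve from the current directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of ctags): print a tags file without the
# entries whose input file no longer exists.
# Assumption: column 2 holds the input file, relative to the current directory.
drop_missing_files() {
    awk -F '\t' '
        /^!_TAG_/ { print; next }      # keep pseudo-tags (the header lines)
        {
            f = $2
            if (!(f in exists)) {
                # getline yields -1 when the file cannot be opened
                exists[f] = ((getline probe < f) >= 0)
                close(f)
            }
            if (exists[f]) print
        }
    ' "$1"
}
```

Usage would be something like `drop_missing_files tags > tags.new && mv tags.new tags`. Each distinct path is probed only once, so the cost grows with the number of files rather than the number of tags.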

Because exuberant/universal ctags doesn't support these two things, many tool developers are forced to implement something like them when regenerating (parts of) their tags files. Some examples:

https://github.com/ludovicchabant/vim-gutentags#what
https://github.com/craigemery/vim-autotag#autotagvim
... (there are quite a few more, not all for Vim).

This would be a natural extension of the append functionality, IMHO.

aktau commented 7 years ago

Another example of someone who implemented tags cleaning: https://github.com/kainz/incremental-ctags-hooks.

Citation:

Implementation OK, this is the ugly part. We temporarily write and execute a C program to filter 'removed' files from your ctags, then feed the remainder of changed files as seen by git-diff into a ctags --append. While this is incredibly ugly, it is orders of magnitude faster than awk/sed/perl, and about 10-20% faster than CPython on my tests involving an approximately 80MB tags file.

masatake commented 4 years ago

Now we can link libreadtags into ctags, so libreadtags can be used to do the filtering inside ctags.

An alternative approach is to introduce a new command such as linktags, ldtags, or edittags. More study is needed. I'm working on this topic very slowly.

masatake commented 4 years ago

Making tags for the whole source tree (Linux kernel):

% /bin/time -p u-ctags --options=NONE --fields=+KZz -o linux.tags -R code/linux                    
u-ctags: Notice: No options will be read from files or environment
real 54.79
user 54.23
sys 2.82

Dropping the tags for code/linux/block/bfq-iosched.c (here I assume this is the file you edited):

%  { readtags -t linux.tags -D; /bin/time -p readtags -t linux.tags -en -Q '(not (eq? $input "code/linux/block/bfq-iosched.c"))' -l } > filtered.tags
real 8.34
user 7.79
sys 0.52

tagging code/linux/block/bfq-iosched.c again:

% /bin/time -p u-ctags --options=NONE --fields=+KZz -o part.tags -R code/linux/block/bfq-iosched.c
u-ctags: Notice: No options will be read from files or environment
real 0.01
user 0.00
sys 0.00

Merging filtered.tags and part.tags:

% LC_COLLATE=C LC_ALL=C /bin/time -p sort -u --parallel=4 filtered.tags part.tags > tags
real 1.04
user 1.09
sys 0.60

It takes about 9.4s in total (8.34 + 0.01 + 1.04) to update the tags file, which is more than 5 times faster than the full parse (54.79s). I wonder whether I can improve the performance of the -Q option further.
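
For the merge step, a full sort may not be necessary if every input is already sorted in C-locale byte order. Whether that holds for a given tags file is an assumption that should be checked first (for example with `LC_ALL=C sort -c FILE`). Under that assumption, a sketch with a hypothetical `merge_tags` helper:

```shell
#!/bin/sh
# Hypothetical helper: merge tags files that are each already sorted in
# C-locale byte order (an assumption, not verified here).
# -m merges without re-sorting; -u drops identical lines, such as the
# pseudo-tag header that every input repeats.
merge_tags() {
    LC_ALL=C sort -m -u "$@"
}
```

It would be called as `merge_tags filtered.tags part.tags > tags`. If an input turns out to be unsorted, the output order is not reliable, and the plain `sort -u` shown above is the safer choice.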

Just for listing with readtags:

%  { readtags -t linux.tags -D; /bin/time -p readtags -t linux.tags -en  -l } > just-listing.tags
real 3.54
user 3.06
sys 0.45

aktau commented 4 years ago

A couple of questions:

Why does your last step use sort instead of u-ctags --append? Is it because there are no domain-specific optimizations and thus the speed is similar?
The approach you're using (filter ; retag-partial ; merge) seems very similar to what I mentioned in https://github.com/universal-ctags/ctags/issues/1421#issuecomment-304720272. It looks like your tool implements (interprets) some Lisp-like language. Could you try something like grep --fixed-strings -v 'code/linux/block/bfq-iosched.c' and see whether there is a slowdown, a speedup, or a difference in the results? grep is highly optimized for this sort of use case, and I would be somewhat surprised if readtags were faster.

masatake commented 4 years ago

Thank you for the comments.

Why does your last step use sort instead of u-ctags --append? Is it because there are no domain-specific optimizations and thus the speed is similar?

No optimization here. I'm not familiar with the option. So I just forgot it.

% cp filtered.tags filtered2.tags
% /bin/time -p u-ctags --options=NONE --fields=+KZz --append -o filtered2.tags code/linux/block/bfq-iosched.c
u-ctags: Notice: No options will be read from files or environment
real 1.51
user 1.74
sys 0.89

As I expected, the timing does not change much. However, it removes one step. Thank you.

The approach you're using (filter ; retag-partial ; merge) seems very similar to what I mentioned in #1421 (comment). It looks like your tool implements (interprets) some Lisp-like language. Could you try something like grep --fixed-strings -v 'code/linux/block/bfq-iosched.c' and see whether there is a slowdown, a speedup, or a difference in the results? grep is highly optimized for this sort of use case, and I would be somewhat surprised if readtags were faster.

You are correct.

% /bin/time -p grep --fixed-strings -v 'code/linux/block/bfq-iosched.c' tags > filtered-2.tags             
real 0.77
user 0.47
sys 0.30

grep is much faster than readtags -Q. If the goal is only to update a tags file, a wrapper shell script is the best option. I should write this up as a tip in the FAQ.
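
A wrapper along those lines might look as follows. It is a sketch rather than a tested tool: `update_tags` and the `CTAGS` variable are names invented here, and it assumes the path passed in is spelled exactly as it appears in the tags file. Unlike a plain `grep --fixed-strings -v PATH`, it wraps the path in tab characters so that only the input-file column is matched.

```shell
#!/bin/sh
# Hypothetical wrapper: refresh the entries of one source file in a tags file.
# CTAGS may point at another binary (e.g. u-ctags); ctags is an assumed default.
CTAGS=${CTAGS:-ctags}

update_tags() {
    tags_file=$1
    src=$2
    new_tags="$tags_file.new.$$"       # scratch copy next to the tags file

    # Drop stale entries. TAB + path + TAB confines the match to the
    # input-file column instead of any line that merely mentions the path.
    needle=$(printf '\t%s\t' "$src")
    grep -F -v -e "$needle" "$tags_file" > "$new_tags"
    # grep exits with 1 when no line survives; only 2 and above is an error.
    if [ $? -gt 1 ]; then rm -f "$new_tags"; return 1; fi

    # Re-tag the file unless it was deleted, appending to the filtered copy.
    if [ -e "$src" ]; then
        "$CTAGS" --append -f "$new_tags" "$src" || { rm -f "$new_tags"; return 1; }
    fi
    mv "$new_tags" "$tags_file"
}
```

For the example in this thread the call would be `update_tags tags code/linux/block/bfq-iosched.c`, with any extra ctags options (such as `--fields=+KZz`) added to the `"$CTAGS"` line so that they match the ones used to build the original tags file.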

I'm thinking about adding a feature to ctags for reading tags files, so that it can expand C preprocessor macros during parsing. The current implementation (#2427) can only expand macros defined in the same input file. If ctags could read macro definitions stored in tags files, it could overcome the "same input file" limitation.
