redguardtoo / company-ctags

Fastest Emacs auto-completion using Company and Ctags
GNU General Public License v3.0
56 stars 3 forks source link

implement non-prefix completion #4

Closed AmaiKinono closed 4 years ago

AmaiKinono commented 4 years ago

The technique is simple. Take an example, I put a symbol hello under the keys ?h, ?e, ?l and ?o, that's what makes non-prefix completion possible.

However, this is a modification on the basic mechanism, and currently I have no idea about the impact on the performance. I can do some test later, but I think it's better for you to do the test since you are more experienced at working on large projects.

I have these two goals:

  1. I really want to ensure that people can use both normal or non-prefix completion in the same Emacs session. Currently with this implementation you can do things described in https://github.com/redguardtoo/company-ctags/issues/2, using something like

    (defun my-counsel-company ()
     (interactive)
     (company-abort)
     (let ((company-ctags-non-prefix-completion t))
       (call-interactively 'counsel-company)))

    So people used to normal completion can also use substring filtering to find a symbol when he doesn't remember the name.

  2. The speed of normal completion shouldn't be influenced too much. If this happens, maybe we can try using integers computed from 2 chars as the keys. In the worst case, we can split the backends into a normal one and non-prefix one, and use two hash tables for them.

AmaiKinono commented 4 years ago

In fact I think we should use keys computed from 2 or even 3 chars, since that's what hash tables are designed for. It is what you need when there's a lot of key-value pairs.

Since the overhead for finding a key-value pair in a hash table is nearly constant, while searching in a list gets slower when the list is large, we should grow the hash table and shrink the list we use for values.

redguardtoo commented 4 years ago

Let me think about this.

AmaiKinono commented 4 years ago

Here are the steps I want to take further:

  1. For the tagname dict, use a cons pair for the value of a key. Its car contains all symbols start with the key, and cdr contains all symbols contain but not start with the key. By doing so, we don't need to deal with unwanted symbols when doing normal completion (just use that car). So the performance of normal completion will not be affected, and we can experiment with non-prefix completion safely.

  2. Add keys computed from 2 chars for symbols longer than 2 chars, get rid of the use of upper case chars, and push symbols in a case insensitive way, i.e. put both function and Function under fu. By doing so, we make case insensitive completion possible. The variable company-ctags-ignore-case currently doesn't work.

Now I'll talk about the pros and cons of using 2 chars. I did some statistics on the Linux kernel source code. The symbols are all converted to lower case. This is the distribution of first letters of the symbols (i.e. amount of symbols under every key, if using first letter as the key):

key numbers percentage
s   103446  10.97%
m   74999   7.95%
a   67483   7.15%
c   63194   6.70%
p   62895   6.67%
i   61831   6.55%
...
Total: 943287

So there are about 94k symbols, and the biggest list is under the the key s. Using company-ctags, it takes me around 2.5 secs to load the tags file, and showing the candidates for sym is immediate.

If using the algorithm in this PR, the amount of symbols under every key is:

key numbers percentage
_   909015  8.27%
e   698223  6.35%
t   633655  5.76%
s   607657  5.53%
a   593694  5.40%
r   588523  5.35%
...
b   255192  2.32%
...
Total: 10998052

So the total number of symbols in the dict increases to about 11.7x, and the biggest list expands to 8.8x. Notice that almost all symbols has a _ in them, so they are all under the key _. I think this is not good since if you type something starts with _ or e, you basically go through most or all symbols. Using the patch of this PR, it takes me 25 secs to load the tags file, and showing the candidates for _asm takes 10 secs.

If using the algorithm in this PR, but using 2 chars as keys, the amount of symbols under every key is:

key numbers percentage
_s  242803  1.55%
t_  234767  1.50%
_c  210243  1.34%
re  205406  1.31%
e_  203532  1.30%
_p  181134  1.16%
...
Total: 15633543

So the total number of symbols in the dict increases to about 16.6x, and the biggest list expands to 2.3x. I think this means the speed of completion shouldn't be affected much. Though this is not implemented, I tested completing bank with the patch in this PR, you can see (in the second data) the list under it is roughly the same size as the list under _s. It takes like 0.7 or 0.8 secs for the candidates to show up, and I think this is perfectly fine for a project like the Linux kernel.

The disadvantage is it must take much longer to load the tags file. Since we already see this phenomenon, we know it's pushing strings to the hash table, rather than open the tags file itself, takes most time.

So in summary, for quick non-prefix completion, we have to use 2 chars as keys, but that makes loading the tags file extremely slow, and my thought about this is how about not actually pushing strings to the hash table? how about we build a data structure of the strings (like just using a hash table of 1-char keys) and push the index in the 2-char hash table? I'll continue to experiment with these ideas, and if I get some results I'll post them here.

redguardtoo commented 4 years ago

it's done.