Proposal: A possible way of incrementally updating tags file

AmaiKinono commented 4 years ago

We had some discussion before, see https://github.com/universal-ctags/ctags/issues/2475#issuecomment-621366924. @masatake suggested the following scheme:

Drop tag entries for updated files (reported by git?) and write the remaining tag entries to a temporary tag file (tmp tag file),
Appende tag entries generated by running parsers for the updated files, and writing them to a new tag file,
Merge and Sort the two tags files.

2 and 3 can be done by the -a option.

I want to suggest a way to do 1 in this proposal. My thought is we use a group of pseudo tags to record the last modified time for each source file. It may look like this:

!_SOURCE_FILE_TIMESTAMP!/project/path/something.c   2020-11-11 23:33:28.908535595 +0800 /last modified time/

Then we offer an option like --increment to do incremental update. With this option, we compare this pseudo tag with the actual timestamp, and find files that are and are not updated (new files are counted as updated ones). This option should do nothing when generating a tags file for the first time.

For writing the temporary tags file, maybe we could use readtags like this:

$ readtags -t tags -Q '(not (#/(updated-file1)\|(updated-file2)\|.../ $input))' -Ene > temp_tags

Another problem is to ensure that the command line options are the same. A possible way we've discussed is to also put them in the pseudo tags, and @masatake said this is related to "multipass parsing". I'm not familiar with that idea, but for the scope of incremental tags file update, I think if people are doing this, then the command line is probably generated by a script/makefile/client tool/ctags configuration file, so it wouldn't become a problem if we ask for same options between multiple runs, rather than letting ctags ensure it.

masatake commented 4 years ago

This includes multiple sub issues. I made a Wiki page for this topic. https://github.com/universal-ctags/ctags/wiki/Designing:-updating-tags-file-incrementally @AmaiKinono, I assume you can edit the page.

I think about the initial goal. "Adding -C action to readtags" is not so hard task but we need anyway.

$ readtags -t a.tags -C
--fields={...} --kinds-C=... ...

-C option reproduces the command line used for generating an existing tags file. The option refers pseudo tags.

For achieving the goal, we have to do: A. adding more pseudo tags B. extending readtags command

masatake commented 4 years ago

I found I implemented TAG_EXTRA_DESCRIPTION and TAG_FIELD_DESCRIPTION ago.

$ u-ctags --list-pseudo-tags
#NAME                     ENABLED DESCRIPTION
JSON_OUTPUT_VERSION       off     the version of json output stream format
TAG_EXTRA_DESCRIPTION     off     the names and descriptions of enabled extras
TAG_FIELD_DESCRIPTION     off     the names and descriptions of enabled fields
...

I cannot remember when I implemented them.

Anyway, I should turn on the ptags by default.

AmaiKinono commented 4 years ago

I saw this in the wiki

Where do timestamps (and content hash values) go

Using hash is obviously another way to do the detection. Do we want one of them or both?

masatake commented 4 years ago

Attaching timestamps to 'F' kind tags is worked:

$ ./ctags --extras=+f -o - main/main.c | grep timestamp
main.c  main/main.c 1;" F   language:C  timestamp:Sat Oct 24 01:31:30 2020
$ ./ctags --output-format=json --extras=+f -o - main/main.c | grep timestamp
{"_type": "tag", "name": "main.c", "path": "main/main.c", "pattern": false, "language": "C", "kind": "file", "timestamp": "Sat Oct 24 01:31:30 2020"}

Should I change the print format?

AmaiKinono commented 4 years ago

I think it's good as long as ctags itself could compare between timestamps. But if we assume that client tools may want to use it in some way, I suggest adding the time zone so we could convert it with the Unix time. The encode-time function in Elisp also requires the time zone.

Btw, is there any reason to create these tags for files, rather than using pseudo tags?

masatake commented 4 years ago

I found the output of stat command includes such a timestamp.

$ stat /
  File: /
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 96          Links: 22
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:root_t:s0
Access: 2020-11-14 04:55:32.965210497 +0900
Modify: 2020-10-31 13:45:40.465843725 +0900

This format may be good enough. For parsing it, we can use strptime. If a platform doesn't have the function, we can use https://github.com/openssh/openssh-portable/blob/master/openbsd-compat/strptime.c .

Btw, is there any reason to create these tags for files, rather than using pseudo tags?

As I wrote in the Wikipage, there are three ways to store timestamps to a tags file.

In the prototyping, I chose F/file because F/file kind is already there. I will enumerate benefits using F/file.

We can use binary search as libreadtags provides. If one knows the name of a file whose timestamp the one wants to know, binary-search is helpful especially when one runs ctags on a large source tree including many files. (e.g. linux kernel) If one uses the ptags, one has to use linear-search.
We can keep the size of tags file smaller. To implement jumping-tags-by-file-name, the one needs F/file anyway. Putting the timestamp for the file to the F/file tag looks natural.
Adding more fields are not so hard.

weaknesses of F/file are

You cannot enumerate input files without the linear-search.
This approach makes tagEntryInfo data structure larger.

If a user gives the list of input files for updating, F/file is enough. If a user gives only existing tags file, the ptags is better. In the latter case, ctags must enumerate input files by reading the tags file.

I cannot do much at once. I would like to explore what we can do more with F/file kind.

masatake commented 3 years ago

I have studied the way to update a file from sed -i.

$ strace -y -o /tmp/LOG -f sed -i -e 's/a/b/g'  /tmp/i.tmp
1055322 openat(AT_FDCWD, "/tmp/i.tmp", O_RDONLY) = 3</tmp/i.tmp>
1055322 ioctl(3</tmp/i.tmp>, TCGETS, 0x7ffc82af0ac0) = -1 ENOTTY (Inappropriate ioctl for device)
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 lgetxattr("/tmp/i.tmp", "security.selinux", "unconfined_u:object_r:user_tmp_t"..., 255) = 36
1055322 access("/var/run/setrans/.setrans-unix", F_OK) = -1 ENOENT (No such file or directory)
1055322 futex(0x7f5a1647d5d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
1055322 futex(0x7f5a1647c2e8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDONLY|O_CLOEXEC) = 4</proc/1055322/task/1055322/attr/fscreate>
1055322 read(4</proc/1055322/task/1055322/attr/fscreate>, "", 4095) = 0
1055322 close(4</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDWR|O_CLOEXEC) = 4</proc/1055322/task/1055322/attr/fscreate>
1055322 write(4</proc/1055322/task/1055322/attr/fscreate>, "unconfined_u:object_r:user_tmp_t"..., 36) = 36
1055322 close(4</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 umask(077)                      = 022
1055322 getpid()                        = 1055322
1055322 openat(AT_FDCWD, "/tmp/sedZXwkGG", O_RDWR|O_CREAT|O_EXCL, 0600) = 4</tmp/sedZXwkGG>
1055322 umask(022)                      = 077
1055322 fcntl(4</tmp/sedZXwkGG>, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDWR|O_CLOEXEC) = 5</proc/1055322/task/1055322/attr/fscreate>
1055322 write(5</proc/1055322/task/1055322/attr/fscreate>, NULL, 0) = 0
1055322 close(5</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 read(3</tmp/i.tmp>, "b\n\n", 4096) = 3
1055322 fstat(4</tmp/sedZXwkGG>, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
1055322 read(3</tmp/i.tmp>, "", 4096)   = 0
1055322 fchown(4</tmp/sedZXwkGG>, 1000, 1000) = 0
1055322 fgetxattr(3</tmp/i.tmp>, "system.posix_acl_access", 0x7ffc82af0a70, 132) = -1 ENODATA (No data available)
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 fsetxattr(4</tmp/sedZXwkGG>, "system.posix_acl_access", "\2\0\0\0\1\0\6\0\377\377\377\377\4\0\4\0\377\377\377\377 \0\4\0\377\377\377\377", 28, 0) = 0
1055322 close(3</tmp/i.tmp>)            = 0
1055322 write(4</tmp/sedZXwkGG>, "b\n\n", 3) = 3
1055322 close(4</tmp/sedZXwkGG>)        = 0
1055322 rename("/tmp/sedZXwkGG", "/tmp/i.tmp") = 0

AmaiKinono commented 3 years ago

I have studied the way to update a file from sed -i.

Very cool. In which way this would help us? I mean ctags already has the -a option, and we have to sort it afterwards. Or are you planning to use it in other parts? Actually my feeling on this is you can use it in the edittags command you've mentioned.

masatake commented 3 years ago

I would like to withdraw the goal "Adding -C action to readtags". I made a pull request for adding timestamp: field in #2732.

masatake commented 3 years ago

Like the timestamp field, linking libreadtags to ctags is another step for the incremental updating. The multipass parsing is just another feature that requires linking libreadtags to ctags. There are two approaches to implement the multipass parsing: (0) using in-memory database (1) reading an existing tags file

ctags already infrastructure of (0). If you are interested in this idea, search "symtab" in our source tree. (0) is nothing to do with the incremental updating. Like the incremental updating, approach (1) requires linking libreadtags to ctags.

Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating. @AmaiKinono, you wrote you are not familiar with the idea of "multipass parsing." I'm bad in English. So I will show what it is only with the command output of the prototyping.

$ cat input.py 
from mymod import X 
# ...

$ cat mymod.py 
X = 1
$ u-ctags --extras=+r --fields=+r --fields=+K -o 1st-pass.tags input.py mymod.py 
$ u-ctags --extras=+r --fields=+r --fields=+K -o 2nd-pass.tags --_hint=1st-pass.tags input.py mymod.py 
$ diff -ruN 1st-pass.tags 2nd-pass.tags 
--- 1st-pass.tags   2020-11-28 04:19:03.296310967 +0900
+++ 2nd-pass.tags   2020-11-28 04:19:13.281347322 +0900
@@ -9,6 +9,6 @@
 !_TAG_PROGRAM_NAME Universal Ctags /Derived from Exuberant Ctags/
 !_TAG_PROGRAM_URL  https://ctags.io/   /official site/
 !_TAG_PROGRAM_VERSION  5.9.0   /7f84a443/
-X  input.py    /^from mymod import X $/;"  unknown module:mymod    roles:imported
+X  input.py    /^from mymod import X $/;"  variable    module:mymod    roles:imported
 X  mymod.py    /^X = 1$/;" variable    roles:def
 mymod  input.py    /^from mymod import X $/;"  module  roles:namespace

For recoding I would like to show the diff of python.c:-)

diff --git a/parsers/python.c b/parsers/python.c
index 7806a903..a506a044 100644
--- a/parsers/python.c
+++ b/parsers/python.c
@@ -23,6 +23,8 @@
 #include "debug.h"
 #include "xtag.h"
 #include "objpool.h"
+#include "hint.h"
+#include "routines.h"

 #define isIdentifierChar(c) \
    (isalnum (c) || (c) == '_' || (c) >= 0x80)
@@ -135,7 +137,8 @@ static kindDefinition PythonKinds[COUNT_KIND] = {
    {true, 'c', "class",    "classes"},
    {true, 'f', "function", "functions"},
    {true, 'm', "member",   "class members"},
-   {true, 'v', "variable", "variables"},
+   {true, 'v', "variable", "variables",
+    .referenceOnly = false, ATTACH_ROLES(PythonUnknownRoles)},
    {true, 'I', "namespace", "name referring a module defined in other file"},
    {true, 'i', "module",    "modules",
     .referenceOnly = true,  ATTACH_ROLES(PythonModuleRoles)},
@@ -1054,6 +1057,75 @@ static bool parseClassOrDef (tokenInfo *const token,
    return true;
 }

+struct foreachHintData
+{
+   const char *module_name;
+   size_t module_namelen;
+   int kind;
+};
+
+static bool langobjIsInModule (const char *name, hintEntry *hint, void *data)
+{
+   struct foreachHintData *hdata = data;
+
+   const char *f = strrstr (hint->file, hdata->module_name);
+
+   if (f == NULL)
+       return true;            /* continue the finding */
+
+
+   size_t len = hdata->module_namelen
+       ? hdata->module_namelen
+       : (hdata->module_namelen = strlen(hdata->module_name)
+          , hdata->module_namelen);
+   const char * suffix = f + len;
+   if (strcmp(suffix, ".py") != 0)
+       return true;            /* continue the finding */
+
+   if (!(hint->kind && *hint->kind))
+       return true;
+
+   kindDefinition *kdef = NULL;
+   if (hint->kind[1] == '\0')
+   {
+       /* kind letter */
+       kdef = getLanguageKindForLetter (Lang_python, *hint->kind);
+   }
+
+   if (kdef == NULL)
+   {
+       /* long name of letter */
+       kdef = getLanguageKindForName (Lang_python, hint->kind);
+   }
+
+   if (kdef == NULL)
+   {
+       /* Not found; please continue the finding. */
+       return true;
+   }
+
+   hdata->kind = kdef->id;
+   /* Found; please stop the finding. */
+   return false;
+}
+
+static int resolveKind (const char *name, int moduleIndex)
+{
+   tagEntryInfo *e = getEntryInCorkQueue (moduleIndex);
+   if (!e)
+       return K_UNKNOWN;
+
+   struct foreachHintData data = {
+       .kind = K_UNKNOWN,
+       .module_name = e->name,
+       .module_namelen = 0,
+   };
+
+   foreachHints (name, TAG_FULLMATCH, langobjIsInModule, &data);
+
+   return data.kind;
+}
+
 static bool parseImport (tokenInfo *const token)
 {
    tokenInfo *fromModule = NULL;
@@ -1175,7 +1247,8 @@ static bool parseImport (tokenInfo *const token)
                           x = (kind:module,  role:namespace),
                           Y = (kind:unknown, role:imported, scope:module:x) */
                        /* Y */
-                       int index = makeSimplePythonRefTag (name, NULL, K_UNKNOWN,
+                       int kind  = resolveKind (vStringValue (name->string), moduleIndex);
+                       int index = makeSimplePythonRefTag (name, NULL, kind,
                                                            PYTHON_UNKNOWN_IMPORTED,
                                                            XTAG_UNKNOWN);
                        /* fill the scope field for Y */

AmaiKinono commented 3 years ago

So I will show what it is only with the command output of the prototyping.

Oh, I got it immediately!

The example you gave is:

We have a symbol x defined in module M.
We know the kind of x when parsing it, but not when importing it in another file.
We can know the latter, using the reference tag for the import statement.

I just thought of this: https://github.com/universal-ctags/ctags/pull/2727#issuecomment-734324148

We have symbols defined in file F, and we don't know their scope when parsing it.
We have F included in a module. Suppose we could parse include and generate a reference tag for F.
We can know the scope of all symbols in F, using the tag for the include expression.

This is kind of a "reversed" example. But then I found that, F is actually allowed to be included in other modules (like mixin in Ruby), although it may never happen in real life (The common use of include is to just break a large module into files). So I wonder should we use multipass parsing technique for such situations, and what should we do.

AmaiKinono commented 3 years ago

(0) using in-memory database

I actually read a bit about "cork symbol table" in some PRs (but I can't say I'm familiar with the cork thing for now). I guess the approach is to use libreadtags to read a tags file into the cork queue, so we could look it up? And...

(0) is nothing to do with the incremental updating.

Is this because the cork queue is freed after parsing a file?

Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.

So, does this mean we don't use -a option, but libreadtags -> cork queue -> merge with new tags -> tags file?

masatake commented 3 years ago

This is kind of a "reversed" example. But then I found that, F is actually allowed to be included in other modules (like mixin in Ruby), although it may never happen in real life (The common use of include is to just break a large module into files). So I wonder should we use multipass parsing technique for such situations, and what should we do. @AmaiKinono

multipass parsing technique is needed to implement such tags. However, I'm not sure we should do as you wrote. I cannot say well-defined criteria but I think tagging the name of the included file is enough in this case. Handling included files may be part of client tools.

(0) using in-memory database

I actually read a bit about "cork symbol table" in some PRs (but I can't say I'm familiar with the cork thing for now). I guess the approach is to use libreadtags to read a tags file into the cork queue, so we could look it up? And...

If we can load a tags file into the cork symtab, it is the best. But we cannot because the information in a tags file is not enough to reproduce the table.

(0) is nothing to do with the incremental updating.

Is this because the cork queue is freed after parsing a file?

Yes. I'm thinking about extending the current implementation to keep cork queues/tables generated for each input file till the ending of the ctags process. The in-memory database consists of the kept cork queues/tables.

Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.

So, does this mean we don't use -a option, but libreadtags -> cork queue -> merge with new tags -> tags file?

It doesn't mean as you wrote. What I did is nothing to do with -a. For implementing incremental updating, we have to link libreadtags to ctags at least. About incremental updating, what I did is the linking. In addition to linking, I wrote an application utilizing libreadtags in ctags, "multi-pass parsing".

universal-ctags / ctags

Proposal: A possible way of incrementally updating tags file #2697