Open AmaiKinono opened 4 years ago
This includes multiple sub issues. I made a Wiki page for this topic. https://github.com/universal-ctags/ctags/wiki/Designing:-updating-tags-file-incrementally @AmaiKinono, I assume you can edit the page.
I think about the initial goal. "Adding -C action to readtags" is not so hard task but we need anyway.
$ readtags -t a.tags -C
--fields={...} --kinds-C=... ...
-C
option reproduces the command line used for generating an existing tags file.
The option refers pseudo tags.
For achieving the goal, we have to do: A. adding more pseudo tags B. extending readtags command
I found I implemented TAG_EXTRA_DESCRIPTION and TAG_FIELD_DESCRIPTION ago.
$ u-ctags --list-pseudo-tags
#NAME ENABLED DESCRIPTION
JSON_OUTPUT_VERSION off the version of json output stream format
TAG_EXTRA_DESCRIPTION off the names and descriptions of enabled extras
TAG_FIELD_DESCRIPTION off the names and descriptions of enabled fields
...
I cannot remember when I implemented them.
Anyway, I should turn on the ptags by default.
I saw this in the wiki
Where do timestamps (and content hash values) go
Using hash is obviously another way to do the detection. Do we want one of them or both?
Attaching timestamps to 'F' kind tags is worked:
$ ./ctags --extras=+f -o - main/main.c | grep timestamp
main.c main/main.c 1;" F language:C timestamp:Sat Oct 24 01:31:30 2020
$ ./ctags --output-format=json --extras=+f -o - main/main.c | grep timestamp
{"_type": "tag", "name": "main.c", "path": "main/main.c", "pattern": false, "language": "C", "kind": "file", "timestamp": "Sat Oct 24 01:31:30 2020"}
Should I change the print format?
I think it's good as long as ctags itself could compare between timestamps. But if we assume that client tools may want to use it in some way, I suggest adding the time zone so we could convert it with the Unix time. The encode-time
function in Elisp also requires the time zone.
Btw, is there any reason to create these tags for files, rather than using pseudo tags?
I found the output of stat command includes such a timestamp.
$ stat /
File: /
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: fd00h/64768d Inode: 96 Links: 22
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:root_t:s0
Access: 2020-11-14 04:55:32.965210497 +0900
Modify: 2020-10-31 13:45:40.465843725 +0900
This format may be good enough. For parsing it, we can use strptime. If a platform doesn't have the function, we can use https://github.com/openssh/openssh-portable/blob/master/openbsd-compat/strptime.c .
Btw, is there any reason to create these tags for files, rather than using pseudo tags?
As I wrote in the Wikipage, there are three ways to store timestamps to a tags file.
In the prototyping, I chose F/file because F/file kind is already there. I will enumerate benefits using F/file.
weaknesses of F/file are
If a user gives the list of input files for updating, F/file is enough. If a user gives only existing tags file, the ptags is better. In the latter case, ctags must enumerate input files by reading the tags file.
I cannot do much at once. I would like to explore what we can do more with F/file kind.
I have studied the way to update a file from sed -i.
$ strace -y -o /tmp/LOG -f sed -i -e 's/a/b/g' /tmp/i.tmp
1055322 openat(AT_FDCWD, "/tmp/i.tmp", O_RDONLY) = 3</tmp/i.tmp>
1055322 ioctl(3</tmp/i.tmp>, TCGETS, 0x7ffc82af0ac0) = -1 ENOTTY (Inappropriate ioctl for device)
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 lgetxattr("/tmp/i.tmp", "security.selinux", "unconfined_u:object_r:user_tmp_t"..., 255) = 36
1055322 access("/var/run/setrans/.setrans-unix", F_OK) = -1 ENOENT (No such file or directory)
1055322 futex(0x7f5a1647d5d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
1055322 futex(0x7f5a1647c2e8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDONLY|O_CLOEXEC) = 4</proc/1055322/task/1055322/attr/fscreate>
1055322 read(4</proc/1055322/task/1055322/attr/fscreate>, "", 4095) = 0
1055322 close(4</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDWR|O_CLOEXEC) = 4</proc/1055322/task/1055322/attr/fscreate>
1055322 write(4</proc/1055322/task/1055322/attr/fscreate>, "unconfined_u:object_r:user_tmp_t"..., 36) = 36
1055322 close(4</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 umask(077) = 022
1055322 getpid() = 1055322
1055322 openat(AT_FDCWD, "/tmp/sedZXwkGG", O_RDWR|O_CREAT|O_EXCL, 0600) = 4</tmp/sedZXwkGG>
1055322 umask(022) = 077
1055322 fcntl(4</tmp/sedZXwkGG>, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDWR|O_CLOEXEC) = 5</proc/1055322/task/1055322/attr/fscreate>
1055322 write(5</proc/1055322/task/1055322/attr/fscreate>, NULL, 0) = 0
1055322 close(5</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 read(3</tmp/i.tmp>, "b\n\n", 4096) = 3
1055322 fstat(4</tmp/sedZXwkGG>, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
1055322 read(3</tmp/i.tmp>, "", 4096) = 0
1055322 fchown(4</tmp/sedZXwkGG>, 1000, 1000) = 0
1055322 fgetxattr(3</tmp/i.tmp>, "system.posix_acl_access", 0x7ffc82af0a70, 132) = -1 ENODATA (No data available)
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 fsetxattr(4</tmp/sedZXwkGG>, "system.posix_acl_access", "\2\0\0\0\1\0\6\0\377\377\377\377\4\0\4\0\377\377\377\377 \0\4\0\377\377\377\377", 28, 0) = 0
1055322 close(3</tmp/i.tmp>) = 0
1055322 write(4</tmp/sedZXwkGG>, "b\n\n", 3) = 3
1055322 close(4</tmp/sedZXwkGG>) = 0
1055322 rename("/tmp/sedZXwkGG", "/tmp/i.tmp") = 0
I have studied the way to update a file from sed -i.
Very cool. In which way this would help us? I mean ctags already has the -a
option, and we have to sort it afterwards. Or are you planning to use it in other parts? Actually my feeling on this is you can use it in the edittags
command you've mentioned.
I would like to withdraw the goal "Adding -C action to readtags". I made a pull request for adding timestamp: field in #2732.
Like the timestamp field, linking libreadtags to ctags is another step for the incremental updating. The multipass parsing is just another feature that requires linking libreadtags to ctags. There are two approaches to implement the multipass parsing: (0) using in-memory database (1) reading an existing tags file
ctags already infrastructure of (0). If you are interested in this idea, search "symtab" in our source tree. (0) is nothing to do with the incremental updating. Like the incremental updating, approach (1) requires linking libreadtags to ctags.
Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating. @AmaiKinono, you wrote you are not familiar with the idea of "multipass parsing." I'm bad in English. So I will show what it is only with the command output of the prototyping.
$ cat input.py
from mymod import X
# ...
$ cat mymod.py
X = 1
$ u-ctags --extras=+r --fields=+r --fields=+K -o 1st-pass.tags input.py mymod.py
$ u-ctags --extras=+r --fields=+r --fields=+K -o 2nd-pass.tags --_hint=1st-pass.tags input.py mymod.py
$ diff -ruN 1st-pass.tags 2nd-pass.tags
--- 1st-pass.tags 2020-11-28 04:19:03.296310967 +0900
+++ 2nd-pass.tags 2020-11-28 04:19:13.281347322 +0900
@@ -9,6 +9,6 @@
!_TAG_PROGRAM_NAME Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL https://ctags.io/ /official site/
!_TAG_PROGRAM_VERSION 5.9.0 /7f84a443/
-X input.py /^from mymod import X $/;" unknown module:mymod roles:imported
+X input.py /^from mymod import X $/;" variable module:mymod roles:imported
X mymod.py /^X = 1$/;" variable roles:def
mymod input.py /^from mymod import X $/;" module roles:namespace
For recoding I would like to show the diff of python.c:-)
diff --git a/parsers/python.c b/parsers/python.c
index 7806a903..a506a044 100644
--- a/parsers/python.c
+++ b/parsers/python.c
@@ -23,6 +23,8 @@
#include "debug.h"
#include "xtag.h"
#include "objpool.h"
+#include "hint.h"
+#include "routines.h"
#define isIdentifierChar(c) \
(isalnum (c) || (c) == '_' || (c) >= 0x80)
@@ -135,7 +137,8 @@ static kindDefinition PythonKinds[COUNT_KIND] = {
{true, 'c', "class", "classes"},
{true, 'f', "function", "functions"},
{true, 'm', "member", "class members"},
- {true, 'v', "variable", "variables"},
+ {true, 'v', "variable", "variables",
+ .referenceOnly = false, ATTACH_ROLES(PythonUnknownRoles)},
{true, 'I', "namespace", "name referring a module defined in other file"},
{true, 'i', "module", "modules",
.referenceOnly = true, ATTACH_ROLES(PythonModuleRoles)},
@@ -1054,6 +1057,75 @@ static bool parseClassOrDef (tokenInfo *const token,
return true;
}
+struct foreachHintData
+{
+ const char *module_name;
+ size_t module_namelen;
+ int kind;
+};
+
+static bool langobjIsInModule (const char *name, hintEntry *hint, void *data)
+{
+ struct foreachHintData *hdata = data;
+
+ const char *f = strrstr (hint->file, hdata->module_name);
+
+ if (f == NULL)
+ return true; /* continue the finding */
+
+
+ size_t len = hdata->module_namelen
+ ? hdata->module_namelen
+ : (hdata->module_namelen = strlen(hdata->module_name)
+ , hdata->module_namelen);
+ const char * suffix = f + len;
+ if (strcmp(suffix, ".py") != 0)
+ return true; /* continue the finding */
+
+ if (!(hint->kind && *hint->kind))
+ return true;
+
+ kindDefinition *kdef = NULL;
+ if (hint->kind[1] == '\0')
+ {
+ /* kind letter */
+ kdef = getLanguageKindForLetter (Lang_python, *hint->kind);
+ }
+
+ if (kdef == NULL)
+ {
+ /* long name of letter */
+ kdef = getLanguageKindForName (Lang_python, hint->kind);
+ }
+
+ if (kdef == NULL)
+ {
+ /* Not found; please continue the finding. */
+ return true;
+ }
+
+ hdata->kind = kdef->id;
+ /* Found; please stop the finding. */
+ return false;
+}
+
+static int resolveKind (const char *name, int moduleIndex)
+{
+ tagEntryInfo *e = getEntryInCorkQueue (moduleIndex);
+ if (!e)
+ return K_UNKNOWN;
+
+ struct foreachHintData data = {
+ .kind = K_UNKNOWN,
+ .module_name = e->name,
+ .module_namelen = 0,
+ };
+
+ foreachHints (name, TAG_FULLMATCH, langobjIsInModule, &data);
+
+ return data.kind;
+}
+
static bool parseImport (tokenInfo *const token)
{
tokenInfo *fromModule = NULL;
@@ -1175,7 +1247,8 @@ static bool parseImport (tokenInfo *const token)
x = (kind:module, role:namespace),
Y = (kind:unknown, role:imported, scope:module:x) */
/* Y */
- int index = makeSimplePythonRefTag (name, NULL, K_UNKNOWN,
+ int kind = resolveKind (vStringValue (name->string), moduleIndex);
+ int index = makeSimplePythonRefTag (name, NULL, kind,
PYTHON_UNKNOWN_IMPORTED,
XTAG_UNKNOWN);
/* fill the scope field for Y */
So I will show what it is only with the command output of the prototyping.
Oh, I got it immediately!
The example you gave is:
x
defined in module M
.x
when parsing it, but not when importing it in another file.I just thought of this: https://github.com/universal-ctags/ctags/pull/2727#issuecomment-734324148
F
, and we don't know their scope when parsing it.F
included in a module. Suppose we could parse include
and generate a reference tag for F
.F
, using the tag for the include
expression.This is kind of a "reversed" example. But then I found that, F
is actually allowed to be included in other modules (like mixin
in Ruby), although it may never happen in real life (The common use of include
is to just break a large module into files). So I wonder should we use multipass parsing technique for such situations, and what should we do.
(0) using in-memory database
I actually read a bit about "cork symbol table" in some PRs (but I can't say I'm familiar with the cork thing for now). I guess the approach is to use libreadtags
to read a tags file into the cork queue, so we could look it up? And...
(0) is nothing to do with the incremental updating.
Is this because the cork queue is freed after parsing a file?
Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.
So, does this mean we don't use -a
option, but libreadtags
-> cork queue -> merge with new tags -> tags file?
This is kind of a "reversed" example. But then I found that, F is actually allowed to be included in other modules (like mixin in Ruby), although it may never happen in real life (The common use of include is to just break a large module into files). So I wonder should we use multipass parsing technique for such situations, and what should we do. @AmaiKinono
multipass parsing technique
is needed to implement such tags. However, I'm not sure we should do as you wrote.
I cannot say well-defined criteria but I think tagging the name of the included file is enough in this case. Handling included
files may be part of client tools.
(0) using in-memory database
I actually read a bit about "cork symbol table" in some PRs (but I can't say I'm familiar with the cork thing for now). I guess the approach is to use libreadtags to read a tags file into the cork queue, so we could look it up? And...
If we can load a tags file into the cork symtab, it is the best. But we cannot because the information in a tags file is not enough to reproduce the table.
(0) is nothing to do with the incremental updating.
Is this because the cork queue is freed after parsing a file?
Yes. I'm thinking about extending the current implementation to keep cork queues/tables generated for each input file till the ending of the ctags process. The in-memory database consists of the kept cork queues/tables.
Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.
So, does this mean we don't use -a option, but libreadtags -> cork queue -> merge with new tags -> tags file?
It doesn't mean as you wrote. What I did is nothing to do with -a. For implementing incremental updating, we have to link libreadtags to ctags at least. About incremental updating, what I did is the linking. In addition to linking, I wrote an application utilizing libreadtags in ctags, "multi-pass parsing".
We had some discussion before, see https://github.com/universal-ctags/ctags/issues/2475#issuecomment-621366924. @masatake suggested the following scheme:
2 and 3 can be done by the
-a
option.I want to suggest a way to do 1 in this proposal. My thought is we use a group of pseudo tags to record the last modified time for each source file. It may look like this:
Then we offer an option like
--increment
to do incremental update. With this option, we compare this pseudo tag with the actual timestamp, and find files that are and are not updated (new files are counted as updated ones). This option should do nothing when generating a tags file for the first time.For writing the temporary tags file, maybe we could use readtags like this:
Another problem is to ensure that the command line options are the same. A possible way we've discussed is to also put them in the pseudo tags, and @masatake said this is related to "multipass parsing". I'm not familiar with that idea, but for the scope of incremental tags file update, I think if people are doing this, then the command line is probably generated by a script/makefile/client tool/ctags configuration file, so it wouldn't become a problem if we ask for same options between multiple runs, rather than letting ctags ensure it.