robobenklein closed this issue 3 years ago
What should we do when the text content of a node is larger than 4k bytes? Ideas:
1) WST_NO_TEXT_AVAILABLE
2) an order property, and attach multiple WSTText objects to a single node
@solsane @AZHenley opinions? How important is it to have the full text for huge files?
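For the second idea, here is a rough sketch of what the splitting might look like. The 4096-byte limit and the split_text / join_text helpers are assumptions for illustration, not anything from the actual codebase. Splitting is done on bytes, so a chunk can end in the middle of a multi-byte UTF-8 character; the pieces would have to be joined back in order before decoding.

```python
from typing import List, Tuple

CHUNK_LIMIT = 4096  # assumed limit in bytes; the real DB limit may differ


def split_text(data: bytes, limit: int = CHUNK_LIMIT) -> List[Tuple[int, bytes]]:
    """Split data into (order, chunk) pairs of at most `limit` bytes each."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [
        (order, data[start:start + limit])
        for order, start in enumerate(range(0, len(data), limit))
    ]


def join_text(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Reassemble the original bytes by sorting chunks on their order value."""
    return b"".join(chunk for _, chunk in sorted(chunks))
```

Each (order, chunk) pair would map to one WSTText object, with the order value stored as the property that lets a reader put the pieces back together.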
Each smaller node will still have its own text content so long as there exist nodes smaller than 4k, but there could be an instance of a single comment larger than 4k bytes.
Here's a thought. So, afaik for tree-sitter, each node includes the full text for that node. I'm imagining a setup where each node would occupy approximately one token of the text, but the nodes would still collectively contain the whole corpus. The two questions this raises are 1) whether there's an intuitive way to accomplish this via tree-sitter, i.e. just collecting from terminal nodes, and 2) how this would interfere with text-based queries.
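One way to picture the terminal-node idea, using a hypothetical Span class as a stand-in rather than the real tree-sitter API. tree-sitter nodes do carry byte offsets, and the gaps between leaves (whitespace, for instance) are usually not covered by any leaf, so concatenating leaf text alone probably would not reproduce the file; the offsets would need to be kept as well.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    """Hypothetical stand-in for a syntax node: a byte range plus children."""
    start_byte: int
    end_byte: int
    children: List["Span"] = field(default_factory=list)


def leaves(node: Span) -> Iterator[Span]:
    """Yield terminal nodes (those without children) in source order."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from leaves(child)


def leaf_texts(source: bytes, root: Span) -> List[bytes]:
    """Slice the source once per leaf, so each piece is roughly one token."""
    return [source[leaf.start_byte:leaf.end_byte] for leaf in leaves(root)]


def uncovered_bytes(root: Span) -> int:
    """Count bytes in the root's range that no leaf covers (assumes no overlap)."""
    covered = sum(leaf.end_byte - leaf.start_byte for leaf in leaves(root))
    return (root.end_byte - root.start_byte) - covered
```

For a source like `x = 1` with three leaves, the leaf texts are the three tokens and the two spaces show up only as uncovered bytes, which is the part a text-based query over leaves would miss.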
Beyond that, I would probably opt to limit the size to 4k bytes. I'm not entirely certain how it's set up now, but I would imagine that having a text field associated with a syntax node wouldn't add too much complexity. On the other hand, I'm not sure whether the graphdb is creating string->string relationships; if it is, maybe that isn't necessary.
Right now the text property exists on the type WSTText, which is a node dedicated to just storing text content.
Every WSTNode node points to a respective WSTText node. The problem is that the DB can't store a single massive text property in any kind of node. (I think with the current indexing type, the limit is ~8KB?)
But this is almost always just a problem for the outermost WSTNode nodes, since those are the only ones that contain the entire content of the file. All the smaller nodes fit fine, except for some cases where a single comment might be thousands of characters?
I think I found a better solution to retain full text content:
Split WSTText into WSTUniqueText and WSTHugeText:
WSTUniqueText is used when the content is small enough to index (i.e. tsnode text is smaller than ~4k); otherwise it is stored in a non-indexed (potentially duplicate / non-unique) WSTHugeText node.
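The size check could be as small as the sketch below. The exact cutoff of 4096 bytes and the text_kind helper are assumptions (the comment above only says "~4k"), and the size is measured in UTF-8 bytes rather than characters, since the DB limit is a byte limit.

```python
INDEX_LIMIT = 4096  # assumed cutoff in bytes; the thread only says "~4k"


def text_kind(text: str, limit: int = INDEX_LIMIT) -> str:
    """Pick which kind of text node would hold this content."""
    size = len(text.encode("utf-8"))  # measure bytes, not characters
    return "WSTUniqueText" if size < limit else "WSTHugeText"
```

Note that 3000 two-byte characters already land in WSTHugeText under this check, even though the string is well under 4k characters long.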
This still works for querying because both kinds of text-holding nodes have the text property, and they are both still labeled as WSTText (neo4j nodes can have multiple labels, cool).
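To make the multiple-label point concrete, here is a sketch of the Cypher this could translate to, wrapped in Python strings. It assumes plain Cypher is sent through some driver; the thread doesn't show how the project actually talks to neo4j, and create_text_query and FIND_BY_TEXT are hypothetical names.

```python
def create_text_query(huge: bool) -> str:
    """Build a Cypher CREATE giving the node the shared and the specific label."""
    specific = "WSTHugeText" if huge else "WSTUniqueText"
    return f"CREATE (t:WSTText:{specific} {{text: $text}}) RETURN t"


# Matching on the shared label finds both kinds of text node.
FIND_BY_TEXT = "MATCH (t:WSTText) WHERE t.text CONTAINS $needle RETURN t"
```

A query written against WSTText doesn't need to know which of the two kinds it hit, which is why existing text queries would keep working.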
Later on, if we want to find duplication statistics for content larger than 4E3, we will need to write a new script to coalesce the relationships between WSTNodes and WSTHugeTexts.
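That script doesn't exist yet, but the core of it might look something like this sketch: hash each huge text and group the node ids that share a digest. The group_duplicates name and the (node id, text) input shape are assumptions; SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_duplicates(texts: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group node ids by the SHA-256 of their text, keeping only repeated content."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node_id, text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(node_id)
    # Drop digests seen only once; those texts have no duplicates.
    return {d: ids for d, ids in groups.items() if len(ids) > 1}
```

Each resulting group is a set of WSTHugeText nodes that could be merged into one, with their WSTNode relationships repointed at the survivor.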
For some extremely large files the root WSTNode won't contain the entire file content.