utk-se / WorldSyntaxTree

Language-agnostic parsing of World of Code repositories

Neo4j: String text limited to ~4k bytes #4

Closed robobenklein closed 3 years ago

robobenklein commented 3 years ago

For some extremely large files the root WSTNode won't contain the entire file content.

robobenklein commented 3 years ago

https://neomodel.readthedocs.io/en/latest/properties.html#id2

robobenklein commented 3 years ago

what should we do when the text content of a node is larger than 4k bytes? ideas:

robobenklein commented 3 years ago

@solsane @AZHenley opinions? how important is it to have the full text for huge files?

each smaller node will still have its own text content so long as there exist nodes smaller than 4k, but there could be an instance of a single comment larger than 4k bytes

argvrutter commented 3 years ago

Here's a thought. So, afaik for tree sitter, each node includes the full text for that node. I'm imagining a setup where each node would occupy approximately one token of the text, but the nodes would still collectively contain the whole corpus. This raises two questions: 1) whether there's an intuitive way to accomplish this via tree sitter, i.e. just collecting text from terminal nodes, and 2) how this would interfere with text-based queries.
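To make the "collect from terminal nodes" idea concrete, here is a rough sketch. The `Node` dataclass is a hypothetical stand-in for a parsed syntax node, not the tree-sitter API; it only assumes each node knows its start and end byte offsets and its children.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a syntax node (not the tree-sitter API)."""
    start_byte: int
    end_byte: int
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> Iterator[Node]:
    """Yield terminal (childless) nodes in source order."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from leaves(child)


def leaf_texts(source: bytes, root: Node) -> List[bytes]:
    """Slice the source once per terminal node."""
    return [source[n.start_byte:n.end_byte] for n in leaves(root)]


# Toy tree for the source "x = 1": three leaves, none covering the spaces.
source = b"x = 1"
root = Node(0, 5, [Node(0, 1), Node(2, 3), Node(4, 5)])
tokens = leaf_texts(source, root)
```

In this toy tree the leaves skip the whitespace between tokens, so joining `tokens` does not reproduce `source`. If real leaves behave the same way, the byte offsets would have to be stored alongside the text to rebuild the file, and a text query that spans several tokens would no longer match any single node.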

argvrutter commented 3 years ago

Beyond that, I would probably opt to limit the size to 4k bytes. I'm not entirely certain how it's set up now, but I would imagine that having a text field associated with a syntax node wouldn't add too much complexity. On the other hand, I'm not sure whether the graphdb is creating string->string relationships; if it is, that may not be necessary.

robobenklein commented 3 years ago

Right now the text property exists on the type WSTText, which is a node dedicated to just storing text content.

Every WSTNode points to a respective WSTText node. The problem is that the DB can't store a single massive text property in any kind of node. (I think with the current indexing type, the limit is ~8KB?)

But this is almost always just a problem for the outermost WSTNode nodes, since those are the only ones that contain the entire content of the file. All the smaller nodes fit fine, except for some cases where a single comment might be thousands of characters?

robobenklein commented 3 years ago

I think I found a better solution to retain full text content:

Split WSTText into WSTUniqueText and WSTHugeText:

WSTUniqueText is used when the content is small enough to index (i.e. tsnode text is smaller than ~4k) and otherwise it is stored in a non-indexed (potentially duplicate / non-unique) WSTHugeText node.

This still works for querying because both kinds of text-holding nodes have the text property, and they are both still labeled as WSTText (neo4j nodes can have multiple labels, cool)
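A minimal sketch of the selection rule described above, in plain Python. The 4000-byte cutoff is an assumption taken from the "~4k" figure in this thread, and `text_labels` is a hypothetical helper name, not something that exists in the repository.

```python
# Assumed cutoff, based on the "~4k" figure discussed in this thread.
MAX_INDEXED_BYTES = 4000


def text_labels(text: str, limit: int = MAX_INDEXED_BYTES) -> tuple:
    """Return the labels a text-holding node would carry.

    The size is measured in encoded bytes rather than characters,
    since the limit under discussion is a byte limit.
    """
    size = len(text.encode("utf-8"))
    if size <= limit:
        return ("WSTText", "WSTUniqueText")
    return ("WSTText", "WSTHugeText")
```

Both outcomes keep the shared `WSTText` label, which is what lets a query on the `text` property match either kind of node.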

Later on, if we want to find duplication statistics for content larger than ~4k bytes, we will need to write a new script to coalesce the relationships between WSTNodes and WSTHugeTexts.
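One possible shape for that script, sketched in plain Python with no database access: hash each huge text and group node ids by digest. The `(node_id, text)` input format and the `group_duplicates` name are assumptions for illustration; fetching the rows from Neo4j and rewriting the relationships are left out.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_duplicates(huge_texts: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group ids of huge-text nodes that hold identical content.

    huge_texts: (node_id, text) pairs, e.g. exported from the database.
    Returns a mapping of SHA-256 digest -> node ids, keeping only
    digests shared by more than one node.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for node_id, text in huge_texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(node_id)
    return {d: ids for d, ids in groups.items() if len(ids) > 1}


rows = [(1, "a" * 5000), (2, "b" * 5000), (3, "a" * 5000)]
duplicates = group_duplicates(rows)
```

Comparing digests instead of full strings keeps memory use small for very large files. Each resulting group could then be collapsed to one WSTHugeText node with the other nodes' relationships repointed to it.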