tree-sitter / kotlin-tree-sitter

Kotlin bindings to the Tree-sitter parsing library
https://tree-sitter.github.io/kotlin-tree-sitter/
MIT License
13 stars 1 forks source link

Working with 10+MB strings #34

Open sunny-chung opened 1 week ago

sunny-chung commented 1 week ago

Environment:

ktreesitter 0.23.0 Kotlin 1.9 JDK 17

Description:

ktreesitter takes 20 seconds to initialize a 13 MB string. All other stuffs, like incremental updating or querying the AST, work fine and quick. This is what I am using to initialize:

        ast = parser.parse { byte, point ->
            if (byte in 0u until text.length.toUInt()) {
                s.substring(byte.toInt() ..byte.toInt()).let {
                    val codePoints = it.codePoints().toArray()
                    if (codePoints.size > 1 || codePoints.first() > 255) {
                        "X" // replace multibyte char as single-byte char
                    } else {
                        it
                    }
                }
            } else {
                "" // the doc is wrong. null would result in crash
            }
        }

I tried to feed direct strings like below, and it doesn't speed up to an acceptable level, and got troubles with multi-byte characters.

ast = parser.parse(singleByteCharSequence)

My use case is to load and edit a 13 MB JSON with syntax highlighting. If this passes, I will feed in even larger JSON data.

Anything could be done or work around? Not sure if it helps, I am using a rope data structure on JVM side.

ObserverOfTime commented 6 days ago

How long does the CLI take for the same file? (tree-sitter parse foo.json)

sunny-chung commented 5 days ago

About 2 seconds.

> time tree-sitter parse 13mb.json > /dev/null

real    0m2.174s
user    0m1.721s
sys 0m0.449s
sunny-chung commented 3 days ago

Additional information:

If the traditional regular expression approach is used, it takes 765ms to parse the 13 MB JSON in JVM. To parse a 1 MB JSON, ktreesitter takes 1.4s and regular expression takes 60ms.

I guess the overhead is in the round-trip interop I/O time between C and JVM multiplied by the length of the string. The situation might be improved if the parser queries a substring at a time rather than a character.

ObserverOfTime commented 2 days ago

The situation might be improved if the parser queries a substring at a time rather than a character.

That is up to your callback.

sunny-chung commented 2 days ago

Could you elaborate more? The ParseCallback type provided by ktreesitter has an input of a single byte and Point, not a range.

Quoting from ktreesitter's doc:

expect fun parse(oldTree: Tree? = null, callback: ParseCallback): Tree

typealias ParseCallback = (byte: UInt, point: Point) -> CharSequence?
ObserverOfTime commented 2 days ago

You can return a chunk and the position will be offset accordingly.

ObserverOfTime commented 2 days ago

and got troubles with multi-byte characters

Please submit another issue with more details.

the doc is wrong. null would result in crash

Fixed in 7fc4734