projectfluent / python-fluent

Python implementation of Project Fluent
https://projectfluent.org/python-fluent/
Other
210 stars 26 forks source link

Add source position information to span nodes #202

Open leamingrad opened 4 months ago

leamingrad commented 4 months ago

(Note: The original context for this PR is covered in https://github.com/django-ftl/fluent-compiler/issues/32, but the key parts are copied in here)

Hi 👋. Firstly, thanks for this library, and for the project in general!

I'm currently working with python-fluent (via fluent-compiler and django-ftl) as part of a large Django project. Its generally been great, but unfortunately parsing and compiling our fluent files contributes quite a lot (~10s) to our apps startup time.

With that in mind, I was hoping to contribute a couple of optimisations to fluent-compiler and project-fluent to speed up the handling of large fluent files.

One of the biggest contributor to compile times in fluent-compiler is the span_to_position function. This function takes the start of a Span node, and coverts it into (row, column) tuple. However in order to do this, it needs to scan through the source for all newlines to work out how many rows their have been.

This PR is an attempt to solve that problem by returning more information from the parsing code in fluent.syntax. Specifically, we add a new type of AST node to represent a row/column position in the source. We then track the current row/column while scanning through the source stream, and store that in Span nodes as they are created.

I've tried to make this PR as easy-to-review as possible, but unfortunately there is still quite a large diff, so if there is anything which would be helpful in terms of breaking commits up/squashing them together, please let me know.

I'm not entirely clear on whether the structure of the AST nodes is part of the fluent spec, or if it is something which python-fluent can decide unilaterally. If this turns out to be an insurmountable problem, I can switch to trying to solve this inside fluent-compiler instead.

(Note: This PR will result in a one-line merge conflict with #201, so I will rebase as needed if one is merged first)

leamingrad commented 4 months ago

~I've just noticed an issue with the position information around line boundaries - I'll fixup and then mark as ready for review again~

Edit: On further checking there doesn't seem to be an issue, but I have updated the tests to make things clearer

eemeli commented 2 months ago

The size of the diff here held me off from looking too closely for a while...

Have you considered other approaches to speeding up span_to_position? The additional data you're looking to include in the AST is pretty sizable as it touches all nodes, and it's duplicating the position info that's in the existing start and end indices.

For one possible approach, for the JS yaml library I ended up defining a lineCounter option and a LineCounter class that automatically tracks newlines, and using them, makes the index -> row/col position mapping much faster.

Could that sort of an approach work for you as well?

leamingrad commented 2 months ago

No problem about the delay, thanks for getting back to me.

I think there are definitely other approaches which could speed up span_to_position inside fluent-compiler itself, either something that looks a bit like this PR or something like a LineCounter class.

My original thinking was that by providing tools to enable that inside python-fluent would be generally useful - do you think there is anything it would be useful for python-fluent to provide?

eemeli commented 2 months ago

I would prefer an approach that added the least cost to users who do not need the line/column position for all nodes. Always calculating the positions during the parse seems like mostly unnecessary work, while providing some kind of side channel for getting the newline indices during the parse seems like a much lighter addition to fluent.syntax that would also avoid duplicating information in the parsed results.