Open leamingrad opened 4 months ago
~I've just noticed an issue with the position information around line boundaries - I'll fixup and then mark as ready for review again~
Edit: On further checking there doesn't seem to be an issue, but I have updated the tests to make things clearer
The size of the diff here held me off from looking too closely for a while...
Have you considered other approaches to speeding up span_to_position
? The additional data you're looking to include in the AST is pretty sizable as it touches all nodes, and it's duplicating the position info that's in the existing start
and end
indices.
For one possible approach, for the JS yaml
library I ended up defining a lineCounter
option and a LineCounter class
that automatically tracks newlines, and using them, makes the index -> row/col position mapping much faster.
Could that sort of an approach work for you as well?
No problem about the delay, thanks for getting back to me.
I think there are definitely other approaches which could speed up span_to_position
inside fluent-compiler
itself, either something that looks a bit like this PR or something like a LineCounter
class.
My original thinking was that by providing tools to enable that inside python-fluent
would be generally useful - do you think there is anything it would be useful for python-fluent
to provide?
I would prefer an approach that added the least cost to users who do not need the line/column position for all nodes. Always calculating the positions during the parse seems like mostly unnecessary work, while providing some kind of side channel for getting the newline indices during the parse seems like a much lighter addition to fluent.syntax
that would also avoid duplicating information in the parsed results.
(Note: The original context for this PR is covered in https://github.com/django-ftl/fluent-compiler/issues/32, but the key parts are copied in here)
Hi 👋. Firstly, thanks for this library, and for the project in general!
I'm currently working with
python-fluent
(viafluent-compiler
anddjango-ftl
) as part of a large Django project. Its generally been great, but unfortunately parsing and compiling our fluent files contributes quite a lot (~10s) to our apps startup time.With that in mind, I was hoping to contribute a couple of optimisations to
fluent-compiler
andproject-fluent
to speed up the handling of large fluent files.One of the biggest contributor to compile times in
fluent-compiler
is thespan_to_position
function. This function takes thestart
of aSpan
node, and coverts it into (row
,column
) tuple. However in order to do this, it needs to scan through the source for all newlines to work out how many rows their have been.This PR is an attempt to solve that problem by returning more information from the parsing code in
fluent.syntax
. Specifically, we add a new type of AST node to represent a row/column position in the source. We then track the current row/column while scanning through the source stream, and store that inSpan
nodes as they are created.I've tried to make this PR as easy-to-review as possible, but unfortunately there is still quite a large diff, so if there is anything which would be helpful in terms of breaking commits up/squashing them together, please let me know.
I'm not entirely clear on whether the structure of the AST nodes is part of the fluent spec, or if it is something which
python-fluent
can decide unilaterally. If this turns out to be an insurmountable problem, I can switch to trying to solve this insidefluent-compiler
instead.(Note: This PR will result in a one-line merge conflict with #201, so I will rebase as needed if one is merged first)