Closed jhund closed 10 years ago
I had this on my TODO list for a long time but I haven't found a good way to implement this, yet.
Since some parts of the document are parsed multiple times (e.g. a paragraph that is then further parsed with span level parsers), one would need to make sure to include the correct line/column numbers when running the span parsers later.
Another thing to look out for are characters that are stripped away but which need to be taken into account, e.g. list item markers, indentation with spaces/tabs, blockquote markers, ...
I spent a bit of time looking into how we could do it and played with some updates to kramdown:
I agree that column level location could be tricky. However, there is a lot of value in having just line level location information. E.g., Nokogiri also provides just line level info for its Nodes and no column level.
When parsing IDMLStories, I also added the story name to the location information.
I think it's best if we store the location info on each Kramdown::Element instance. I could think of the following options:
el.options[:location] = { :line => 123 }
el.location = { :line => 123, :idml_story => 'u123' }
I would implement the location as a Hash so that we are flexible in what we store there:
As you mentioned, since kramdown parses block-level elements in a first pass, and span-level elements in a second pass, collecting location info needs to happen at multiple levels:
We can add location info to :block elements as we create them. During the :block parsing stage (in #parse_blocks
method), we have access to the global StringScanner instance @src
and can compute line info from it (see below). We'd have to modify each block type's parse_<type>
method to store the line number when adding the element. Here is for example how we would modify the :paragraph parser:
# lib/kramdown/parser/kramdown/paragraph.rb
def parse_paragraph
# NEW: record the start line number before @src moves #pos
start_line_number = @src.current_line_number
result = @src.scan(PARAGRAPH_MATCH)
while !@src.match?(self.class::PARAGRAPH_END)
result << @src.scan(PARAGRAPH_MATCH)
end
result.chomp!
if @tree.children.last && @tree.children.last.type == :p
@tree.children.last.children.first.value << "\n" << result
else
# UPDATED: add :location option when creating kramdown element
@tree.children << new_block_el(
:p, nil, nil, { :location => { :line => start_line_number } }
)
result.lstrip!
@tree.children.last.children << new_element(@text_type, result)
end
true
end
When spans are parsed (:raw_text in #update_tree), we only have access to local StringScanners. So we only get the line offset within the parent block element. To get the correct global line number, we need to add a child span element's line number to it's parent block element's start line number.
We can do this if we replace all instances of Element.new
with a factory method called new_element
. In that factory method, we add the span element's local line number to the global @current_location[:line]
i_var and store it on new :span level elements:
# lib/kramdown/parser/kramdown.rb
def new_element(*args)
el = Element.new(*args)
if :span == Element.category(el)
el.options[:location] ||= {
:line => @src.current_line_number + @current_location[:line] - 1
} if @src && @current_location && @current_location[:line]
end
el
end
The @current_location
i_var is updated at the beginning of the #update_tree
method for every element that already has :location information (would be :block level elements only):
# lib/kramdown/parser/kramdown.rb
def update_tree(element)
@current_location = element.options[:location] if element.options[:location]
…
end
In order get a line number from StringScanner we'd patch it like so:
class StringScanner
# Returns the line number for current charpos.
# NOTE: Expects \n as line endings.
def current_line_number
string[0..charpos].count("\n") + 1
end
end
What do you think of this approach?
Looks good!
Is this something you'd like to add to kramdown? If so, should I submit a pull request?
The only remaining question is how we store the location Hash on Kramdown::Element:
Either way is fine for me. What's your preference?
This would definitely add a benefit to kramdown, I would gladly accept a pull request. The only thing I have to make sure is that every change is backwards-compatible but that should be no problem.
Regarding the location hash: I would store it in #options
.
Sounds good. I will work on a pull request (and include some tests, too)
Thanks!
location information is now part of kramdown.
@gettalong we would like to add location/position information to the kramdown parser for validation purposes, so that we can include line numbers when reporting validation issues with kramdown documents.
In your opinion, what's the best way to go about this, both technically and resource-wise?