imazen / repositext

MIT License
4 stars 2 forks source link

Adding location/position information to kramdown parser #15

Closed jhund closed 10 years ago

jhund commented 11 years ago

@gettalong we would like to add location/position information to the kramdown parser for validation purposes, so that we can include line numbers when reporting validation issues with kramdown documents.

In your opinion, what's the best way to go about this, both technically and resource-wise?

gettalong commented 11 years ago

I had this on my TODO list for a long time but I haven't found a good way to implement this, yet.

Since some parts of the document are parsed multiple times (e.g. a paragraph that is then further parsed with span level parsers), one would need to make sure to include the correct line/column numbers when running the span parsers later.

Another thing to look out for are characters that are stripped away but which need to be taken into account, e.g. list item markers, indentation with spaces/tabs, blockquote markers, ...

jhund commented 11 years ago

I spent a bit of time looking into how we could do it and played with some updates to kramdown:

I agree that column level location could be tricky. However, there is a lot of value in having just line level location information. E.g., Nokogiri also provides just line level info for its Nodes and no column level.

When parsing IDMLStories, I also added the story name to the location information.

How to store location information

I think it's best if we store the location info on each Kramdown::Element instance. I could think of the following options:

I would implement the location as a Hash so that we are flexible in what we store there:

Collecting location information

As you mentioned, since kramdown parses block-level elements in a first pass, and span-level elements in a second pass, collecting location info needs to happen at multiple levels:

1. At :block parse time

We can add location info to :block elements as we create them. During the :block parsing stage (in #parse_blocks method), we have access to the global StringScanner instance @src and can compute line info from it (see below). We'd have to modify each block type's parse_<type> method to store the line number when adding the element. Here is for example how we would modify the :paragraph parser:

# lib/kramdown/parser/kramdown/paragraph.rb

def parse_paragraph
  # NEW: record the start line number before @src moves #pos
  start_line_number = @src.current_line_number
  result = @src.scan(PARAGRAPH_MATCH)
  while !@src.match?(self.class::PARAGRAPH_END)
    result << @src.scan(PARAGRAPH_MATCH)
  end
  result.chomp!
  if @tree.children.last && @tree.children.last.type == :p
    @tree.children.last.children.first.value << "\n" << result
  else
    # UPDATED: add :location option when creating kramdown element
    @tree.children << new_block_el(
      :p, nil, nil, { :location => { :line => start_line_number } }
    )
    result.lstrip!
    @tree.children.last.children << new_element(@text_type, result)
  end
  true
end

2. At :span parse time

When spans are parsed (:raw_text in #update_tree), we only have access to local StringScanners. So we only get the line offset within the parent block element. To get the correct global line number, we need to add a child span element's line number to it's parent block element's start line number.

We can do this if we replace all instances of Element.new with a factory method called new_element. In that factory method, we add the span element's local line number to the global @current_location[:line] i_var and store it on new :span level elements:

# lib/kramdown/parser/kramdown.rb
def new_element(*args)
  el = Element.new(*args)
  if :span == Element.category(el)
    el.options[:location] ||= {
      :line => @src.current_line_number + @current_location[:line] - 1
    }  if @src && @current_location && @current_location[:line]
  end
  el
end

The @current_location i_var is updated at the beginning of the #update_tree method for every element that already has :location information (would be :block level elements only):

# lib/kramdown/parser/kramdown.rb
def update_tree(element)
  @current_location = element.options[:location]  if element.options[:location]
  …
end

Patch Stringscanner

In order get a line number from StringScanner we'd patch it like so:

class StringScanner
  # Returns the line number for current charpos.
  # NOTE: Expects \n as line endings.
  def current_line_number
    string[0..charpos].count("\n") + 1
  end
end

What do you think of this approach?

gettalong commented 11 years ago

Looks good!

jhund commented 11 years ago

Is this something you'd like to add to kramdown? If so, should I submit a pull request?

The only remaining question is how we store the location Hash on Kramdown::Element:

  1. In the #options attribute or
  2. in a new #location attribute

Either way is fine for me. What's your preference?

gettalong commented 11 years ago

This would definitely add a benefit to kramdown, I would gladly accept a pull request. The only thing I have to make sure is that every change is backwards-compatible but that should be no problem.

Regarding the location hash: I would store it in #options.

jhund commented 11 years ago

Sounds good. I will work on a pull request (and include some tests, too)

gettalong commented 11 years ago

Thanks!

jhund commented 10 years ago

location information is now part of kramdown.