kschiess / parslet

A small PEG based parser library. See the Hacking page in the Wiki as well.
kschiess.github.com/parslet
MIT License

Request for info #178

Closed vijaynaidu closed 7 years ago

vijaynaidu commented 7 years ago

Hi @kschiess, thanks for the cool plugin. I'm trying to understand how to actually use Parslet for my use case. Can you help me with the idea/syntax for the following cases?

CASE1:

Input:

"Band structure "a,b" of graphite,"

Expected output:

{
  :pre=>"\"", 
  :content=> "Band structure \"a,b\" of graphite", 
  :post=> ",\""
}

CASE2:

Input: pp 211–220,.

Expected output:

{
  :pre=>'pp ',
  :firstpage=>211,
  :sep=>'–',
  :lastpage=>220,
  :post=>',.'
}

Thanks

kschiess commented 7 years ago

This is not how it works. This is open source. In general, when you ask for help, I would expect that you show me the code that you've written (your attempt) and tell me what you expect. I'd then point out where we have different assumptions.

I currently don't have the time to help you with this. Maybe try Stack Overflow? Also: The examples directory here has quite a bit of parslet code for you to peruse. Good luck!

vijaynaidu commented 7 years ago

@kschiess Sorry, I apologise for my mistake :( :+1: Thanks for your reply. Sure, I'll try to get help from other sources.

I hope someone finds this attempt helpful. I'm using Parslet to parse text, and it works well for segmenting page numbers, i.e. CASE 2. But I have no idea how to do the same for CASE 1, i.e. parsing the title.

require 'parslet'
require 'pp'

page = 'pp. S170–S177.'

# Splits a page reference such as 'pp. S170–S177.' into an optional label,
# the page range, and any trailing punctuation.
class PageParse < Parslet::Parser
    root(:page_exp)

    # Basic tokens; each `?` variant is the optional form of the rule above it.
    rule(:space) { match('\s').repeat(1) }
    rule(:space?) { space.maybe }

    rule(:dot) { str('.').repeat(1) }
    rule(:dot?) { dot.maybe }

    rule(:comma) { str(',').repeat(1) }
    rule(:comma?) { comma.maybe }

    rule(:alphabet) { match('[A-Za-z]').repeat(1) }
    rule(:alphabet?) { alphabet.maybe }

    rule(:integer) { match('[0-9]').repeat(1) }
    rule(:integer?) { integer.maybe }

    rule(:alpha_numeric) { (alphabet | integer).repeat(1) }
    rule(:alpha_numeric?) { alpha_numeric.maybe }

    # Longest alternative first: PEG choice takes the first branch that matches.
    rule(:page_label_names) { str('page') | str('pp') | str('p') }
    rule(:page_label_names?) { page_label_names.maybe }

    # Leading label ('pp. ') and trailing punctuation (',.').
    rule(:page_label) { space? >> page_label_names? >> dot? >> space? }
    rule(:page_end_boundary) { space? >> comma? >> dot? >> space? }
    rule(:page_end_boundary?) { page_end_boundary.maybe }

    rule(:page_no) { alpha_numeric }
    rule(:page_no?) { page_no.maybe }

    # Accepts an ASCII hyphen or an en dash between the page numbers.
    rule(:page_separator) { str('-').repeat(1) | str('–').repeat(1) }
    rule(:page_separator?) { page_separator.maybe }

    rule(:page_content) { page_no?.as(:first_page) >> page_separator?.as(:separator) >> page_no?.as(:last_page) }

    rule(:page_exp) { page_label.maybe.as(:match_pre) >> page_content.maybe.as(:pages) >> page_end_boundary.as(:match_post) }
end

def parse(page)
    PageParse.new.parse(page)
rescue Parslet::ParseFailed => failure
    # On failure, print the error tree; the method then returns nil.
    puts failure.parse_failure_cause.ascii_tree
end

pp parse(page)

kschiess commented 7 years ago

Hi,

I've taken a quick look after all. I have a hard time understanding the syntax that underlies this 'title' thing. Apparently, it nests '"' without escaping, so a parser would have to keep reading balanced '"' until it finds the last one in the document? Maybe your difficulty in parsing this comes from the underlying grammar being underdefined.

Maybe that helps? kaspar
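
One way to pin the grammar down, sketched under the assumption that the title always spans the whole input: only the first `"` opens, only a `"` at the very end of the input (together with any punctuation directly before it) closes, and every quote in between counts as content. The sketch below uses a plain Ruby regular expression rather than Parslet, and `split_title` and `TITLE` are hypothetical names chosen for illustration, not part of any library.

```ruby
# Hypothetical helper, not part of Parslet.
# Assumption: the title spans the whole input, so the closing quote is the
# one at the very end; unescaped inner quotes are treated as content.
TITLE = /\A(?<pre>")(?<content>.*?)(?<post>[,.;]*")\z/m

def split_title(text)
  m = TITLE.match(text)
  return nil unless m # input does not follow the assumed shape

  { pre: m[:pre], content: m[:content], post: m[:post] }
end

result = split_title('"Band structure "a,b" of graphite,"')
```

The lazy `.*?` keeps the content as short as possible, so trailing punctuation ends up in `:post` instead of `:content`. In Parslet the same rule could probably be written with a negative lookahead (`absent?`) on the closing sequence, but that only works once the grammar states which quote is the last one.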