m4rw3r / chomp

A fast monadic-style parser combinator designed to work on stable Rust.
Apache License 2.0

HTML extract link parser. #52

Closed LeMoussel closed 7 years ago

LeMoussel commented 8 years ago

Do you think it is possible to extract the attributes of links (<a> elements) from an HTML document? If so, could you write or explain an example parser?

m4rw3r commented 8 years ago

Could you explain in more detail what kind of data you have as input and what kind of output you expect? In general, if you have HTML and care about edge cases (e.g. not picking up links from comments or other places where text can look like an <a> tag), you will need an HTML parser and then filter out the interesting tags and attributes. Writing an HTML parser that covers every edge case is a fairly complex task.
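To illustrate the comment edge case, here is a minimal sketch that uses only the Rust standard library (no chomp). The helper name `naive_anchor_offsets` is made up for this illustration. It does a plain substring search, which is roughly what a tag-only approach amounts to when it knows nothing about the rest of the document:

```rust
/// Hypothetical helper for illustration only (not part of chomp):
/// returns the byte offset of every literal "<a " in the input.
fn naive_anchor_offsets(html: &str) -> Vec<usize> {
    html.match_indices("<a ").map(|(i, _)| i).collect()
}

fn main() {
    // The first link is commented out, the second one is real.
    let html = "<!-- <a href=\"old.html\">old</a> --><p><a href=\"new.html\">new</a></p>";

    // The naive scan reports both, because it has no notion of comments.
    for offset in naive_anchor_offsets(html) {
        println!("candidate at byte {}: {}", offset, &html[offset..]);
    }
}
```

Telling the two hits apart requires tracking at least comment boundaries, which is the first step towards a real HTML parser.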

LeMoussel commented 8 years ago

I don't need an HTML parser that handles every edge case, just one for the <a> element, as described in the W3C document, e.g. <A href="#section2" id="test" ... some other attributes>Some stuff</A>. And, if possible, it should find all attributes of the <a> element (href, id, name, rel, ...).

m4rw3r commented 8 years ago

@LeMoussel Something like this should parse a subset of <a>-tags and their attributes:

#[macro_use]
extern crate chomp;

use std::collections::HashMap;
use std::hash::Hash;
use chomp::prelude::*;
use chomp::ascii::{skip_whitespace, is_whitespace};

#[derive(Debug, Default, PartialEq)]
pub struct Anchor<B: Buffer + Eq + Hash> {
    pub attributes: HashMap<B, Option<B>>,
}

pub fn anchor<'a, I: U8Input<Buffer=&'a [u8]>>(i: I)
  -> SimpleResult<I, Anchor<I::Buffer>> {
    parse!{i;
                token(b'<');
                token(b'a');
        // Utilize the fact that many is based on FromIterator
        let a = many(attr);
                skip_whitespace();
                token(b'>');

        ret Anchor { attributes: a }
    }
}

fn attr<I: U8Input>(i: I) -> SimpleResult<I, (I::Buffer, Option<I::Buffer>)> {
    parse!{i;
                    satisfy(is_whitespace);
        let key   = take_while1(|c| match c {
            b'='  => false,
            b' '  => false,
            b'\t' => false,
            b'>'  => false,
            _     => true,
        });
        let eq    = peek_next();
        let value = i -> if eq == b'=' {
            any(i).then(attr_value).map(Some)
        } else {
            i.ret(None)
        };

        ret (key, value)
    }
}

fn attr_value<I: U8Input>(i: I) -> SimpleResult<I, I::Buffer> {
    parse!{i;
        let quote = peek_next();
        i -> if quote == b'"' || quote == b'\'' {
            parse!{i; token(quote) >> take_while(|c| c != quote) <* token(quote) }
        } else {
            take_while(i, |c| match c {
                b' '  => false,
                b'\t' => false,
                b'>'  => false,
                _     => true,
            })
        }
    }
}

fn main() {
    let a = parse_only(anchor, b"<a href=\"http://www.example.com\">Test</a>");

    for (k, v) in a.unwrap().attributes.iter() {
        println!("{} = {:?}", String::from_utf8_lossy(k), v.map(String::from_utf8_lossy));
    }
}

Note that this ignores everything after the > character of the opening tag, and it requires the input to start exactly at the beginning of a tag. The code above will most likely also need adjustments after a more thorough reading of the spec.
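Since the parser has to start exactly at a tag, one way to apply it to a whole document is to first locate candidate tag starts and then run the parser on the slice beginning at each one. The following is a rough, std-only sketch of that first step. `anchor_tag_starts` is a hypothetical helper, not part of chomp, and it does not skip comments or scripts:

```rust
/// Hypothetical helper (not part of chomp): returns every offset where the
/// bytes look like the start of an opening <a> tag, i.e. '<', then 'a' or 'A',
/// then either ASCII whitespace or '>'.
fn anchor_tag_starts(input: &[u8]) -> Vec<usize> {
    input
        .windows(3)
        .enumerate()
        .filter(|(_, w)| {
            w[0] == b'<'
                && w[1].eq_ignore_ascii_case(&b'a')
                && (w[2].is_ascii_whitespace() || w[2] == b'>')
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let doc = b"<p>See <a href=\"#section2\" id=\"test\">this</a> or <A name=top>that</A>.</p>";

    for start in anchor_tag_starts(doc) {
        // With chomp in scope, this is where one would try something like
        // parse_only(anchor, &doc[start..]) on each candidate.
        println!("{}: {}", start, String::from_utf8_lossy(&doc[start..]));
    }
}
```

The scan is case-insensitive, but the `anchor` parser above uses `token(b'a')` and so only accepts a lowercase `a`. It would reject the `<A ...>` candidate unless it is adjusted to accept both cases.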

LeMoussel commented 8 years ago

Thanks for your help. I'll test it.