kingRodian / SpideR

MIT License

Parsing of XML #3

Closed kingRodian closed 7 years ago

kingRodian commented 7 years ago

Parsing HTML and XML is no joke, and I've been told that regexes don't hold up for this task. Ours clearly fail to separate content from URLs. Either we get a library to do this for us, or we implement a parser ourselves (not likely). libxml++, the C++ wrapper for libxml2, seems like a likely candidate, but its documentation is not good.
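To make the regex problem concrete, here is a minimal sketch of how a naive pattern goes wrong. It is not SpideR's actual code: `extract_hrefs_naive`, the pattern, and `kSamplePage` are all made up for illustration, and it assumes a C++11 compiler with `<regex>`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not taken from SpideR): collect href values with a
// naive regex that has no notion of tags, comments, or quoting styles.
std::vector<std::string> extract_hrefs_naive(const std::string& html) {
    // Matches href="..." anywhere in the text, double quotes only.
    static const std::regex href_re("href\\s*=\\s*\"([^\"]*)\"",
                                    std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());  // capture group 1 is the URL text
    }
    return urls;
}

// Made-up sample page: one live link, one link inside a comment,
// and one live link that uses single quotes.
const std::string kSamplePage =
    "<a href=\"/real\">Real</a>\n"
    "<!-- <a href=\"/commented-out\">Old</a> -->\n"
    "<a href='/single-quoted'>Quoted</a>\n";
```

Calling `extract_hrefs_naive(kSamplePage)` picks up the link inside the comment and misses the single-quoted one, so the crawler would follow a dead URL and skip a live one. Patching the pattern for each such case amounts to writing a parser piece by piece, which is the argument for using a real one.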

kingRodian commented 7 years ago

Using the gumbo-parser library for HTML parsing.