nathell / clj-tagsoup

A HTML parser for Clojure.

Add usage examples #14

Open ustun opened 9 years ago

ustun commented 9 years ago

Once the HTML is parsed, how can I most efficiently query the parsed document? That is, I would like to be able to drill down as if it were a map:

(get-in x [:html :head :title])

It would be great if you added some recommendations on how to do that transformation (for example, https://github.com/cjohansen/hiccup-find looks promising).

collinalexbell commented 9 years ago

Ditto this. As a Clojure noob, this vector thing confuses the hell out of me.

nathell commented 9 years ago

Thanks for chiming in!

A quick-and-dirty solution could be something along the lines of (untested, might be buggy):

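(defn get-in-html
  "Follows a path of tags down the tree, taking the first matching child
  at each step. Returns nil if some tag on the path is not found."
  [tree [tag & tags]]
  (if tag
    (when tree
      ;; (rest tree) skips the node's own tag; attribute maps and strings never equal a tag
      (recur (first (filter #(= (first %) tag) (rest tree))) tags))
    tree))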

Note that you'd want to call it as (get-in-html x [:head :title]), bypassing the :html, because the search starts among the children of the root element.
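For illustration, here is a small sketch of how such a call might look. The sample-tree below is written by hand and only assumes the [tag attributes & children] vector shape; it is not the output of an actual parse, and the function is repeated so the snippet can be pasted into a REPL on its own.

```clojure
;; Hand-written stand-in for a parsed document (an assumption, not real parser output).
;; Each element is a vector: [tag attribute-map & children].
(def sample-tree
  [:html {}
   [:head {}
    [:title {} "Hello"]]
   [:body {}
    [:p {:class "intro"} "First"]
    [:p {} "Second"]]])

;; Same function as above, repeated to keep the snippet self-contained.
(defn get-in-html [tree [tag & tags]]
  (if tag
    (when tree
      (recur (first (filter #(= (first %) tag) (rest tree))) tags))
    tree))

;; The path starts below the root, so :html is not part of it.
(get-in-html sample-tree [:head :title]) ;; => [:title {} "Hello"]

;; Only the first matching child is followed at each step.
(get-in-html sample-tree [:body :p])     ;; => [:p {:class "intro"} "First"]

;; A missing tag anywhere on the path yields nil.
(get-in-html sample-tree [:body :table]) ;; => nil

;; The text content sits after the tag and the attribute map.
(nth (get-in-html sample-tree [:head :title]) 2) ;; => "Hello"
```

The result is always a whole element vector, so pulling out text or attributes is a separate step.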

This is very simplistic and only supports seqs of tags. If you want to extract arbitrary subtrees, you may want to take a look at Enlive. (When I have free time, I intend to explore the possibility of integrating clj-tagsoup and Enlive, as I feel that both projects might benefit from this.)
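If pulling in Enlive feels like too much, a rough middle ground in plain Clojure might be to walk the whole tree with tree-seq and collect every element with a given tag, at any depth. This is not Enlive and not part of clj-tagsoup: find-all is a hypothetical helper, sample-tree is hand-written, and the sketch assumes every element has an attribute map in second position.

```clojure
;; Hand-written stand-in for a parsed document (an assumption, not real parser output).
(def sample-tree
  [:html {}
   [:head {}
    [:title {} "Hello"]]
   [:body {}
    [:p {:class "intro"} "First"]
    [:div {}
     [:p {} "Nested"]]]])

(defn find-all
  "Hypothetical helper: returns a lazy seq of every element in tree whose
  tag is `tag`, in depth-first document order."
  [tree tag]
  (filter #(and (vector? %) (= (first %) tag))
          ;; Vectors are branches; drop 2 skips the tag and the attribute map,
          ;; leaving the children (nested elements and strings).
          (tree-seq vector? #(drop 2 %) tree)))

(find-all sample-tree :p)
;; => ([:p {:class "intro"} "First"] [:p {} "Nested"])

(find-all sample-tree :table)
;; => ()
```

Unlike the path-based function, this finds matches regardless of where they sit in the tree, at the cost of visiting every node.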