lawrencewoodman / mida

A Microdata parser/extractor library for Ruby
http://lawrencewoodman.github.io/mida

allow some way to override the items selector? #24

Open xxx opened 6 years ago

xxx commented 6 years ago

Hi,

I'm currently running into an issue where the items in a marked-up document are not found, because they all include an itemprop attribute. This prevents them from being selected when Mida::Document#extract_items is called from the constructor, because it explicitly searches for //*[@itemscope and not(@itemprop)].

I'm pretty sure the reason for this is to make sure you're only grabbing the top level of items in a page, but as far as I can tell, it's valid to have an itemprop on a top-level object. On this page, for example, the top level VideoObject has itemprop="video" on it. This page is not rejected by any validator that I've tried on it so far.

I'm currently working around this with the following terrible monkeypatch:

module Mida
  class Document
    private

    def extract_items
      itemscopes = @doc.search('//*[@itemscope]')
      return nil unless itemscopes

      # strip out descendants - we only want the top level
      # (empty? works on a Nokogiri NodeSet without needing ActiveSupport's blank?)
      itemscopes = itemscopes.select do |item|
        item.ancestors('//*[@itemscope]').empty?
      end

      itemscopes.collect do |itemscope|
        itemscope = Itemscope.new(itemscope, @page_url)
        Item.new(itemscope)
      end
    end
  end
end

This works the way I want it to, but I think having to check the ancestors for each hit is horrible. I'm hoping there's a better way to do this.

It's not clear to me that this gem is even maintained anymore, but I'm still putting this up here in the hope that someone has a better idea, since this gem provides pretty much everything else I need. It's just this initial select that's problematic.