sminnee / silverstripe-staticsiteconnector

Connector plugin for the SilverStripe External Content module that uses web scraping to import content.

RFC: Post-processing extracted data. #13

Open Aatch opened 11 years ago

Aatch commented 11 years ago

NOTE: This relies on #12

Use Case

Extraction from web pages is understandably limited to the level of individual elements. There is a use case for further processing of the data within an extracted element. A trivial example is stripping presentational markup from an otherwise plain block of text.

<div class="extractme">
  <font color="magenta">HI</font>
  <b>I am a block of poorly marked-up content</b><br />
  </hr>
  <p>But I might exist on the web</p>
</div>

If I only care about the text of that block, then I would like to remove all the extraneous tags.

Proposal

Extend the API from #12 to provide hooks that allow arbitrary post-processing of the data.

Details

This would just introduce a layer between the extraction and storage stages. That said, this functionality could very well drop out naturally during the refactoring proposed in #12.
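As a rough illustration of such a layer, the sketch below runs extracted content through an ordered list of callables before it would be stored. The function name `runPostProcessors` and the hook list are hypothetical and not part of the module or of #12; `strip_tags` is plain PHP and stands in for whatever cleaning a real hook would do.

```php
<?php
// Hypothetical helper: applies each hook in order to the extracted content.
function runPostProcessors($content, array $hooks) {
    foreach ($hooks as $hook) {
        $content = call_user_func($hook, $content);
    }
    return $content;
}

// The poorly marked-up block from the use case above.
$extracted = <<<'HTML'
<div class="extractme">
  <font color="magenta">HI</font>
  <b>I am a block of poorly marked-up content</b><br />
  </hr>
  <p>But I might exist on the web</p>
</div>
HTML;

$hooks = array(
    'strip_tags',                                                    // drop all tags
    function ($text) { return preg_replace('/\s+/', ' ', $text); },  // collapse whitespace
    'trim',                                                          // remove leading/trailing space
);

// This is the value that would be handed on to the storage stage.
$stored = runPostProcessors($extracted, $hooks);
echo $stored, "\n";
```

Because the hooks are ordinary callables, the extraction and storage code would not need to know what any individual hook does.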

sminnee commented 11 years ago

Currently the following code in StaticSiteContentExtractor does a part of what you're describing:

            if($extractionRule['excludeselectors']) {
                foreach($extractionRule['excludeselectors'] as $excludeSelector) {
                    $element = $this->phpQuery[$extractionRule['selector'].' '.$excludeSelector];
                    if($element) {
                        $remove = $element->htmlOuter();
                        $content = str_replace($remove, '', $content);
                    }
                }
            }

            if($content) {
                if(!empty($extractionRule['plaintext'])) {
                    $content = Convert::html2raw($content);
                }

                $output[$field] = $extractionRule;
                $output[$field]['content'] = $content;
                $this->log("Found value for $field");
                break;
            }

In particular, it deletes inappropriate content by selector. My inclination would be to take an approach similar to the URL rewriting. Create an interface such as StaticSiteContentProcessor that takes a piece of HTML content and returns processed HTML content. Provide some mechanism for passing in arguments (such as excludeselectors). In the simplest implementation, you could pass in all the arguments of StaticSiteContentSource_ImportRule, and require that any additional metadata fields are added to that class. Not as extensible, but easier to write in the first instance.
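A minimal sketch of what that interface might look like, assuming a single `process()` method. Only the interface name comes from this discussion; the method signature, the `PlainTextProcessor` class, and the use of a plain array in place of a StaticSiteContentSource_ImportRule record are assumptions for illustration. `strip_tags` stands in for the `Convert::html2raw` call in the existing code so the example runs without SilverStripe.

```php
<?php
// Hypothetical signature: HTML in, processed HTML out, plus the rule's arguments.
interface StaticSiteContentProcessor {
    /**
     * @param string $html Extracted HTML for one field
     * @param array  $rule Stand-in for the fields of an import rule
     * @return string Processed content
     */
    public function process($html, array $rule);
}

// Hypothetical implementor mirroring the existing 'plaintext' handling.
class PlainTextProcessor implements StaticSiteContentProcessor {
    public function process($html, array $rule) {
        if (empty($rule['plaintext'])) {
            return $html; // rule does not ask for plain text, pass through
        }
        return trim(strip_tags($html));
    }
}

$rule = array('selector' => '.extractme', 'plaintext' => true);
$processor = new PlainTextProcessor();
$out = $processor->process('<b>Hello</b> <i>world</i>', $rule);
echo $out, "\n";
```

Passing the whole rule keeps the interface small, at the cost that every new processor option has to become a field on the rule.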

Then amend the GUI so that it lets users select one of the implementors of StaticSiteContentProcessor for each imported field - there would probably be a ContentProcessor argument as a field of StaticSiteContentSource_ImportRule.

I would probably set it up so that the ContentProcessor was passed a phpQuery object and returned a manipulated phpQuery object. I would also refactor the current excludeselectors code into an initial StaticSiteContentProcessor object.
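To make the object-in, object-out idea concrete, here is a sketch that uses PHP's built-in DOMDocument as a stand-in for the phpQuery object, so it runs without extra libraries. The names `DomContentProcessor` and `ExcludeNodesProcessor` are hypothetical, and the exclusions are written as XPath expressions rather than the CSS selectors that excludeselectors uses today; a phpQuery-based version would presumably accept the selectors directly.

```php
<?php
// Hypothetical interface: takes a parsed document and returns the manipulated document.
interface DomContentProcessor {
    public function process(DOMDocument $doc);
}

// Hypothetical processor covering the role of the current excludeselectors code.
class ExcludeNodesProcessor implements DomContentProcessor {
    private $expressions;

    public function __construct(array $expressions) {
        $this->expressions = $expressions;
    }

    public function process(DOMDocument $doc) {
        $xpath = new DOMXPath($doc);
        foreach ($this->expressions as $expression) {
            // Snapshot the matches into an array before removing them.
            foreach (iterator_to_array($xpath->query($expression)) as $node) {
                $node->parentNode->removeChild($node);
            }
        }
        return $doc;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="extractme"><p>Keep me</p><p class="ad">Drop me</p></div>');

$processor = new ExcludeNodesProcessor(array('//p[@class="ad"]'));
$processed = $processor->process($doc);

$result = trim($processed->getElementsByTagName('div')->item(0)->textContent);
echo $result, "\n";
```

Working on the parsed document avoids the `str_replace` on serialized HTML in the current code, which depends on the removed fragment appearing byte-for-byte in `$content`.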