sminnee / silverstripe-staticsiteconnector

Connector plugin for the SilverStripe External Content module that uses web scraping to import content.
8 stars 12 forks source link

RFC: Weaken dependency on the CMS #12

Open Aatch opened 11 years ago

Aatch commented 11 years ago

Use Case

At PocketRent, we want to crawl a remote site and extract data from it. However, our use case is for building a database, rather than importing remote content to the CMS. In fact, we don't even have the CMS module in our code base. Thus, being able to simply remove the CMS-specific code would allow us to use this module without pulling in the entire CMS module to do so, we would merely delete the unused files.

Proposal

Refactor the code to move all of the crawling, extracting and storing functionality into it's own separate sub-system, then re-implement the current CMS functionality on top of it.

Details

The current code, from what I can see, is fairly separated already. This means that the refactoring would mostly consist of making the separation explicit.

A second, lower-level, public API would also be created. This lower-level API would expose the crawling, extraction and storage functionality in a way the allows developers to implement their own control mechanisms. This API would also be (almost) entirely self-contained, relying only on phpQuery.

No current functionality would be significantly changed, this is merely a refactoring project that allows for more generic use of the existing code.

sminnee commented 11 years ago

Hi Aatch,

The current code is fairly tightly bound to the externalcontent module and I don't think it would make much sense to separate that. The externalcontent module hopefully doesn't require the CMS and if not, I would probably attack the CMS dependency there. There's no reason that externalcontent needs CMS - it should be usable as a general connection between the SilverStripe ORM and 3rd party data sources. The one thing that is baked into it is the notion of hierarchical data; perhaps this can be resolved, but I would suggest resolving it within externalcontent.

I'm concerned about your comments "we'll create another lower level API", and although I wouldn't necessarily say it's a bad idea until I see actual code, I'm not hearing a great advantage for most users of this module, and instead additional complexity.

sminnee commented 11 years ago

Another way of looking at this is "what impact does the URL have on the imported data?"

Right now, this is hard-coded - the URL is split by token, the missing URLs are filled out, and the parent URL is used as a starting point, and StaticSiteContentItem::stageChildren() is used to recursively import child nodes at each point.

I think for an importer based on URLs, maintaining the existence of a hierarchy is hard to get away from and I'd expect any attempts to do so to end up as a separate project.

Given this, I wouldn't expect the code behind StaticSiteContentItem to change much: it's basically just ties a URL hierarchy into the external content system off the back of StaticSiteUrlList.

The meat of your refactoring would then probably happen in StaticSitePageTransformer and StaticSiteImporter. The system already allows for the possibility that different data-objects are imported, we just happen to have hard-coded getExternalType()'s value to "sitetree".

In summary, the refactoring could involve the following:

sminnee commented 11 years ago

OK, so I've had a look at what happens if you install staticsiteconnector and external-content without the framework, and the good news is that it doesn't look insurmountable:

If you resolve those 3 things then you can use the external content system without CMS.

Aatch commented 11 years ago

@sminnee I haven't taken a proper look at external-content, so I'll take a closer look at that. It seems it's more generic than I though. It also seems that it may be more coherent to change external-content to support our use case, and just remove any duplicated logic.

I'll check it out properly and amend my proposal.

sminnee commented 11 years ago

Cool, in general where looking at external-content as a pretty extensible tool for connecting 3rd party systems through a common interface, so it makes sense for us to enrichen it for your needs.

phptek commented 11 years ago

@aatch Have you had any further thoughts on StaticSiteDataTransformer as alluded-to by @sminnee ? Only it kinda goes against the ideas put forward in #1 unless specific forks in the logic were catered for, for different "classes" of content e.g. "Page" ("SiteTree") and "File".

At this early stage of my own dev, I'm rather more keen on a staged approach - implement a close-copy of StaticSitePageTransformer as StaticSiteFileTransformer and see what kind of issues we encounter with that before abstracting too far/much.

Thoughts?

I'm obviously keen to avoid duplication of effort seeing as we seem to be working on similar functionality. Feel free to contact me.

sminnee commented 11 years ago

@phptek my idea was that StaticSiteDataTransformer was basically a common base class of StaticSitePageTransformer and StaticSiteFileTransformer. If there's not a lot of code, then it's probably not necessary, but I suspected there might be some cross-over.