sminnee / silverstripe-staticsiteconnector

Connector plugin for the SilverStripe External Content module that uses web scraping to import content.

Import files & images to assets #1

Open sminnee opened 11 years ago

sminnee commented 11 years ago

Any image or file content referenced within imported content should be pulled into the assets folder.

Most logical would be to keep the URL hierarchy of imported assets the same.

None of these assets should correspond to SilverStripe pages; only HTML assets should do that.
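To illustrate the "same URL hierarchy" idea, here is a minimal sketch of mapping a source URL onto a path under the assets folder. The function name `assetPathForUrl()` and the `assets/Import` base folder are assumptions made for this example; nothing like this exists in the module yet.

```php
<?php
/**
 * Hypothetical helper: map a source URL to a path under the assets folder,
 * keeping the directory hierarchy of the URL intact.
 */
function assetPathForUrl(string $url, string $assetsBase = 'assets/Import'): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        throw new InvalidArgumentException("URL has no path component: $url");
    }

    $segments = [];
    foreach (explode('/', rawurldecode($path)) as $segment) {
        // Drop empty segments and anything that could escape the assets folder
        if ($segment === '' || $segment === '.' || $segment === '..') {
            continue;
        }
        $segments[] = $segment;
    }
    if (count($segments) === 0) {
        throw new InvalidArgumentException("URL does not point at a file: $url");
    }

    return rtrim($assetsBase, '/') . '/' . implode('/', $segments);
}

echo assetPathForUrl('http://example.com/images/logos/site%20logo.png'), "\n";
```

Skipping the `..` segments is a deliberate choice here so that a hostile or odd URL cannot write outside the base folder.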

stojg commented 11 years ago

Any initial ideas on this?

or

any other ideas?

sminnee commented 11 years ago

It sounds like @phptek is going to work on this.

There's a question of how abstract we go on this. On the one hand, from a "sitetree importer" perspective, files are fundamentally different from content and deserve their own separate sub-systems. On the other hand, others such as @aatch are trying to make this module less tightly coupled to the CMS and so perhaps we should see files as just a different kind of import.

My inclination would probably be to use the existing "import schema" system to create an import schema that focused on files. This should fit pretty well with the other work that is going on.

sminnee commented 11 years ago

I just remembered another wrinkle here: currently StaticSiteCrawler only spiders HTML URLs, and that would need to be extended. We could spider everything; the list would get quite long but perhaps that's not so bad if the list is filtered at a later stage.

I would be inclined to have a whitelist of imported assets, otherwise you'll get a lot of JS/CSS too.
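A whitelist along those lines could be sketched as below. The constant `IMPORTABLE_MIME_PATTERNS`, the function `isImportableMime()` and the particular MIME types listed are all assumptions for illustration; the real list would be a project decision.

```php
<?php
// Hypothetical whitelist of MIME types worth importing as assets.
const IMPORTABLE_MIME_PATTERNS = [
    '#^image/(png|jpe?g|gif)$#',
    '#^application/pdf$#',
    '#^application/msword$#',
    '#^application/vnd\.ms-excel$#',
];

/**
 * Return true if a Content-Type header value matches the whitelist.
 */
function isImportableMime(string $contentType): bool
{
    // Strip parameters such as "; charset=utf-8" and normalise case before matching
    $mime = strtolower(trim(explode(';', $contentType, 2)[0]));

    foreach (IMPORTABLE_MIME_PATTERNS as $pattern) {
        if (preg_match($pattern, $mime) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(isImportableMime('image/png'));
var_dump(isImportableMime('text/css'));
```

Anything not matched (JS, CSS and so on) is simply skipped, so the spidered list can stay long while the imported list stays short.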

phptek commented 11 years ago

FYI having had a quick look at some of the logic, it appears that PHPCrawler has its own regex-based mime filtering in PHPCrawler#addContentTypeReceiveRule() which can be used to replace line StaticSiteUrlList.php:507

phptek commented 11 years ago

@sminnee I'm guessing there is a typo above:

"The code for actually importing a file should be stored in an alternative to ~~StaticSiteFileTransformer~~ StaticSitePageTransformer - I guess we call it ~~StaticSitePageTransformer~~ StaticSiteFileTransformer"

sminnee commented 11 years ago

Yep, that's a typo, I've edited my original comment to reduce confusion :-)

phptek commented 11 years ago

Status update

ToDo

stojg commented 11 years ago

As a note on physically saving a file, I did a quick POC and ended up doing it like this (other ways were failing, weirdly enough):

 <?php
 // $url is assumed to already hold the URL of the remote file
 set_time_limit(0);
 // Local file where the download is saved
 $fp = fopen(dirname(__FILE__) . '/localfile.tmp', 'w+');
 // The file we are downloading; replace spaces with %20
 $ch = curl_init(str_replace(' ', '%20', $url));
 curl_setopt($ch, CURLOPT_TIMEOUT, 50);
 curl_setopt($ch, CURLOPT_FILE, $fp); // write the curl response straight to the file
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
 curl_exec($ch); // perform the request
 curl_close($ch);
 fclose($fp);

Source: http://stackoverflow.com/questions/6409462/downloading-a-large-file-using-curl

phptek commented 11 years ago

Progress To Date

I have successfully crawled and imported a large site into SilverStripe, fetching pages, Excel spreadsheets and MS Word docs, but for reasons beyond the realms of human understanding it is ignoring images and PDFs. Even this much only works by temporarily commenting out File#validate(). I tried overloading that in the module's File DataExtension = no dice.

Once this has been tidied up then the (admittedly messy) branch will be rebased properly. The link rewriting is the next task on the list, but still a wee way to go yet with this.

Note: Not sure why .xls s/sheets are being imported if refs to them are not stored in the cache file...

Note 2: I am a cretin... ignore the above.

phptek commented 11 years ago

Progress

As per the above and in addition:

However: At this point there is an issue with the images and some of the documents (PDFs etc) in that they don't seem to be encoded correctly or are corrupted.

Opening the CMS and attempting to list PNGs yields a "PNG file corrupted by ASCII conversion" error from SilverStripe (actually from GD, which in turn gets it from libpng). This commonly occurs when files are transferred from source to server over FTP in ASCII mode instead of binary. However, all the equivalent raw images on the crawled site show A-OK in a browser. It's only after we fetch, parse and store them using curl etc. that we seem to get problems.

Further PNG specific investigation has gotten this far:

Right after curl_exec:

if($this->mime == 'image/png') {
    var_dump($fullResponseBody);
    die;
}

string(13315) "HTTP/1.1 200 OK
Date: Fri, 21 Jun 2013 23:55:53 GMT
Content-Type: image/png
...
Server: Microsoft-IIS/6.0
MicrosoftSharePointTeamServices: 12.0.0.6318
X-Powered-By: ASP.NET
...
Public-Extension: http://schemas.microsoft.com/repl-2
Content-Length: 12801
Connection: Keep-Alive

�PNG

 ��� IHDR�����������|2�#���tEXtSoftware�Adobe ImageReady....(etc)

Also tried stripping out \r\n from $fullResponseBody in StaticContentExtractor#curlRequest() = no dice

After post-processing the headers and body returned from curl, $this->content in prepareTempFileForUpload() comes from SS_HTTPResponse->getBody():

var_dump($this->content)
<89>PNG
^Z
^@^@^@IHDR\
...(etc)

First byte appears correct for a PNG but the file utility disagrees:

#> file -i /tmp/tmpEZ6pAl
/tmp/tmpEZ6pAl: application/octet-stream; charset=binary
#> convert /tmp/tmpEZ6pAl /tmp/tmpEZ6pAl.png
convert: improper image headerv

// Double checking this is PNG-esque data in the temp file by reading it into less (yes it's binary) but we can see the first byte is \x89 followed by "PNG" (etc)

Additional, random ideas:

// in function prepareTempFileForUpload()
// This ends up just writing an empty file
if($this->mime == 'image/png') {
    $image = imagecreatefromstring($this->content);
    if(imagepng($image,$tmp_name)) {
        $this->setTmpFileName($tmp_name);
        imagedestroy($image);
        return true;
    }
}

If we dump the output of imagepng()

var_dump(imagepng($image,$tmp_name));
"imagecreatefromstring(): Data is not in a recognized format"

These do nothing to help (I had to check though!)

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);

If I get more time this w/e I'll have another crack but I suspect we may need to dive into some command-line fu and break out the hexeditors, or better just figure out exactly what is wrong with the binary data and preg_replace it.

phptek commented 11 years ago

I have sorted the issue. The culprit?

curl_setopt($ch, CURLOPT_HEADER, 1);

Set that to zero (for files; we still need it for text/html) and we're golden.
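For context: with CURLOPT_HEADER on, the string returned by curl_exec() has the HTTP headers in front of the body, so the body has to be separated byte-for-byte before it is written to disk. If the headers are still wanted for files, one alternative to switching the option off is to cut the response at the size curl reports via curl_getinfo($ch, CURLINFO_HEADER_SIZE). The sketch below uses a simulated response rather than a live request, and `splitRawResponse()` is a hypothetical helper, not part of the module.

```php
<?php
/**
 * Hypothetical helper: split a raw response fetched with CURLOPT_HEADER = 1.
 * $headerSize is the value curl_getinfo($ch, CURLINFO_HEADER_SIZE) would report.
 */
function splitRawResponse(string $raw, int $headerSize): array
{
    return [
        'headers' => (string) substr($raw, 0, $headerSize),
        'body'    => (string) substr($raw, $headerSize),
    ];
}

// Simulated response: headers followed by the 8-byte PNG signature and some payload
$headers      = "HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n";
$pngSignature = "\x89PNG\r\n\x1a\n";
$raw          = $headers . $pngSignature . "rest-of-image-bytes";

$parts = splitRawResponse($raw, strlen($headers));

// The body now starts with the PNG signature, untouched
var_dump(substr($parts['body'], 0, 8) === $pngSignature);
```

Cutting by length avoids any string replacement on the body, which matters because the PNG signature itself contains \r\n.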

phptek commented 11 years ago

Re-tested and all image and document types are fetched, stored and "uploaded" correctly. Viewable A-OK in the CMS. Fixes pushed to assets-import branch.

I'll clean-up the branch, rebase and issue a PR when I'm back in the office next week.

phptek commented 10 years ago

This is now working A-Ok in my project fork.

phptek commented 10 years ago

Apropos of a discussion somewhere: StaticSitePageTransformer and StaticSiteFileTransformer both extend a common StaticSiteDataTypeTransformer, which contains all the logic common to the two classes. There is also some scope for creating custom transformers that extend it.

See: https://github.com/phptek/silverstripe-staticsiteconnector
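As a rough picture of that hierarchy, here is a simplified stand-in. Only the three class names come from the comment above; the methods `transform()` and `dataType()` and their bodies are hypothetical and are not the module's actual API.

```php
<?php
// Sketch only: simplified stand-ins for the classes named above.
abstract class StaticSiteDataTypeTransformer
{
    // Logic shared by every transformer would live in the base class
    public function transform(string $url, string $content): array
    {
        return [
            'type'  => $this->dataType(),
            'url'   => $url,
            'bytes' => strlen($content),
        ];
    }

    // Each concrete transformer says what kind of data it handles
    abstract protected function dataType(): string;
}

class StaticSitePageTransformer extends StaticSiteDataTypeTransformer
{
    protected function dataType(): string
    {
        return 'page';
    }
}

class StaticSiteFileTransformer extends StaticSiteDataTypeTransformer
{
    protected function dataType(): string
    {
        return 'file';
    }
}

$result = (new StaticSiteFileTransformer())->transform('http://example.com/a.pdf', 'abc');
print_r($result);
```

A custom transformer would follow the same pattern: extend the base class and override only what differs.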