Import files & images to assets

sminnee commented 12 years ago

Any image or file content referenced within imported content should be pulled into the assets folder.

Most logical would be to keep the URL hierarchy of imported assets the same.

None of these assets should correspond to SilverStripe pages; only HTML assets should do that.

stojg commented 11 years ago

Any initial ideas on this?

Hardcoded (or option in CMS UI) list of extensions that should be treated as assets

or

Pull everything that isn't HTML or text (after inspecting the mime-type or HTTP header)

any other ideas?

sminnee commented 11 years ago

It sounds like @phptek is going to work on this.

There's a question of how abstract we go on this. On the one hand, from a "sitetree importer" perspective, files are fundamentally different from content and deserve their own separate sub-systems. On the other hand, others such as @aatch are trying to make this module less tightly coupled to the CMS and so perhaps we should see files as just a different kind of import.

My inclination would probably be to use the existing "import schema" system to create an import schema that focused on files. This should fit pretty well with the other work that is going on.

Extend StaticSiteUrlList to fetch the mime-type of each URL it spiders.
Currently an import schema is limited by URL reg-ex. Add an additional filter to this: the mime-types that are allowed. I would make the db field a Text field, where you enter one mime-type per line, and wildcards are supported - e.g. both image/* and image/jpeg would be valid entries.
StaticSiteContentSource::getSchemaForURL would need to have the mime-type passed.
The code for actually importing a file should be stored in an alternative to StaticSitePageTransformer - I guess we call it StaticSiteFileTransformer. That class would need to get referenced by StaticSiteImporter, but this overlaps with my suggestion to @aatch in #12, so maybe see what his likely timeline on this is and coordinate with him.
As long as you apply StaticSiteDataExtension to File and fill out its StaticSiteURL field, it shouldn't be too hard to amend StaticSiteRewriteLinksTask to create the file-URL shortcodes as needed, following the same pattern. Note that I'm not particularly happy with keeping that code just in a build-task, but refactoring that is out of scope of this ticket, I'd say.
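
A minimal sketch of the mime-type filter described above (one pattern per line, with * as a wildcard). The function name mimeMatchesRules() and the "empty field means no restriction" behaviour are assumptions made for illustration, not part of the module:

```php
<?php
// Hypothetical helper, not the module's actual API.
// $rules is the raw Text field: one mime-type pattern per line, "*" is a wildcard.
function mimeMatchesRules(string $mime, string $rules): bool
{
    // Split on any newline style, trim each line and drop empty ones.
    $lines = array_filter(array_map('trim', preg_split('/\R/', $rules)));

    // Assumption: an empty field places no restriction on mime-type.
    if (!$lines) {
        return true;
    }

    foreach ($lines as $rule) {
        // Escape the rule, then turn the escaped "*" back into a wildcard
        // that matches anything except a slash.
        $regex = '#^' . str_replace('\*', '[^/]*', preg_quote($rule, '#')) . '$#i';
        if (preg_match($regex, $mime) === 1) {
            return true;
        }
    }

    return false;
}
```

With rules of "image/*" and "application/pdf" on separate lines, image/png and application/pdf would match while text/css would not.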

sminnee commented 11 years ago

I just remembered another wrinkle here: currently StaticSiteCrawler only spiders HTML URLs, and that would need to be extended. We could spider everything; the list would get quite long but perhaps that's not so bad if the list is filtered at a later stage.

I would be inclined to have a whitelist of imported assets, otherwise you'll get a lot of JS/CSS too.

phptek commented 11 years ago

FYI having had a quick look at some of the logic, it appears that PHPCrawler has its own regex-based mime filtering in PHPCrawler#addContentTypeReceiveRule() which can be used to replace line StaticSiteUrlList.php:507

phptek commented 11 years ago

@sminnee I'm guessing there is a typo above:

"The code for actually importing a file should be stored in an alternative to ~~StaticSiteFileTransformer~~ StaticSitePageTransformer - I guess we call it ~~StaticSitePageTransformer~~ StaticSiteFileTransformer"

sminnee commented 11 years ago

Yep, that's a typo, I've edited my original comment to reduce confusion :-)

phptek commented 11 years ago

Status update

See: https://github.com/phptek/silverstripe-staticsiteconnector/tree/assets-import
Note: This is still very much a W.I.P and very likely won't work as-is:
StaticSiteUrlList now caches Mimes alongside URLs
Schemas' rules now have Mime-type input and mime post-processing logic
Module is now "File aware" with the correct extension applied to it in _config and its transformer is ref'd by StaticSiteImporter
Schemas can be fetched on a URL and URL+Mime basis

ToDo

Physically save files to SS' assets
Modify StaticSiteRewriteLinksTask to rewrite shortcode links
Test!

stojg commented 11 years ago

As a note on physically saving a file, I did a quick POC and ended up doing it like this (other ways were failing, weirdly enough)

 <?php
 // $url is assumed to hold the address of the remote file to download
 set_time_limit(0);
 // Local file that the download is written to
 $fp = fopen(dirname(__FILE__) . '/localfile.tmp', 'w+');
 // Remote file to download; spaces are replaced with %20
 $ch = curl_init(str_replace(" ", "%20", $url));
 curl_setopt($ch, CURLOPT_TIMEOUT, 50);
 curl_setopt($ch, CURLOPT_FILE, $fp); // write the curl response straight to the file
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
 curl_exec($ch); // perform the request
 curl_close($ch);
 fclose($fp);

Source: http://stackoverflow.com/questions/6409462/downloading-a-large-file-using-curl

phptek commented 11 years ago

Progress To Date

The URL processor fetches and caches both URLs and MimeTypes
Added a new MimeTypeProcessor class containing many utility functions to process and match mimes/extensions and SilverStripe class types
Added MimeTypes entry field on each import schema CMS UI with basic mime-type validation against default extensions found in SilverStripe's File class. Mimes can be comma, space or newline separated
Added a File-specific transformer class, which deals with both files and images. Both types are imported into their own folder in assets, for ease of manual post-processing
Added basic logging for imports in Dev mode
Updated README with info on Files and Images and added a TODO
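
For illustration, the comma, space or newline separated mime input mentioned above could be normalised roughly like this. The function name parseMimeList() is hypothetical, and the validation against the extensions in SilverStripe's File class is left out:

```php
<?php
// Hypothetical helper: splits the raw schema field into a clean list of mime-types.
function parseMimeList(string $input): array
{
    // Split on any run of whitespace (including newlines) and/or commas.
    $parts = preg_split('/[\s,]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);

    // Remove duplicates and re-index the result.
    return array_values(array_unique($parts));
}
```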

I have successfully crawled and imported a large site into SilverStripe, fetching pages, Excel spreadsheets and MS Word docs, but for reasons beyond the realms of human understanding, it is ignoring images and PDFs, and this is with File#validate() temporarily commented out. I tried overloading that in the module's File DataExtension = no dice.

Once this has been tidied up then the (admittedly messy) branch will be rebased properly. The link rewriting is the next task on the list, but still a wee way to go yet with this.

Note: Not sure why .xls s/sheets are being imported if ref's to them are not stored in the cache file... Note2: I am a cretin...ignore the above

phptek commented 11 years ago

Progress

As per the above and in addition:

All files of those Mime-Types inputted to each schema are imported into temp files, copied across to the 'assets' directory and appear in the CMS "Files" admin. Great.

However: At this point there is an issue with the images and some of the documents (PDFs etc) in that they don't seem to be encoded correctly or are corrupted.

Opening the CMS and attempting to list PNGs will yield a "PNG file corrupted by ASCII conversion" error from SilverStripe (actually from GD, which in turn gets it from libpng). This commonly occurs with an incorrect FTP transfer from source to server using ASCII transfer instead of binary. However, all the equivalent raw images from the crawled site obviously show A-OK in a browser. It's only after we fetch, parse and store them using curl etc. that we seem to get problems.

Further PNG specific investigation has gotten this far:

PNGs appear to be corrupt even before being written to the filesystem in StaticSiteContentExtractor#prepareTempFileForUpload()
All temp files do have data in them:

Right after curl_exec:

if($this->mime == 'image/png') {
    var_dump($fullResponseBody);
    die;
}

string(13315) "HTTP/1.1 200 OK
Date: Fri, 21 Jun 2013 23:55:53 GMT
Content-Type: image/png
...
Server: Microsoft-IIS/6.0
MicrosoftSharePointTeamServices: 12.0.0.6318
X-Powered-By: ASP.NET
...
Public-Extension: http://schemas.microsoft.com/repl-2
Content-Length: 12801
Connection: Keep-Alive

�PNG

�� IHDR��|2�#��tEXtSoftware�Adobe ImageReady....(etc)

Also tried stripping out \r\n from $fullResponseBody in StaticSiteContentExtractor#curlRequest() = no dice

After post-processing the headers and body returned from curl, $this->content in prepareTempFileForUpload() comes from SS_HTTPResponse->getBody()

var_dump($this->content)
<89>PNG
^Z
^@^@^@IHDR\
...(etc)

First byte appears correct for a PNG but the file utility disagrees:

#> file -i /tmp/tmpEZ6pAl
/tmp/tmpEZ6pAl: application/octet-stream; charset=binary
#> convert /tmp/tmpEZ6pAl /tmp/tmpEZ6pAl.png
convert: improper image header

// Double checking this is PNG-esque data in the temp file by reading it into less (yes it's binary) but we can see the first byte is \x89 followed by "PNG" (etc)

Additional, random ideas:

// in function prepareTempFileForUpload()
// This ends up just writing an empty file
if($this->mime == 'image/png') {
    $image = imagecreatefromstring($this->content);
    if(imagepng($image,$tmp_name)) {
        $this->setTmpFileName($tmp_name);
        imagedestroy($image);
        return true;
    }
}

If we dump the output of imagepng()

var_dump(imagepng($image,$tmp_name));
"imagecreatefromstring(): Data is not in a recognized format"

These do nothing to help (I had to check though!)

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);

If I get more time this w/e I'll have another crack but I suspect we may need to dive into some command-line fu and break out the hexeditors, or better just figure out exactly what is wrong with the binary data and preg_replace it.

phptek commented 11 years ago

I have sorted the issue. The culprit?

curl_setopt($ch, CURLOPT_HEADER, 1);

Set that to zero (for files; we still need it for text/html) and we're golden.
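
As a rough illustration of why header data in the response matters for binary files: the sketch below uses simulated strings only (no network) and a hypothetical hasPngSignature() helper. It shows one way header text can break an image, namely by ending up in front of the 8-byte PNG signature; it is not a claim about the exact mechanism inside the module.

```php
<?php
// The 8-byte signature every PNG file starts with.
const PNG_SIGNATURE = "\x89PNG\r\n\x1a\n";

// Hypothetical helper: true if $data begins with the PNG signature.
function hasPngSignature(string $data): bool
{
    return strncmp($data, PNG_SIGNATURE, 8) === 0;
}

// Simulated image bytes, and the same bytes as returned when response
// headers are included in the output (as with CURLOPT_HEADER enabled).
$imageBytes  = PNG_SIGNATURE . "fake image data";
$withHeaders = "HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n" . $imageBytes;

var_dump(hasPngSignature($withHeaders)); // header text comes first, so the check fails
var_dump(hasPngSignature($imageBytes));  // body only, so the check passes

// For real file downloads the equivalent setting would be:
// curl_setopt($ch, CURLOPT_HEADER, false);
```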

phptek commented 11 years ago

Re-tested and all image and document types are fetched, stored and "uploaded" correctly. Viewable A-OK in the CMS. Fixes pushed to assets-import branch.

I'll clean-up the branch, rebase and issue a PR when I'm back in the office next week.

phptek commented 10 years ago

This is now working A-OK in my project fork.

phptek commented 10 years ago

Apropos of a discussion somewhere: StaticSitePageTransformer and StaticSiteFileTransformer both extend from a common StaticSiteDataTypeTransformer, which contains all the logic common to both classes. There is also some scope for creating custom transformers that extend from it.

See: https://github.com/phptek/silverstripe-staticsiteconnector
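
A simplified sketch of that class layout. Only the three class names come from the discussion above; the methods getRelativePath() and transform(), and the assets/Import folder, are hypothetical stand-ins for whatever the real classes contain:

```php
<?php
// Shared base class holding logic common to all transformers.
abstract class StaticSiteDataTypeTransformer
{
    // Hypothetical shared helper: the URL's path without the leading slash.
    protected function getRelativePath(string $url): string
    {
        return ltrim((string) parse_url($url, PHP_URL_PATH), '/');
    }

    // Hypothetical hook: each transformer decides where an item ends up.
    abstract public function transform(string $url): string;
}

class StaticSitePageTransformer extends StaticSiteDataTypeTransformer
{
    public function transform(string $url): string
    {
        // Pages keep their place in the URL hierarchy.
        return 'page:' . $this->getRelativePath($url);
    }
}

class StaticSiteFileTransformer extends StaticSiteDataTypeTransformer
{
    public function transform(string $url): string
    {
        // Files land in a folder under assets (folder name is an assumption).
        return 'assets/Import/' . basename($this->getRelativePath($url));
    }
}
```

A custom transformer would extend StaticSiteDataTypeTransformer in the same way and supply its own transform().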
