sminnee opened this issue 12 years ago
Any initial ideas on this, or any other ideas?
It sounds like @phptek is going to work on this.
There's a question of how abstract we go on this. On the one hand, from a "sitetree importer" perspective, files are fundamentally different from content and deserve their own separate sub-systems. On the other hand, others such as @aatch are trying to make this module less tightly coupled to the CMS and so perhaps we should see files as just a different kind of import.
My inclination would probably be to use the existing "import schema" system to create an import schema that focused on files. This should fit pretty well with the other work that is going on.
Roughly, the pieces would be:

- StaticSiteUrlList would need to be extended to fetch the mime-type of each URL it spiders.
- The import schema would need a Text field where you enter one mime-type per line, and wildcards are supported - i.e. both image/* or image/jpeg (see the wildcard-matching sketch below).
- StaticSiteContentSource::getSchemaForURL would need to have the mime-type passed.
- The code for actually importing a file should be stored in an alternative to StaticSitePageTransformer - I guess we call it StaticSiteFileTransformer. That class would need to get referenced by StaticSiteImporter, but this overlaps with my suggestion to @aatch in #12, so maybe see what his likely timeline on this is and coordinate with him.
- If we apply StaticSiteDataExtension to File and fill out its StaticSiteURL field, it shouldn't be too hard to amend StaticSiteRewriteLinksTask to create the file-URL shortcodes as needed, following the same pattern. Note that I'm not particularly happy with keeping that code just in a build-task, but refactoring that is out of scope of this ticket, I'd say.

I just remembered another wrinkle here: currently StaticSiteCrawler only spiders HTML URLs, and that would need to be extended. We could spider everything; the list would get quite long but perhaps that's not so bad if the list is filtered at a later stage.
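A minimal sketch of what that wildcard matching could look like, assuming the schema exposes its mime whitelist as a newline-separated string; the helper name and field are hypothetical, only the one-per-line / wildcard convention comes from the list above:

<?php
// Hypothetical helper: does $mime match any pattern in the schema's mime whitelist?
// Patterns are one per line and may use wildcards, e.g. "image/*" or "image/jpeg".
function mimeMatchesSchema($mime, $schemaMimeTypes) {
    foreach (preg_split('/\R+/', trim($schemaMimeTypes)) as $pattern) {
        // Turn the wildcard pattern into a regex, e.g. image/* => #^image/.*$#i
        $regex = '#^' . str_replace('\*', '.*', preg_quote(trim($pattern), '#')) . '$#i';
        if (preg_match($regex, $mime)) {
            return true;
        }
    }
    return false;
}

// e.g. mimeMatchesSchema('image/jpeg', "image/*\napplication/pdf") returns true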
I would be inclined to have a whitelist of imported assets, otherwise you'll get a lot of JS/CSS too.
FYI, having had a quick look at some of the logic, it appears that PHPCrawler has its own regex-based mime filtering in PHPCrawler#addContentTypeReceiveRule(), which could be used to replace the logic at StaticSiteUrlList.php:507
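For reference, a quick sketch of that PHPCrawler rule in use; the URL and patterns are illustrative only. addContentTypeReceiveRule() takes a PCRE pattern matched against each response's Content-Type header, so the crawler only downloads the types you list (in this module the calls would live wherever StaticSiteCrawler, the PHPCrawler subclass, is configured):

<?php
$crawler = new PHPCrawler();
$crawler->setURL('http://example.org/');
// Only receive responses whose Content-Type matches one of these patterns
$crawler->addContentTypeReceiveRule('#text/html#');
$crawler->addContentTypeReceiveRule('#image/(png|jpeg|gif)#');
$crawler->addContentTypeReceiveRule('#application/pdf#');
$crawler->go();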
@sminnee I'm guessing there is a typo above:
"The code for actually importing a file should be stored in an alternative to StaticSiteFileTransformer StaticSitePageTransformer - I guess we call it StaticSitePageTransformer StaticSiteFileTransformer"
Yep, that's a typo, I've edited my original comment to reduce confusion :-)
- StaticSiteUrlList now caches Mimes alongside URLs
- StaticSiteImporter
- StaticSiteRewriteLinksTask to rewrite shortcode links

As a note on physically saving a file, I did a quick POC and ended up doing it like this (other ways were failing, weirdly enough):
<?php
set_time_limit(0);

// Local file where we save the downloaded data
$fp = fopen(dirname(__FILE__) . '/localfile.tmp', 'w+');

// $url is the remote file we are downloading; replace spaces with %20
$ch = curl_init(str_replace(' ', '%20', $url));

curl_setopt($ch, CURLOPT_TIMEOUT, 50);
curl_setopt($ch, CURLOPT_FILE, $fp); // write curl response directly to the file
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch); // run the request
curl_close($ch);
fclose($fp);
Source: http://stackoverflow.com/questions/6409462/downloading-a-large-file-using-curl
Import schemas can now be created for the File class. Mimes can be comma, space or newline separated (see the splitting sketch below).

I have successfully crawled and imported a large site into SilverStripe, fetching pages, Excel spreadsheets and MS Word docs, but for reasons beyond the realms of human understanding it is ignoring images and PDFs; I got around this by temporarily commenting out File#validate(). I tried overloading that in the module's File DataExtension = no dice.
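For what it's worth, splitting a mime list like that is straightforward; a tiny sketch, where $mimes stands in for the raw schema field value (hypothetical variable name):

<?php
// Split a "comma, space or newline separated" mime list into an array of patterns
$mimes = "image/*, application/pdf\napplication/msword";
$patterns = preg_split('/[\s,]+/', trim($mimes), -1, PREG_SPLIT_NO_EMPTY);
// $patterns => ['image/*', 'application/pdf', 'application/msword']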
Once this has been tidied up, the (admittedly messy) branch will be rebased properly. The link rewriting is the next task on the list, but still a wee way to go yet with this.
Note: Not sure why .xls spreadsheets are being imported if refs to them are not stored in the cache file...
Note 2: I am a cretin... ignore the above.
As per the above and in addition:
However, at this point there is an issue with the images and some of the documents (PDFs etc.) in that they don't seem to be encoded correctly, or are corrupted.
Opening the CMS and attempting to list PNGs will yield a "PNG file corrupted by ASCII conversion" error from SilverStripe (actually from GD, which in turn comes from libpng). This commonly occurs with incorrect FTP transfer from source to server using ASCII transfer instead of binary. However, all the equivalent raw images from the crawled site obviously show A-OK in a browser. It's only after we fetch, parse and store them using curl etc. that we seem to get problems.
Further PNG specific investigation has gotten this far:
In StaticSiteContentExtractor#prepareTempFileForUpload(), right after curl_exec():

if ($this->mime == 'image/png') {
    var_dump($fullResponseBody);
    die;
}
string(13315) "HTTP/1.1 200 OK
Date: Fri, 21 Jun 2013 23:55:53 GMT
Content-Type: image/png
...
Server: Microsoft-IIS/6.0
MicrosoftSharePointTeamServices: 12.0.0.6318
X-Powered-By: ASP.NET
...
Public-Extension: http://schemas.microsoft.com/repl-2
Content-Length: 12801
Connection: Keep-Alive
�PNG
��� IHDR�����������|2�#���tEXtSoftware�Adobe ImageReady....(etc)
Also tried stripping out \r\n from $fullResponseBody in StaticSiteContentExtractor#curlRequest() = no dice.
After post-processing the headers and body returned from curl, $this->content in prepareTempFileForUpload() comes from SS_HTTPResponse->getBody(). var_dump($this->content):
<89>PNG
^Z
^@^@^@IHDR\
...(etc)
First byte appears correct for a PNG but the file utility disagrees:
#> file -i /tmp/tmpEZ6pAl
/tmp/tmpEZ6pAl: application/octet-stream; charset=binary
#> convert /tmp/tmpEZ6pAl /tmp/tmpEZ6pAl.png
convert: improper image header
// Double-checking this is PNG-esque data in the temp file by reading it into less (yes, it's binary), but we can see the first byte is \x89 followed by "PNG" (etc)
Additional, random ideas:
// in function prepareTempFileForUpload()
// This ends up just writing an empty file
if ($this->mime == 'image/png') {
    $image = imagecreatefromstring($this->content);
    if (imagepng($image, $tmp_name)) {
        $this->setTmpFileName($tmp_name);
        imagedestroy($image);
        return true;
    }
}
If we dump the output of imagepng()
var_dump(imagepng($image,$tmp_name));
"imagecreatefromstring(): Data is not in a recognized format"
These do nothing to help (I had to check though!)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
If I get more time this w/e I'll have another crack but I suspect we may need to dive into some command-line fu and break out the hexeditors, or better just figure out exactly what is wrong with the binary data and preg_replace it.
I have sorted the issue. The culprit?
curl_setopt($ch, CURLOPT_HEADER, 1);
Set that to zero (for files; we still need it for text/html) and we're golden.
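A rough sketch of what that fix amounts to in a curlRequest()-style helper; $url, $mime and the exact option set are illustrative, not the module's actual code:

<?php
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

if (strpos($mime, 'text/html') === 0) {
    // Pages: keep the headers in the response so they can be parsed out later
    curl_setopt($ch, CURLOPT_HEADER, 1);
} else {
    // Binary files (images, PDFs etc.): prepending the headers corrupts the body,
    // so switch CURLOPT_HEADER off entirely
    curl_setopt($ch, CURLOPT_HEADER, 0);
}

$body = curl_exec($ch);
curl_close($ch);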
Re-tested and all image and document types are fetched, stored and "uploaded" correctly. Viewable A-OK in the CMS. Fixes pushed to the assets-import branch.
I'll clean-up the branch, rebase and issue a PR when I'm back in the office next week.
This is now working A-Ok in my project fork.
Apropos of a discussion somewhere, both StaticSitePageTransformer and StaticSiteFileTransformer now extend from a common StaticSiteDataTypeTransformer, which contains all the logic common to both classes. There is also some scope for creating custom transformers that extend from it (a rough sketch follows below).
See: https://github.com/phptek/silverstripe-staticsiteconnector
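Purely to illustrate that extension point, a hypothetical custom transformer might look like this; the class name is made up and the transform() signature is assumed to mirror the existing page/file transformers:

<?php
// Hypothetical custom transformer building on the shared base class.
class StaticSiteNewsItemTransformer extends StaticSiteDataTypeTransformer {

    public function transform($item, $parentObject, $duplicateStrategy) {
        // Map the crawled item onto a project-specific DataObject here,
        // reusing whatever helpers the base class exposes.
        // (Stubbed: the real field mapping depends on the import schema.)
    }
}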
Any image or file content referenced within imported content should be pulled into the assets folder.
Most logical would be to keep the URL hierarchy of imported assets the same.
None of these assets should correspond to SilverStripe pages; only HTML assets should do that.
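A minimal sketch of what keeping the URL hierarchy could mean in practice, assuming a SilverStripe 3 setup; Folder::find_or_make() is the framework's standard helper for nested asset folders, everything else here is illustrative:

<?php
// Map a crawled URL onto an assets/ path that mirrors the source site's hierarchy,
// e.g. http://example.org/about/brochures/intro.pdf => assets/about/brochures/intro.pdf
$parts = parse_url($url);
$relativeDir = trim(dirname($parts['path']), '/');  // "about/brochures"
$fileName = basename($parts['path']);               // "intro.pdf"

$folder = Folder::find_or_make($relativeDir);       // creates the nested folders under assets/
$filePath = $folder->getFullPath() . $fileName;     // where the downloaded data would be written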