Scraper/Controller - GitHub issues

luqmaan commented 11 years ago

Due Monday

luqmaan commented 11 years ago

Ghost.py is way too buggy. I'm testing out this: http://stackoverflow.com/a/15699761/854025

jmfield2 commented 11 years ago

Hmm. That is one way to do it. I don't know why it is using so much RAM and crashing; I haven't seen that yet on my Linode, even with just 512 MB of RAM (1024 MB now because they rock)... I think Selenium uses a Java component, right? You can also try connecting directly to PhantomJS using pipes or something. I started on this but haven't finished integrating it with Python yet.

It won't attach here, but:

console.js:

    // Usage:
    // echo -e "document.querySelector('BODY').innerHTML\nexit" | phantomjs console.js http://yahoo.com/
    var fs = require('fs');
    var system = require("system");
    var command_file = "/dev/stdin";

    if (system.args[1] == undefined) {
        console.log("\nUSAGE:");
        console.log("phantomjs console.js URL\n");
        phantom.exit();
    }

    var data = [];
    var loc = system.args[1];

    var page = require("webpage").create();
    page.open(loc, function(status) {
        if (status == "success") {
            setInterval(readCommand, 500);
            readCommand();
        }
        else {
            // was "location", which is not the URL we tried to open
            console.log(loc + " failed to open - " + status);
            phantom.exit();
        }
    });

    // Read newline-separated commands from stdin and run each one.
    function readCommand() {
        var command = fs.read(command_file).split('\n');

        for (var i in command) {
            var line = command[i];

            if (line.length <= 0) continue;

            var cmds = line.split(" ");
            if (line == "exit") {
                // dump everything collected so far as JSON, then quit
                console.log(JSON.stringify(data));
                phantom.exit();
            }
            else if (cmds[0] == "capture") {
                // "capture FILE" renders the page to FILE
                data[i] = cmds[1];
                page.render(cmds[1]);
            }
            else {
                // XXX eval() ! XXX
                var ret = page.evaluate(function(line) { return eval(line); }, line);
                data[i] = ret;
            }
        }
    }

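test.py:

    import json
    import subprocess
    from subprocess import PIPE

    # universal_newlines=True puts the pipes in text mode, so str works on Python 3 too
    p = subprocess.Popen(
        ["/root/phantomjs1.9/bin/phantomjs", "console.js", "http://yahoo.com"],
        stdin=PIPE, stdout=PIPE, universal_newlines=True)

    # one command per line; "exit" makes console.js print its results as JSON
    p.stdin.write("capture test.png\ndocument.querySelector('BODY').innerHTML\nexit")
    p.stdin.close()

    buf = p.stdout.readlines()

    buf = json.loads(''.join(buf))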

jmfield2 commented 11 years ago

Another option is to schedule tasks for the scraper to run and process them periodically instead of on-demand ... We would just need a screen placeholder or something instead of a broken image
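A rough sketch of that scheduled approach, assuming an in-memory queue and a hypothetical `scrape` callable (neither comes from the PriceChecker.py code): adding a product only records the URL and hands back a placeholder image, and a periodic job drains the queue later.

```python
from collections import deque

# Hypothetical placeholder shown until the real screenshot exists.
PLACEHOLDER_IMAGE = "placeholder.png"


class ScrapeQueue:
    """Collects product URLs now, scrapes them later in batches."""

    def __init__(self, scrape):
        self._scrape = scrape      # callable: url -> dict with scraped fields
        self._pending = deque()
        self.products = {}         # url -> product info shown to the user

    def add_product(self, url):
        """Called on demand: record the URL and return a placeholder at once."""
        if url not in self.products:
            self.products[url] = {"image": PLACEHOLDER_IMAGE, "price": None}
            self._pending.append(url)
        return self.products[url]

    def run_pending(self):
        """Called periodically (cron, timer, ...): scrape everything queued."""
        done = 0
        while self._pending:
            url = self._pending.popleft()
            try:
                self.products[url] = self._scrape(url)
            except Exception as exc:
                # Keep the placeholder and record why the scrape failed.
                self.products[url]["error"] = str(exc)
            done += 1
        return done
```

The user never waits on PhantomJS this way; the cost is that the real image and price only show up after the next `run_pending()` pass.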

luqmaan commented 11 years ago

Nice to hear from you ICMP!

I like that last option. When you click the add product button, an AJAX request is issued that creates the product. The server responds to acknowledge that it is adding it. The server then responds again with either an error or the image URL and price in JSON. I don't think it's possible to have the server respond more than once to a single AJAX request, though.
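Since one AJAX request gets one response, a common workaround is two kinds of request: the first returns an acknowledgement with a job id, and the client then polls until the result is ready. A minimal sketch of the JSON payloads, assuming hypothetical function names (`start_add_product`, `finish_job`, `poll_job`) and an in-memory dict standing in for real storage:

```python
import itertools
import json

_jobs = {}                       # job id -> state dict (in-memory, illustration only)
_next_id = itertools.count(1)


def start_add_product(url):
    """First request: acknowledge immediately, before any scraping happens."""
    job_id = next(_next_id)
    _jobs[job_id] = {"status": "pending", "url": url}
    return json.dumps({"job": job_id, "status": "pending"})


def finish_job(job_id, image_url=None, price=None, error=None):
    """Called by the scraper when it is done, with either a result or an error."""
    job = _jobs[job_id]
    if error is not None:
        job.update(status="error", error=error)
    else:
        job.update(status="done", image=image_url, price=price)


def poll_job(job_id):
    """Later requests: report the current state of the job as JSON."""
    job = _jobs.get(job_id)
    if job is None:
        return json.dumps({"status": "error", "error": "unknown job"})
    return json.dumps(job)
```

On the page, the add product handler would keep requesting the poll endpoint every second or so until `status` is no longer `pending`.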

I've started using selenium + phantom on my Mac. I have yet to test it on my server, but I think it will work.

https://github.com/createch/PriceChecker.py/tree/phantom

Check out the phantom branch. Take a look at scraper2.py. The advantage of using Selenium instead of pipes and node.js or PhantomJS directly is that this is much easier and seems to be used by at least a couple of people.

I've gotten scraper2's product_info() method working!
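For anyone who hasn't opened the branch, here is roughly what the Selenium route looks like. This is not the code in scraper2.py, just a sketch under assumptions: `driver` is a Selenium WebDriver created elsewhere (backed by PhantomJS in this setup), the function name and price regex are made up for illustration, and only the standard WebDriver members `get`, `page_source`, `title` and `save_screenshot` are used.

```python
import re

# Matches prices such as "$1,299.99"; the pattern is a guess, not scraper2.py's.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")


def sketch_product_info(driver, url, screenshot_path="product.png"):
    """Load url in the given WebDriver and pull out a title, price and image."""
    driver.get(url)                                   # navigate to the product page
    match = PRICE_RE.search(driver.page_source)       # first dollar amount in the HTML
    price = match.group(1).replace(",", "") if match else None
    driver.save_screenshot(screenshot_path)           # rendered page as the product image
    return {"title": driver.title, "price": price, "image": screenshot_path}
```

Compared with the pipes approach, there is no command protocol to maintain: the driver object does the talking to the browser.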

luqmaan commented 11 years ago

Test phantom out at http://pychecker.com/. Add a product.

luqmaan / PriceChecker.py

Scraper/Controller #6