hashplan / calendar


crawler #23

Open · hashplan opened this issue 10 years ago

hashplan commented 10 years ago

Here is where the crawler scripts are located on the server. I'll include a small write-up covering the basic logic and how to run them.

[screenshot: location of the crawler scripts on the server]

romasolot commented 10 years ago

I moved the crawler into the application and added a button for a manual run. I have not added the jobs to cron yet.

hashplan commented 10 years ago

Hi Roma. This is great.

Could you specify how to log in to rerun it (i.e., where the button is)?

romasolot commented 10 years ago

Log in as admin@admin.com (or here is my user devdevdevdevdevdev727@gmail.com : 12345678, I made that user an admin as well) and follow the link http://www.hashplans.com/admin/crawler. It's a long process because there are 30-second timeouts between queries so that we don't get banned. I made a database backup today, just in case.
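For illustration, the throttling could look something like this minimal sketch (hypothetical code, not the actual crawler; it assumes cURL and a fixed 30-second pause between requests):

```php
<?php
// Hypothetical sketch: fetch a list of URLs with a 30-second pause
// between requests so the source site does not ban our IP.
function fetchWithThrottle(array $urls, int $pauseSeconds = 30): array
{
    $pages = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        $pages[$url] = curl_exec($ch);
        curl_close($ch);
        if ($i < count($urls) - 1) {
            sleep($pauseSeconds); // wait before the next request
        }
    }
    return $pages;
}
```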

hashplan commented 10 years ago

Roma, I did not run the crawler over the weekend.

Let's nail down StubHub first before moving on to Pollstar. For the first iteration of testing, for us and the few users we will open this up to, we will use StubHub as the main source of data.

romasolot commented 10 years ago

I fixed the driver for parsing StubHub: only events that have both a name and a date are parsed (I removed the empty events from the database). I added the crawler run to a weekly cron job, ran it manually, and updated the events.
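The name-and-date filter might look something like this (a hypothetical sketch; the field names are assumptions, not the real schema):

```php
<?php
// Hypothetical sketch: skip scraped events that are missing
// a name or a date before they reach the database.
function isCompleteEvent(array $event): bool
{
    return !empty($event['name']) && !empty($event['date']);
}

$scrapedEvents = [
    ['name' => 'The Lion King', 'date' => '2015-07-01'],
    ['name' => '', 'date' => null], // would be skipped
];
$valid = array_filter($scrapedEvents, 'isCompleteEvent');
```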

The task is done; however, events can probably be moved or cancelled, and this parser does not handle that. I have a couple of ideas about it and need to try them to choose the best one. Could you remind me why we don't use the StubHub API? I remember asking you about it already, but I can't find the email.

hashplan commented 10 years ago

We cannot store any of the data if we use their API, so we are not using it for legal reasons.

Let's set up the crawler to run once a week. Most events probably won't change that much day to day, and not running daily reduces the risk of being blocked.

romasolot commented 10 years ago

OK, understood. Yes, the cron job has already been set up to run the crawler once a week (Monday 2am server time).
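For reference, a Monday 2am weekly entry in crontab would look like this (the script path is a placeholder, not the actual location):

```
0 2 * * 1 php /var/www/hashplan/crawler/run.php
```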

hashplan commented 10 years ago

Somehow the crawler does not seem to capture all events. For example, http://www.stubhub.com/search/doSearch?searchStr=all&searchMode=event&rows=50&start=0&nS=0&location=664;New+York+New+York+Metro&ae=1&sp=Date&sd=1&ven=Minskoff+Theatre;Minskoff+Theatre

Our venues table seems to be missing the Minskoff Theatre as a venue in New York, and there are no events in the events table for The Lion King in New York. Also, per the crawlerstatus table, New York has not been updated since June 27.
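A quick way to check this kind of gap directly (a hypothetical sketch; the connection details and column names are assumptions, only the table names come from the comment above):

```php
<?php
// Hypothetical diagnostic: check whether a venue made it into the
// database and when the crawler last touched a given city.
$pdo = new PDO('mysql:host=localhost;dbname=hashplan', 'user', 'pass');

$stmt = $pdo->prepare('SELECT COUNT(*) FROM venues WHERE name = ?');
$stmt->execute(['Minskoff Theatre']);
echo 'Minskoff Theatre rows: ' . $stmt->fetchColumn() . PHP_EOL;

$stmt = $pdo->prepare('SELECT updated_at FROM crawlerstatus WHERE city = ?');
$stmt->execute(['New York']);
echo 'New York last crawled: ' . $stmt->fetchColumn() . PHP_EOL;
```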

romasolot commented 9 years ago

I'd like to try moving the logic of the stored procedure to PHP. I expect this to decrease the overall parsing time. I will let you know the results.

romasolot commented 9 years ago

Hi Stas

The server was banned by StubHub. I manually added several proxy servers, but I can't say how long they will stay usable. We may have to use a service that provides proxy servers.

I also slightly reworked the stored procedure, but once I ran tests I realized that writing to the database is not the issue: there are sometimes delays of up to 30 seconds when connecting to StubHub.
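Routing the crawler's requests through a proxy could look like this (a hypothetical sketch; the proxy address is a placeholder):

```php
<?php
// Hypothetical sketch: route a crawler request through a proxy
// so StubHub sees the proxy's IP instead of the server's.
$ch = curl_init('http://www.stubhub.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, '203.0.113.10:8080'); // placeholder proxy
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
$html = curl_exec($ch);
if ($html === false) {
    echo 'Proxy request failed: ' . curl_error($ch) . PHP_EOL;
}
curl_close($ch);
```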

romasolot commented 9 years ago

Hi Stas

It seems proxy servers do not resolve the situation: 1) performance degraded a lot (parsing took more than a day); 2) these proxies get banned quickly as well.

There is one more scheme: deploy an additional instance with its own IP address and remove it after parsing. We can also limit the parsing window, for example to events within the next 3-6 months.

hashplan commented 9 years ago

We might have to migrate Amazon boxes; it seems the current one lasted a few months.

If we set up a new instance and remove it after parsing, will we need to set it up again for the next parse? We need events updated at least once a month, if not more frequently.

romasolot commented 9 years ago

> We might have to migrate Amazon boxes; it seems the current one lasted a few months.

I'm not sure I understand you right.

> If we set up a new instance and remove it after parsing, will we need to set it up again for the next parse? We need events updated at least once a month, if not more frequently.

Yes, a new instance will be created automatically for each parsing run from an already saved AMI (with the parsing script set to autorun). Once the script is done, the instance will be stopped and removed. This keeps instance costs minimal and gives us a new IP for each parsing run.
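In outline, the disposable-instance scheme could be driven through the AWS SDK for PHP roughly like this (a sketch only; the region, AMI ID, and instance type are placeholders):

```php
<?php
// Hypothetical sketch of the disposable parser instance:
// launch an instance from the saved AMI (the parsing script
// autoruns on boot), then terminate it when parsing is done.
require 'vendor/autoload.php';

use Aws\Ec2\Ec2Client;

$ec2 = new Ec2Client(['region' => 'us-east-1', 'version' => 'latest']);

// Launch one instance from the saved AMI; it gets a fresh IP.
$result = $ec2->runInstances([
    'ImageId'      => 'ami-xxxxxxxx', // placeholder: saved parser AMI
    'InstanceType' => 't2.micro',
    'MinCount'     => 1,
    'MaxCount'     => 1,
]);
$instanceId = $result['Instances'][0]['InstanceId'];

// ... wait for the parsing script to report completion ...

// Terminate the instance so we only pay for the parsing window.
$ec2->terminateInstances(['InstanceIds' => [$instanceId]]);
```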

hashplan commented 9 years ago

Roma, are you talking about a new AWS instance that requires a new Amazon account to be set up and configured, or something else?

I was talking about a new AWS account. Our old (current) one was not blocked for a few months (basically from when we started until last week). I think Amazon might still have a promotional deal for free AWS for a year, but I have to check.

How will what you are proposing deal with the IP being blocked? Could you describe it a bit more? Would it use the same IP every time it crawls?

romasolot commented 9 years ago

Hi Stas

I'm not sure I understand you right. In any case, as I mentioned, stubhub.com blocked the IP address of your AWS server (its Elastic IP in this case), and there is no way to parse without a proxy at the moment. I would propose the following: 1) change the Elastic IP of the main server so that it is no longer banned (the new IP would need to be set up for the domain as well); 2) use a separate instance for the parser, created automatically via the AWS API, so that each time an instance is created the parser works from a different IP address (this applies only to the parser; the main site keeps running on its own instance).
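For the Elastic IP part, the AWS SDK for PHP calls would be roughly these (again a sketch; the region and instance ID are placeholders):

```php
<?php
// Hypothetical sketch: allocate a fresh Elastic IP and attach it
// to the main server, replacing the banned address.
require 'vendor/autoload.php';

use Aws\Ec2\Ec2Client;

$ec2 = new Ec2Client(['region' => 'us-east-1', 'version' => 'latest']);

// Allocate a new Elastic IP in the VPC.
$address = $ec2->allocateAddress(['Domain' => 'vpc']);

// Associate it with the main server's instance (placeholder ID).
$ec2->associateAddress([
    'InstanceId'   => 'i-xxxxxxxx',
    'AllocationId' => $address['AllocationId'],
]);

// The domain's DNS A record would also need updating to the new IP.
echo 'New Elastic IP: ' . $address['PublicIp'] . PHP_EOL;
```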