mnich0ls / evee-sd

An event aggregation app for events in San Diego

Web scraper: sandiego.org #1

Closed mnich0ls closed 5 years ago

mnich0ls commented 5 years ago

Develop a web scraper that can scrape events from the site: https://www.sandiego.org/explore/events.aspx

The properties required for each event are:

  - Name/Title
  - Event type (festival, music, food, etc.)
  - Location/Neighborhood
  - Date
  - Cost (or "free" if it is a free event)
  - Description
  - Thumbnail image URL
  - URL to purchase tickets and/or the website for the organization hosting the event
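As a rough illustration of the properties requested above, one stored record could be shaped like this. The field names (`title`, `eventType`, `thumbnailUrl`, and so on) and the helpers `normalizeEvent` and `missingFields` are hypothetical; the issue does not prescribe a schema, so treat this as a sketch rather than the agreed format.

```javascript
// Hypothetical field names mirroring the properties requested in the issue.
const REQUIRED_FIELDS = [
  'title', 'eventType', 'location', 'date',
  'cost', 'description', 'thumbnailUrl', 'url',
];

// Build a record with every required field present. Missing values become
// null so that consumers can rely on a fixed shape.
function normalizeEvent(raw) {
  const event = {};
  for (const field of REQUIRED_FIELDS) {
    const value = raw[field];
    event[field] = typeof value === 'string' ? value.trim() : (value ?? null);
  }
  // Assumption: a zero price or the word "free" both mean a free event.
  const cost = event.cost;
  if (cost === 0 || (typeof cost === 'string' && /^(\$?0(\.0+)?|free)$/i.test(cost))) {
    event.cost = 'free';
  }
  return event;
}

// List the required fields that are still empty after normalization.
function missingFields(event) {
  return REQUIRED_FIELDS.filter((f) => event[f] === null || event[f] === '');
}
```

Calling `missingFields(normalizeEvent(raw))` before storing a record gives a quick way to spot events where the scraper failed to extract something.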

For now, the JSON results can simply be stored locally on the machine doing the scraping, but designed to be posted to another server/service once it is ready.

It is not necessary to scrape events more than a month or two in advance, or whatever is easiest for now.
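One way to limit results to the next month or two is to filter on the event date after scraping. This is only a sketch: it assumes each event carries an ISO 8601 `date` string, which is not confirmed by the thread, and `withinWindow` is a hypothetical helper name.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Keep events whose date falls between `now` and `now + windowDays`.
// Assumes `date` is an ISO 8601 string (for example "2019-03-15"); the
// site's real date format may need to be parsed into that form first.
function withinWindow(events, now, windowDays = 60) {
  const start = now.getTime();
  const end = start + windowDays * DAY_MS;
  return events.filter((event) => {
    const time = Date.parse(event.date);
    // Unparseable dates are dropped rather than guessed at.
    return !Number.isNaN(time) && time >= start && time <= end;
  });
}
```

The 60-day default is an arbitrary reading of "a month or two" and can be changed per call.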

(Please leave a comment if you need further clarification)

jremi commented 5 years ago

I would have the script run more frequently. For example, have it run every 2-3 days and just grab the top 15 results. Each time it runs, it can check whether the record has already been stored. That would be the easiest way. Let me know if that would work. I'm already looking at how the formData is getting posted on that page.
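The "check if the record has already been stored" idea could look roughly like the following. The choice of title plus date as the identity of an event is an assumption made for illustration; if the site exposes its own event ID, that would be a better key. `eventKey` and `selectNewEvents` are hypothetical names.

```javascript
// Derive a stable identity for an event. Title + date is an assumption;
// a source-provided ID would be preferable if one is available.
function eventKey(event) {
  return `${event.title}|${event.date}`.toLowerCase();
}

// From the first `limit` scraped events, return only those not yet stored.
function selectNewEvents(storedEvents, scrapedEvents, limit = 15) {
  const seen = new Set(storedEvents.map(eventKey));
  const fresh = [];
  for (const event of scrapedEvents.slice(0, limit)) {
    const key = eventKey(event);
    if (!seen.has(key)) {
      seen.add(key); // also guards against duplicates within one scrape
      fresh.push(event);
    }
  }
  return fresh;
}
```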

jremi commented 5 years ago

Anyway, I solved your issue. I can easily get all of the JSON data on that page. In fact, I was able to get all of the events in JSON. Let me know how much you are willing to pay and I can give you the info you need ASAP. Screenshot to show proof/example:

https://imgur.com/a/sJQXB4E

jremi commented 5 years ago

I was able to extract totalResults: 487 (events)

jremi commented 5 years ago

We can take the JSON and manipulate it however you want... https://imgur.com/a/Upe54nL

mnich0ls commented 5 years ago

Hi @jremi, I am very impressed with how quickly you delivered a solution. I've had bids from $100-$500, but no one said they could do it as quickly as you did. I will be posting other tasks soon if you're interested in those as well. I want to compensate you fairly so that you are motivated to continue working with me. What do you believe is a fair price?

jremi commented 5 years ago

@mnich0ls, sounds good. I would be happy to continue working with you. For this first task, before I can deliver the final solution and give you a bid, I need to know the following:

  1. Where should the JSON data be stored? For example: Locally stored JSON file, Google Firebase Realtime database or maybe MongoDB?

Note: Based on your initial comment, I can have the data stored in a local JSON file, and then when we are ready we can write some code so that it is, as you put it, "posted to another server/service once it is ready."

  2. Where do you want the web service to run? For example: Do you have a cloud server where we can host the code that will run each month?

Let me know your thoughts on the details.

Regards

mnich0ls commented 5 years ago

@jremi, let's use Firebase Realtime database for now. The collection name would be: scraped_events

You can use the auth token: TMe7oNMbWHLHcN45t2uYcpltEQCAeHCrKeh8V7Xn

I will work on securing this later; for now I'm allowing public read/write.

REST example:

curl -X POST -d '{"title" : "Test scraped event from CLI"}' 'https://evee-sd.firebaseio.com/scraped_events.json?auth=TMe7oNMbWHLHcN45t2uYcpltEQCAeHCrKeh8V7Xn'

{"name":"-L_T4A4hL2e1WoEydHqF"}

curl 'https://evee-sd.firebaseio.com/scraped_events.json?auth=TMe7oNMbWHLHcN45t2uYcpltEQCAeHCrKeh8V7Xn'

{"-L_T8AQPuNUouGszODz_":{"title":"Test scraped event from CLI"}}
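For reference, the same REST call could be made from Node.js along these lines. This is a sketch, not the scraper's actual code: `buildCollectionUrl` and `postEvent` are hypothetical helpers, the token is passed in as a parameter rather than hardcoded, and the global `fetch` assumes Node 18 or later.

```javascript
// Build the REST URL for a Realtime Database collection, appending the
// auth token as a query parameter as in the curl example.
function buildCollectionUrl(databaseUrl, collection, authToken) {
  const base = databaseUrl.endsWith('/') ? databaseUrl : `${databaseUrl}/`;
  const url = new URL(`${collection}.json`, base);
  url.searchParams.set('auth', authToken);
  return url.toString();
}

// POST one event to scraped_events. On success Firebase responds with
// {"name": "<generated key>"}, which is returned to the caller.
async function postEvent(databaseUrl, authToken, event) {
  const response = await fetch(
    buildCollectionUrl(databaseUrl, 'scraped_events', authToken),
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(event),
    },
  );
  if (!response.ok) {
    throw new Error(`Firebase write failed with HTTP ${response.status}`);
  }
  const { name } = await response.json();
  return name;
}
```

Reading the token from an environment variable at the call site keeps it out of the source code, which matters once public read/write is turned off.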

As far as how we host and execute it, what are the dependencies? What did you write it in? What frameworks does it require? Can we host it as a node.js app on Firebase?

jremi commented 5 years ago

If we are going to use Firebase, then we can set up a Google Cloud Function for Firebase to host the code snippet. The snippet will be triggered once a month. There is a simple web-based cron-job dashboard that can be used to modify the schedule on which the "scraper" runs. I will write the function in Node.js.

So the proposed game plan:

  1. Deploy the Google Cloud Function to the Firebase evee-sd project.
  2. The function will do the following:
     -- Perform a GET request against https://www.sandiego.org/explore/events.aspx
     -- Take all of the event data and store it directly in the Firebase Realtime Database under the node scraped_events
  3. Configure a cron-job task that will trigger this function on a monthly basis.
  4. Update/write any new event records to the Firebase Realtime Database.
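The flow above can be sketched as a single function with its dependencies injected. Everything here is illustrative: `runScrape`, `fetchEvents`, `loadStoredKeys`, `saveEvent`, and `keyOf` are hypothetical names, and the real function may be organized quite differently.

```javascript
// Orchestrates one scraper run. The dependencies are injected so the flow
// can be exercised without network access; all names are hypothetical.
async function runScrape({ fetchEvents, loadStoredKeys, saveEvent, keyOf }) {
  const scraped = await fetchEvents();                 // GET the events page data
  const storedKeys = new Set(await loadStoredKeys());  // keys already in the database
  let written = 0;
  for (const event of scraped) {
    const key = keyOf(event);
    if (storedKeys.has(key)) continue;                 // only write new records
    await saveEvent(key, event);                       // store under scraped_events
    storedKeys.add(key);
    written += 1;
  }
  return { scraped: scraped.length, written };
}
```

The cron job in step 3 would then only need to invoke whatever HTTP endpoint wraps `runScrape`, so the schedule can change without touching the scraping logic.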

I can do this for $350 since I want to build up a longer-term relationship with you. I will build this out on my own test Firebase server now, and once it's ready I can send you some demo examples so you can review them and see that the task is complete. We can then finalize delivery by deploying to your Firebase project. I can also push the code as a commit if you have a private repository, and I will include a basic README.md (Markdown) covering usage.

I like to use Venmo for payment. We can do this after the task is fully complete and ready.

Let me know if you agree with this proposed breakdown.

Regards,

mnich0ls commented 5 years ago

This all sounds reasonable to me, however, I believe the function/service would need to be triggered on at least a daily basis (since there may be frequent updates to events). I don't see any reason that would change the scope though.

Before you continue, let me publish a list of the other sites I'm looking at and let me know if any of them look any more difficult to scrape than this site. If so, how much more complex? I want to be able to estimate how much it will cost to cover these sites.

jremi commented 5 years ago

Google does not allow outbound networking on the free Firebase Spark plan. This means you need a paid billing plan to make outbound requests to external, non-Google services. I only have the free plan for testing, and I'm not sure whether your plan is paid or not. For now I will set up the demo on a different service.

jremi commented 5 years ago

For now I deployed the scraper to Heroku Cloud and I connected to my own Firebase realtime database to store the events for my testing.

Examples: https://imgur.com/a/azQwQSs https://imgur.com/a/x2hEAAa

jremi commented 5 years ago

Cron-job scheduling works. I also implemented a basic HTTP auth header requirement for invoking the scraper's GET endpoint.
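A header check of this kind might look like the following. This sketch assumes the standard HTTP Basic scheme (`Authorization: Basic base64(user:password)`); the thread does not say which scheme the deployed scraper uses, and `isAuthorized` is a hypothetical helper.

```javascript
// Check an incoming Authorization header against expected credentials.
// Assumes HTTP Basic auth; the actual scheme used is not stated in the thread.
function isAuthorized(authorizationHeader, expectedUser, expectedPassword) {
  if (typeof authorizationHeader !== 'string') return false;
  const [scheme, encoded] = authorizationHeader.split(' ');
  if (!encoded || scheme.toLowerCase() !== 'basic') return false;
  const decoded = Buffer.from(encoded, 'base64').toString('utf8');
  const separator = decoded.indexOf(':');
  if (separator === -1) return false;
  const user = decoded.slice(0, separator);
  const password = decoded.slice(separator + 1); // passwords may contain ':'
  return user === expectedUser && password === expectedPassword;
}
```

The endpoint handler would return HTTP 401 when this check fails. For anything beyond a demo, a constant-time comparison such as Node's `crypto.timingSafeEqual` is worth considering in place of `===`.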

https://imgur.com/gUNVxR4 https://imgur.com/a/GonNEgB

jremi commented 5 years ago

I just shot you an email back. Let me know which repo you want this pushed to. Or, if we push to this repo, do you want me to create a new directory called webscrapers?

mnich0ls commented 5 years ago

Open a PR into this directory please: web-scrapers/sandiego.org/

jremi commented 5 years ago

PR opened https://github.com/mnich0ls/evee-sd/pull/7