Closed · mnich0ls closed this issue 5 years ago
I would run the script more frequently, say every 2-3 days, and just grab the top 15 results. Each time it runs it can check whether the record has already been stored; that would be the easiest way. Let me know if that would work. I'm already looking at how the formData is getting posted on that page.
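The "check if the record has already been stored" step boils down to filtering each scrape against the keys already in the database. A minimal sketch, assuming the event URL serves as a stable unique key (the function and field names here are hypothetical, not from the actual scraper):

```javascript
// Keep only events whose key has not been stored yet. The event URL is
// assumed to be a stable unique key; the real records may use another field.
function filterNewEvents(scraped, storedKeys) {
  const seen = new Set(storedKeys);
  return scraped.filter((ev) => !seen.has(ev.url));
}

const stored = ['https://example.com/events/a'];
const scraped = [
  { title: 'Event A', url: 'https://example.com/events/a' }, // already stored
  { title: 'Event B', url: 'https://example.com/events/b' }, // new
];
console.log(filterNewEvents(scraped, stored).map((ev) => ev.title)); // [ 'Event B' ]
```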
Anyway, I solved your issue. I can easily get all of the JSON data on that page; in fact, I was able to get all of the events in JSON. Let me know how much you are willing to throw me in $ and I can give you the info you need ASAP. Screenshot as proof/example:
I was able to extract totalResults: 487 (events)
We can take the JSON and manipulate it however you want... https://imgur.com/a/Upe54nL
Hi @jremi , I am very impressed with how quickly you delivered a solution. I've had bids from $100-$500, but no one said they could do it as quickly as you did it. I will be posting other tasks soon if you're interested in those as well. I want to compensate you fairly so that you are motivated to continue working with me. What do you believe is a fair price?
@mnich0ls, sounds good. I would be happy to continue working with you. Related to this first task, before I can deliver the final solution and give you the bid, I need to know the following:
Note: Based on your initial comment, I can have the data stored to a local JSON file and then when we are ready we can write some code to ... "posted to another server/service once it is ready."
Let me know your thoughts on the details.
Regards
@jremi, let's use Firebase Realtime database for now. The collection name would be: scraped_events
You can use the auth token: TMe7oNMbWHLHcN45t2uYcpltEQCAeHCrKeh8V7Xn
I will work on securing this later, for now I'm allowing public read/write.
REST example:

```shell
# Write an event:
curl -X POST -d '{"title" : "Test scraped event from CLI"}' \
  'https://evee-sd.firebaseio.com/scraped_events.json?auth=TMe7oNMbWHLHcN45t2uYcpltEQCAeHCrKeh8V7Xn'
# {"name":"-L_T4A4hL2e1WoEydHqF"}

# Read the collection back:
curl 'https://evee-sd.firebaseio.com/scraped_events.json?auth=TMe7oNMbWHLHcN45t2uYcpltEQCAeHCrKeh8V7Xn'
# {"-L_T8AQPuNUouGszODz_":{"title":"Test scraped event from CLI"}}
```
As far as how we host and execute it: what are the dependencies? What did you write it in? What frameworks does it require? Can we host it as a Node.js app on Firebase?
If we are going to use Firebase, then we can set up a Google Cloud Function for Firebase to host the code. The function will be triggered once a month, and there is a simple web-based cron-job dashboard that can be used to easily modify the schedule on which the "scraper" runs. The Cloud Function will be written in Node.js.
So, the proposed game plan:
I can do this for $350, since I want to build a longer-term relationship with you. I will build this out on my own test Firebase server now, and once it's ready I can send you some demo examples to review so you can confirm the task is complete. We can then finalize delivery via deployment to your Firebase project. I can also commit and push the code if you have a private repository, and will include a basic README.md (markdown) covering the basics of usage.
I like to use Venmo for payment. We can do this after the task is fully complete and ready.
Let me know if you agree with this proposed breakdown.
Regards,
This all sounds reasonable to me; however, I believe the function/service would need to be triggered at least daily, since there may be frequent updates to events. I don't see any reason that would change the scope, though.
Before you continue, let me publish a list of the other sites I'm looking at, and let me know if any of them look any more difficult to scrape than this site. If so, how much more complex? I want to be able to estimate how much it will cost to cover these sites.
Google's free Firebase Spark plan does not allow "outbound networking". This means you need a paid monthly billing plan to make outbound requests to external non-Google services. I only have the free plan for testing, and I'm not sure whether your plan is paid or not, so I'll set up the demo on a different service for now.
For now I've deployed the scraper to Heroku and connected it to my own Firebase Realtime Database to store the events for testing.
Examples: https://imgur.com/a/azQwQSs https://imgur.com/a/x2hEAAa
Cron-job scheduling works. I also implemented a basic HTTP auth header requirement for invoking the scraper's GET endpoint.
I just shot you an email back. Let me know which repo you want this pushed to, or, if we push to this repo, do you want me to create a new directory called webscrapers?
Open a PR into this directory please: web-scrapers/sandiego.org/
Develop a web scraper that can scrape events from the site: https://www.sandiego.org/explore/events.aspx
The properties required for each event are:

- Name/Title
- Event Type (festival, music, food, etc.)
- Location/Neighborhood
- Date
- Cost (or free if free event)
- Description
- Thumbnail image URL
- URL to purchase tickets and/or the website for the organization hosting the event
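A scraped record covering those properties might look like the following. All field names and values are hypothetical, shown only to illustrate one reasonable JSON shape, not a confirmed schema:

```javascript
// Hypothetical example record mapping the required properties to JSON fields.
const sampleEvent = {
  title: 'Example Street Fair',    // Name/Title
  eventType: 'festival',           // Event Type
  neighborhood: 'Gaslamp Quarter', // Location/Neighborhood
  date: '2019-06-15',              // Date
  cost: 'free',                    // Cost (or "free" for free events)
  description: 'A made-up event used to illustrate the record shape.',
  thumbnailUrl: 'https://example.com/thumb.jpg', // Thumbnail image URL
  ticketUrl: 'https://example.com/tickets',      // Ticket/organizer URL
};

console.log(Object.keys(sampleEvent).length); // 8 required properties
```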
For now, the JSON results can simply be stored locally on the machine doing the scraping, but designed to be posted to another server/service once it is ready.
It is not necessary to scrape events more than a month or two in advance, or whatever is easiest for now.
(Please leave a comment if you need further clarification)