openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

create tutorial #421

Open kat opened 10 years ago

kat commented 10 years ago

Create a step-by-step tutorial to make it easier to get started.

equivalentideas commented 9 years ago

Seems like @emikulic has kicked this off with #796 and #798

henare commented 9 years ago

This kind of relates to #846 too.

equivalentideas commented 9 years ago

After our scraping workshop last week, we got consistent feedback that it would be really helpful to have a step-by-step guide on how to write a scraper, which is exactly what this issue is calling for.

This will be a really helpful resource for workshop participants as well as people using morph.io generally, so I'm gonna whip something up.

equivalentideas commented 9 years ago

Here are steps to writing a scraper, based on my own methods and what @henare demonstrated in the workshop:

  1. Find the data you're looking for and work out if it can be scraped.
  2. If it can be scraped, create your new scraper on morph.io. Pick the language you want to write your scraper in, and give it a clear name and description so people can find it through search.
  3. Clone your scraper to your local machine with `git clone <scraper url>`.
  4. Make sure you have all the dependencies installed. If you're writing your scraper in Ruby, do you have Ruby installed? Do you have Bundler installed to manage the libraries your scraper will need?
  5. Now it's time to start writing your scraper.
  6. Open your code editor and look at the example code.
  7. Define your object with the data you want to collect.
  8. Using IRB, get each piece of data for a single record (start small with just one record; there's a sketch of this after the list).
  9. Once you've got each piece of data you need, consider adding a date scraped so you can verify your data later.
  10. Fill out the record you've defined in your scraper.rb file and use the `p` method to print the record you've collected when you run the scraper on the command line.
  11. Now add a loop to your scraper to get every record you need on the page.
  12. If the records you need cover several pages, you'll also need to loop through the pages (see the pagination sketch below).
  13. Save your data using the scraperwiki library (see the saving sketch below).
  14. Push your scraper to morph.io.
  15. Run the scraper and check for errors.
  16. Review the data you've collected by looking at the API and downloading the CSV.
  17. If your scraper needs to run each day, set that up on morph.io.
  18. Celebrate!
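
To make the middle steps concrete, here's a minimal sketch of steps 6 to 10 in Ruby, assuming a hypothetical listing page at `https://example.com/members` with `.member` rows. The URL and CSS selectors are placeholders for the sort of thing you'd work out interactively in IRB:

```ruby
require 'date'
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/members') # placeholder URL

# Pull out each piece of data for the first record on the page,
# one expression at a time, until the selectors look right.
row  = page.at('.member')            # hypothetical CSS selector
name = row.at('.name').text.strip
url  = row.at('a')['href']

record = {
  'name'         => name,
  'url'          => url,
  'date_scraped' => Date.today.to_s  # step 9: so you can verify the data later
}
p record                             # step 10: check your work on the command line
```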
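Steps 11 and 12 then wrap that extraction in loops. A sketch continuing from the same hypothetical page, assuming the pager link is labelled "Next" (Mechanize's `link_with` returns `nil` when nothing matches, which is what ends the loop):

```ruby
page = agent.get('https://example.com/members')

loop do
  # Step 11: extract every record on the current page.
  page.search('.member').each do |row|
    record = {
      'name'         => row.at('.name').text.strip,
      'url'          => row.at('a')['href'],
      'date_scraped' => Date.today.to_s
    }
    p record
  end

  # Step 12: follow the pager until there's no "Next" link left.
  next_link = page.link_with(text: 'Next')
  break unless next_link
  page = next_link.click
end
```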
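For step 13, the scraperwiki gem's `save_sqlite` writes to the data.sqlite file that morph.io serves through its API and CSV download. A sketch saving the `record` hash from above:

```ruby
require 'scraperwiki'

# The first argument names the fields that uniquely identify a record,
# so re-running the scraper updates existing rows instead of duplicating them.
ScraperWiki.save_sqlite(['url'], record)
```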

For the tutorial, I think it would be helpful to take readers through these steps by writing an example scraper. Bills in the NSW Parliament are a nice, simple example that includes pagination, so I think I'll go with that unless we come up with something more exciting.