steadyfish / ogdindiar

R package to access data from Open Government Data Platform - India

Programmatically downloading datasets not available as API #3

Open steadyfish opened 9 years ago

steadyfish commented 9 years ago

Many of the datasets, although not available as APIs, are available in JSON/JSONP format. See if this can be exploited in any way for programmatic data extraction.
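
A JSONP response is ordinary JSON wrapped in a callback call, so one parser could handle both formats. Below is a minimal Python sketch of that idea, for illustration only. The function name `parse_json_or_jsonp` is hypothetical, and the callback pattern is an assumption about how the JSONP exports are wrapped, not something verified against data.gov.in.

```python
import json
import re

# Assumed JSONP shape: `callbackName( <json> )` with an optional trailing semicolon.
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_json_or_jsonp(text):
    """Parse a JSON payload, unwrapping a JSONP callback if one is present."""
    match = _JSONP_RE.match(text)
    # Plain JSON never starts with an identifier followed by "(", so it falls through.
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Both `parse_json_or_jsonp('cb({"a": 1});')` and `parse_json_or_jsonp('{"a": 1}')` go through the same code path after the optional unwrapping step.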

steadyfish commented 9 years ago

Compile a dataset of datasets. Possible columns:

  1. Dataset Title
  2. Date Published
  3. Node Link
  4. Available formats, as Boolean columns - is_xml, is_ods, is_csv, is_json, is_jsonp
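
One row of such a dataset of datasets could look roughly like the Python sketch below. The class name `DatasetRecord` and the helper `make_record` are hypothetical; the fields simply mirror the columns listed above, and the date is kept as a raw string because the site's date format is not assumed here.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetRecord:
    """One row of the proposed catalogue (columns as listed above)."""
    title: str
    date_published: str  # raw string; no date format is assumed
    node_link: str
    is_xml: bool = False
    is_ods: bool = False
    is_csv: bool = False
    is_json: bool = False
    is_jsonp: bool = False


# Names of the Boolean format columns, e.g. {"is_xml", "is_csv", ...}.
_FLAG_NAMES = {f.name for f in fields(DatasetRecord) if f.name.startswith("is_")}


def make_record(title, date_published, node_link, formats):
    """Build a record, setting is_<format> for each recognised format name."""
    flags = {}
    for fmt in formats:
        name = "is_" + fmt.lower()
        if name in _FLAG_NAMES:  # formats without a column are ignored
            flags[name] = True
    return DatasetRecord(title, date_published, node_link, **flags)
```

For example, `make_record("Example dataset", "", "https://data.gov.in/node/95667/", ["CSV", "json"])` would set `is_csv` and `is_json` and leave the other flags `False`.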

Steps:

  1. Traverse all the nodes of this form - https://data.gov.in/node/95667/
  2. If the request returns a 404 error, the page doesn't exist; otherwise it does.
  3. In positive cases, try to determine metadata about the dataset. This will include title, description, ministry, date of publication, etc.
  4. Figure out which export formats are available from XML, ODS, XLS, CSV, JSON, JSONP, or any other. The download link would be of the form - https://data.gov.in/node/95667/datastore/export/xml
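
Steps 1, 2 and 4 can be sketched in Python as below, for illustration only. The two URL patterns come from the examples above. Everything else is an assumption: the function names are hypothetical, the server is assumed to answer `HEAD` requests, and an export is assumed to be available only when its URL returns 200. Step 3 (metadata) is left out because it depends on the page layout.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

BASE_URL = "https://data.gov.in/node"
EXPORT_FORMATS = ("xml", "ods", "xls", "csv", "json", "jsonp")


def node_url(node_id):
    return f"{BASE_URL}/{node_id}/"


def export_url(node_id, fmt):
    return f"{BASE_URL}/{node_id}/datastore/export/{fmt}"


def http_status(url, timeout=10):
    """Return the HTTP status code for url (needs network access)."""
    try:
        # Assumption: the server accepts HEAD requests.
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # e.g. 404 for a node that doesn't exist


def probe_node(node_id, get_status=http_status):
    """Return None if the node page is a 404, else {format: is_available}."""
    if get_status(node_url(node_id)) == 404:
        return None
    # Assumption: an export exists only if its URL answers 200.
    return {fmt: get_status(export_url(node_id, fmt)) == 200
            for fmt in EXPORT_FORMATS}
```

`get_status` is passed in as a parameter so the traversal logic can be tried against a stub before pointing it at the live site, where some delay between requests would be sensible.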

steadyfish commented 9 years ago

Some of the datasets are also available like this - https://data.gov.in/sites/default/files/Lotus_2015.xml

kartiek commented 8 years ago

Hey! Are you working on this issue? I recently wrote some scripts to scrape and get a list of all the datasets. If you are not working on it, I would be interested in contributing that to the package.

steadyfish commented 8 years ago

Hey @kartiek

Would be happy to include this functionality you've built!

Since scraping is inherently a brittle solution, I'm not sure we can include functions/scripts that scrape this data. Nonetheless, it would be helpful to include a curated dataset about the datasets available.