ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf
http://ozunconf17.ropensci.org/

Discussion on accessing Australian data #17

Open njtierney opened 6 years ago

njtierney commented 6 years ago

There are a lot of Australian data sources available through portals such as data.gov.au, which hosts a huge amount of public data.

However, it is almost guaranteed that you will need to invest a solid chunk of time in cleaning the data, checking its quality, and preparing it for analysis.

I'd like to develop a catalogue/table/similar that describes Australian datasets that are ready, or near-ready, to analyse. Or perhaps even just discuss the idea here on the repo.

What I imagine is something like a table where you have columns like:

This could help direct the efforts of researchers and analysts, knowing the state of what is ready to access, and also identify those data sources that might be ripe for an R package containing the data, or a way to access it.

I can think of a few R packages and datasets that we could add right now:

Related to this, there was an R package developed to access data from data.gov.au - ozdata, which could be very useful in accessing the data.

@stevage, might you have some ideas of where we could start looking? Or thoughts on this topic?

HughParsonage commented 6 years ago

I'd add my own packages to this list: (suggestions for API changes welcome)

The ABS holds some very rich data. However, their interface leaves a lot of room for improvement. I have had some experience with accessing ABS APIs. While I have a lot of sympathy for the ABS, in practice I find it easier, faster, and less error-prone to just download the relevant Excel file and type the data out into a tsv file manually than to use their API or automate any of the process. This is obviously a bit too laborious to do (and keep updated) for the whole ABS catalogue, though I would be willing to do it if I knew it would be widely used. ABS cooperation would make the process easier, but would not be absolutely necessary.
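
For concreteness, a rough sketch of that manual download-and-reshape workflow. The URL, sheet name, and skip value below are placeholders; every ABS release has its own layout:

library(readxl)
library(tidyr)
library(readr)

# Placeholder URL -- substitute the spreadsheet for the catalogue of interest
abs_url  <- "https://www.abs.gov.au/some-catalogue/table01.xls"
abs_file <- tempfile(fileext = ".xls")
download.file(abs_url, abs_file, mode = "wb")

# ABS workbooks usually carry several header rows before the data start
raw <- read_excel(abs_file, sheet = "Data1", skip = 9)

# Reshape to long form and write out a tidy tsv
tidy <- pivot_longer(raw, -1, names_to = "series", values_to = "value")
write_tsv(tidy, "table01.tsv")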

The ATO has tidier data and a more proactive approach to releasing data. Some quick enhancements to the taxstats package could include adding their Excel tables to the package, and possibly some contemplation of the 16% sample file.

njtierney commented 6 years ago

Wow, that's great, thanks @HughParsonage ! 💯

An outcome of the ozunconf could be to improve existing package documentation, perhaps by making pull requests to your packages adding READMEs or vignettes with examples, and perhaps even package websites, to improve accessibility.

This issue might also branch out into multiple projects: for example, one group working on documentation, another on searching for and finding datasets and APIs, and another writing packages for existing data.

Really excited for this issue!

stephstammel commented 6 years ago

This would be a huge productivity benefit for many people.

On a related, but probably separate issue: has anybody been following the discussions around Indigenous Data Sovereignty? It seems to me that open source software/data and projects like this would interface very nicely and support Indigenous communities in controlling, preserving, and generating data, under the leadership and/or acquiescence of the Indigenous community, obviously.

dfalster commented 6 years ago

Great ideas re data.gov.au and ABS data (they were on my list to bring up).

In relation to the data from data.gov.au, it's also worth pointing to the great NationalMap portal developed by Data61 (formerly NICTA) for displaying spatial data. As far as I can understand, much of that data is all pulled from data.gov.au (see nationalmap.gov.au/help/data-catalogue.html). It's reasonably clean and has standardised formatting. The portal provides a nice way to visualise it, but we could consider what is needed to pull layers into R.
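
A rough sketch of what pulling a layer into R could look like with the sf package, assuming the portal exposes it as GeoJSON or a zipped shapefile (the URLs below are placeholders, not real NationalMap endpoints):

library(sf)

# GeoJSON (or anything GDAL understands) can often be read straight from a URL
layer <- st_read("https://example.gov.au/some-layer.geojson")

# Zipped shapefiles need to be downloaded and unpacked first
zip_file <- tempfile(fileext = ".zip")
download.file("https://example.gov.au/some-layer.zip", zip_file, mode = "wb")
unzip(zip_file, exdir = tempdir())
shp <- st_read(list.files(tempdir(), pattern = "\\.shp$", full.names = TRUE)[1])

plot(st_geometry(layer))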

njtierney commented 6 years ago

Another example, the Australian Road Deaths Database, contains monthly, quite clean data.

An example of a super brief analysis is here

It looks like there's a bunch of other interesting data, like the airport traffic data.
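
A rough sketch of what reading one of those published CSV extracts might look like; the URL and column names below are guesses, not the real ones:

library(readr)
library(dplyr)

# Placeholder URL for the fatalities extract
crashes <- read_csv("https://example.gov.au/ardd_fatalities.csv")

# e.g. a quick count of fatalities by year and state (column names assumed)
crashes %>%
  count(Year, State) %>%
  arrange(desc(n))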

njtierney commented 6 years ago

@dfalster good idea re the NationalMap portal - it looks like another great trove of data. If there is a way to search it and then pull the layers/shapefiles into R, that would be a huge win. @mdsumner, you might be able to speak a bit to this.

mdsumner commented 6 years ago

I love this topic; I have some explorations of TAS open data: cadastre, addresses, roads, etc. Collating sources is a very good plan. I think the syncing/reading is pretty well covered by other general tools, but I'm probably going to need to outline the bowerbird way-of-life to show why. (And maybe a good example for a shared VM to prepare...)

mdsumner commented 6 years ago

It looks as though the portal is primarily WMS (rendered images) and CSV, which is not much good. From a quick scan it looks as though going to state-based open data sources will be better, but I'm happy to be shown otherwise.

njtierney commented 6 years ago

Another resource that might be useful:

njtierney commented 6 years ago

@mdsumner I'm keen to see the bowerbird way of life! It would be great if we can determine a way to get the shapefiles out from these sources, or even if we can point to where they are stored so we can access them.

raymondben commented 6 years ago

An example of throwing bowerbird at a data.gov.au dataset:

devtools::install_github("AustralianAntarcticDivision/bowerbird")
library(bowerbird)

# Define the data source: the name, description, reference, citation and
# license fields are metadata for humans; source_url and method control
# what actually gets downloaded
my_source <- bb_source(
    name="Bike Paths - Greater Geelong",
    id="http://data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611",
    description="Polyline data of bike path locations for the City of Greater Geelong.",
    reference="https://data.gov.au/dataset/geelong-bike-paths",
    citation="Not provided, see https://data.gov.au/dataset/geelong-bike-paths ",
    source_url="https://data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611",
    license="CC-BY",
    method=quote(bb_handler_wget),
    # recurse one level from the dataset page, keeping only the download links
    method_flags=c("--recursive","--level",1,"--accept-regex=download","--adjust-extension"),
    postprocess=quote(bb_unzip),   # unzip whatever gets downloaded
    collection_size=0.002)         # approximate size in GB

# Point the configuration at a local data directory, add the source, and sync
cf <- bb_config("/temp/data") %>% bb_add(my_source)
bb_sync(cf)

This will create the local directory /temp/data/data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611 and mirror the data files associated with that data.gov.au entry. Here there are two subdirectories, one with a kml and the other with a shapefile (unzipped for you, ready to use).

I'm assuming that roughly the same template (with different source_url and other dataset-specific details) would work with other data.gov.au datasets as well. Some of the entries are metadata that are intended for humans to refer to (description, reference, citation, license).

We like bowerbird because (a) it's recursive, so you generally only need to specify the top-level directory, even if the data set contains many files; and (b) it will do incremental updates, so you can run the sync process again later and it will only download what has changed. The ckanr package offers another way of interacting with data.gov.au, but for data retrieval (assuming you know which data sets you want) we find bowerbird to be easier.
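
For comparison, a rough sketch of the ckanr route for finding datasets (the CKAN endpoint URL may need adjusting):

library(ckanr)

ckanr_setup(url = "https://data.gov.au/data")

# Search the catalogue for matching datasets
hits <- package_search(q = "geelong bike paths", rows = 5)
hits$results[[1]]$title

# List the downloadable resources attached to the first hit
pkg <- package_show(hits$results[[1]]$id)
vapply(pkg$resources, function(r) r$url, character(1))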

jonocarroll commented 6 years ago

FYI, ozdata (the data.gov.au part) never really got wrapped up because we hit a roadblock going down a path we probably didn't need to. I've since cleaned up the functionality and intend to have it in working order (if not on CRAN) before ozunconf17. Searching and mapping work fine now.

njtierney commented 6 years ago

Another data source to potentially look at - Queensland police data: https://www.police.qld.gov.au/online/data/default.htm

njtierney commented 6 years ago

@jonocarroll awesome! I think that for the scope of the ozunconf and to make it easier to maintain, it might be easiest to wrap up the data sources into individual packages and then get ozdata to import them?

ellisp commented 6 years ago

Progress in this space looks both useful and achievable. Also, I'm officially wearing a Stats NZ hat (metaphorically) at this conference, and there could be some useful suggestions / opportunities to feed back to New Zealand on this.

timchurches commented 6 years ago

Extending and/or generalising the Census2016 packages by Hugh Parsonage at https://github.com/hughparsonage/Census2016 and https://github.com/hughparsonage/Census2016.Datapack would be great - right down to SA1 level. As Hugh notes, ABS still don't seem to understand that most researchers want clean raw data, not data facsimiles of nicely presented tables with subtotals and totals and weird partial aggregations littered through them...

It probably isn't necessary to include the raw data in such packages, because it is all freely available online, and thus can be downloaded on-demand by functions in the package. ABS are reasonably good at keeping data resources available at specific URLs, once published (but some maintenance is inevitable). It may even be possible to spider and parse the ABS web site pages to dynamically determine data download URLs, which would be more robust.
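
A rough sketch of that download-on-demand pattern: a package function that fetches and caches the file on first use instead of shipping it with the package. The URL and table name are placeholders:

library(rappdirs)

fetch_census_table <- function(table = "SA1_AGE",
                               base_url = "https://example.abs.gov.au/census2016/") {
  # Cache downloads in a per-user directory so the package itself stays small
  cache_dir <- user_cache_dir("census2016")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, paste0(table, ".csv"))
  if (!file.exists(dest)) {
    download.file(paste0(base_url, table, ".csv"), dest, mode = "wb")
  }
  read.csv(dest)
}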

raymondben commented 6 years ago

Assorted comments (sorry for a very bowerbird focus)!

stevage commented 6 years ago

@dfalster: "As far as I can understand, much of that data is all pulled from data.gov.au"

NationalMap uses the CKAN API to list datasets in data.*.gov.au, but also has lots of other sources of data - manually listed datasets, the ABS SDMX API, various Esri services etc. Most of the "national datasets" are hand-curated.

stevage commented 6 years ago

Hey everyone, and sorry to chime in so late. (Had a very full-on week.) I used to work on NationalMap, and have been pretty active around the open data space in Australia for a few years, working with many government bodies at different levels. (I generally work on data that is "useful but boring", rather than data ripe for statistical analysis, machine learning, etc., however.)

But lots of the aggregators and links in the above thread are new to me - that's awesome.

Just to add to the list, I've been working on http://opencouncildata.github.io/Platform, which is another approach to aggregating data - it focuses on data that meet the Open Council Data Standards. The main relevance here would be some of the aggregated datasets, like the 500,000-odd trees that power opentrees.org, which might be of interest.

There is also Magda, which I think is meant to eventually replace CKAN as the registry for data.gov.au. It was just being started around the time I left Data61, so I don't know much about it.

Finally, one more interesting dataset you may like is http://github.com/stevage/BikeTrafficCounts, which is - well, read the README.

A dream I've had for a while is to map out the whole potential open data universe as some kind of grid, and start filling in the boxes, based on whether data exists and is public, exists but is not public, or does not exist. That is, instead of starting, like most catalogues, from the question of "what is available" and trying to organise those into some useful structure, I'd like to start from the question of "what do people want", and provide definitive answers like "that is not available". It should be possible to start at some high level like "water", and subdivide that domain into "freshwater > river levels > Yarra River > ..." But I'm a bit scared of the ontological work required to make that meaningful :)

(I do suspect that that approach, where you map out the domain, and draw attention to blanks, will yield useful pressure - much in the way that map.opencouncildata.org has been surprisingly useful at encouraging councils to join the open data movement.)
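
A rough sketch of how that grid could start life in R, as nothing fancier than a table of domain paths and availability statuses (the rows are purely illustrative):

library(tibble)
library(dplyr)

data_universe <- tribble(
  ~domain,                                ~status,
  "water/freshwater/river-levels/yarra",  "public",
  "water/freshwater/river-levels/other",  "exists-not-public",
  "transport/bike-traffic-counts",        "public",
  "transport/parking-occupancy",          "does-not-exist"
)

# Summarise the blanks that might be worth pressing agencies on
data_universe %>% count(status)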

Anyway, I'm really looking forward to supporting whatever project people want to work on, however best. (Caveat: I don't know R :) )

stephstammel commented 6 years ago

Steve, I think that's a really useful approach - a resource where people can see what's available, what could be had for the asking/pressing, and what's not would be useful across all sorts of domains.

stevage commented 6 years ago

(I should mention that there is the open data census, but it's really about scoring organisations on a very small number of datasets rather than actually facilitating access to data.)

katerobsau commented 6 years ago

In terms of Aussie data, I have been curious about Australian real estate prices, e.g. sold, rental, etc. I think there is definitely some interesting data mining and analysis that could be done there. @HughParsonage I see you've got a package for NSW property prices. Is this something that would be worth generalising to other parts of Aus, like Vic or Qld?

Also @stevage I like the idea of being able to look at what datasets are available for a given gridded location - so often we search by data type, not the other way around.

HughParsonage commented 6 years ago

@saundersk1 While it would certainly be worth generalising, I'm not aware that the other state governments have released such data.