pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Linked Datasets (RDF) #3402

Closed: westurner closed this issue 10 years ago

westurner commented 11 years ago

ENH: Linked Datasets (RDF)

(UPDATE: see https://github.com/westurner/pandas-rdf/issues/1)

Use Case

So I:

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html


Read or parse a data format into a DataSet:

Add metadata:

Save or serialize a DataSet into a data format:

Share or publish a serialized DataSet with the internet:

What changes would be needed for Pandas core to support this workflow?

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common set of terms/symbols/structures/schema for describing the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
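The "naive RDF" serialization mentioned above can be sketched in a few lines. This is an illustration, not a pandas API: the `to_triples` helper, the base URI, and the row/column URI layout are all made up for the example, and real output would need proper literal typing and escaping.

```python
import pandas as pd

def to_triples(df, base_uri="http://example.com/datasets/demo#"):
    """Naively serialize a DataFrame as N-Triples-style strings.

    Each row index becomes a subject URI, each column a predicate;
    values are emitted as plain (untyped, unescaped) literals.
    """
    triples = []
    for idx, row in df.iterrows():
        subject = "<%srow/%s>" % (base_uri, idx)
        for col, val in row.items():
            predicate = "<%scolumn/%s>" % (base_uri, col)
            triples.append('%s %s "%s" .' % (subject, predicate, val))
    return triples

df = pd.DataFrame({"gdp": [1.5, 2.0], "year": [2012, 2013]}, index=["us", "eu"])
lines = to_triples(df)  # 2 rows x 2 columns -> 4 triples
```

The hard part, as the next paragraph notes, is not this mechanical flattening but agreeing on the predicate vocabulary.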

Ontology Resources

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:


http://pandas.pydata.org/pandas-docs/dev/remote_data.html

Arguments to read_rdf would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
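As a sketch of that dimension-mapping problem: given RDF observations flattened to (subject, predicate, object) records, a hypothetical `read_rdf` would need to be told which terms become the index and which become columns. Pivoting with plain pandas shows the 2D case; the column names and values here are invented for illustration.

```python
import pandas as pd

# Flattened RDF-like observations: one (subject, predicate, object) per row.
triples = pd.DataFrame(
    [("us", "gdp", 16.8), ("us", "pop", 316),
     ("eu", "gdp", 17.4), ("eu", "pop", 507)],
    columns=["subject", "predicate", "object"],
)

# 2D form: subjects become the index, predicates become the columns.
# Higher-dimensional forms would need additional grouping terms.
frame2d = triples.pivot(index="subject", columns="predicate", values="object")
```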

@datastep / PROV

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

RDF Datatypes

JSON-LD (RDF in JSON)

Linked Data Abstractions

URIs and URLs

SQL and Linked Data

Named Graphs

Linked Data Formats

Choosing Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

Datasets have structure:

5 ★ Open Data http://5stardata.info/ http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG), accompanied by an explicit open license (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data

ghost commented 11 years ago

Hi,

Thanks for the thoroughly-researched idea-issue. Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge) to take over the world. Controversially, perhaps, I'm including RDF in that statement, even though it has certainly gotten a lot of attention and there are real services built on top of it (Freebase, OpenCalais, semantic search engines, and so on).

DataFrame metadata has come up again and again; please read through the (long) metadata discussion in #2495 to catch up on some of the issues already discussed.

#3297 is planned for 0.12, but has nothing to do with RDF and has very limited scope, since it's intended to answer a different use case. However, users would be free to embed their own JSON schemas under .meta, so it's somewhat open-ended.
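The proposed `.meta` attribute never shipped in pandas, but modern pandas (1.0 and later) provides `DataFrame.attrs`, an experimental plain dict that can play roughly the role described here. A hedged sketch; the SDMX measure URI is just an example term:

```python
import pandas as pd

df = pd.DataFrame({"gdp": [1.5, 2.0]})

# DataFrame.attrs (pandas >= 1.0) is an experimental dict of global
# metadata, roughly what the proposed .meta attribute would have been.
df.attrs["columns"] = {
    "gdp": {"schema": "http://purl.org/linked-data/sdmx/2009/measure#obsValue"}
}
```

Note that `attrs` propagation through operations is still documented as experimental, so it is a place to stash schemas, not a guarantee they survive every transform.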

The next step after that, embedding metadata in axis labels, is interesting, but right now isn't planned for a specific release, although I'm sure the `quantities` users would find it useful.

IMO, it's premature to bake these specs into pandas at this point in the life of the semantic web. Is there a fundamental reason why all this can't be done in an auxiliary package, on top of pandas?

That's my opinion, other devs may feel differently.

ghost commented 11 years ago

Bringing over comments made by @westurner in GH3297 :

https://www.google.com/search?q=sdmx+json http://json-stat.org

westurner commented 11 years ago

Thx.

westurner commented 11 years ago

From https://news.ycombinator.com/item?id=5657935 :

In terms of http://en.wikipedia.org/wiki/Linked_data , there are a number of standard (overlapping) URI-based schema for describing data with structured attributes:

westurner commented 11 years ago

@y-p

Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge) taken over the world. Controversially probably, I'm including RDF in that statement, which has certainly gotten a lot of attention and there are real services built on top of it (freebase, opencalais, semantic search engines and so on).

dr-leo commented 11 years ago

I stumbled upon this proposal while looking for SDMX tools that might help read economic data from Eurostat, the OECD, the IMF, the BIS, and the like. So a DataFrame.to_rdf method would need to be complemented by a read_sdmx function. Granted, the mentioned data providers offer CSV files as well, but the benefits of working with the XML- and EDIFACT-based formats described on http://sdmx.org/ are obvious.

I don't know what level of generality would be appropriate, or whether to support IO for just SDMX. But it might be interesting to look at Eurostat's SDMX Reference Implementation and the other material available at https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php.

Starting "small" with SDMX might be appropriate to do within pandas. A more general, semantic-web-focused approach can be studied at http://www.cubicweb.org.

benjello commented 11 years ago

It would definitely be great to be able to read data from Eurostat, the OECD, the IMF, and the BIS using pandas!

jreback commented 11 years ago

If someone is interested, they could follow the paradigm of pandas.io.wb.py (the World Bank dataset): basically, wrap functions that get the data and return a frame.
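A minimal sketch of that paradigm, with the network fetch stubbed out: `read_provider` is a hypothetical name, and a real reader would first download the data from the provider's API endpoint rather than accept a CSV string.

```python
from io import StringIO

import pandas as pd

def read_provider(raw_csv):
    """wb.py-style reader sketch: obtain provider data (here: already
    downloaded, passed in as text) and return an indexed DataFrame."""
    df = pd.read_csv(StringIO(raw_csv))
    return df.set_index("country")

# Stand-in for a response body from a hypothetical provider endpoint.
raw = "country,year,gdp\nus,2013,16.8\neu,2013,17.4\n"
gdp = read_provider(raw)
```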

westurner commented 11 years ago

read_sdmx would be great.

write_rdf would also be great. (to_triples)

TODO: re-topical-cluster globs of links in this thread. Here are three more:

westurner commented 11 years ago

It may well be easy enough to transform .meta to RDF.

The more challenging part is, IMHO, storing the procedural metadata while/in applying transforms to the Series, DataFrames, and Panels.

From a provenance and reproducibility standpoint: how do downstream users who are not reading the Python source which produced the calculations compare/review the applied data analysis methods (and findings) with RDF metadata?

[EDIT]

There should be a link to the revision id and/or version of the code in the .meta information.

westurner commented 11 years ago

General Ontology Resources:

dr-leo commented 11 years ago

All this looks very interesting.

Again, I recommend a deeper dive into CubicWeb, a web framework supporting RDF and other semantic web standards. It also implements a SparQL-like query language called RQL. Apart from reusing some of its core components it seems worth exploring whether in the long term CubicWeb could be used as a web front end for admin and representation tasks relating to datasets.

There is no doubt a lot of speculation in these statements. But we should avoid reinventing wheels.

The pandas.io.wb.py module is child's play compared to teaching pandas RDF. The latter goal should probably be pursued in a separate project such as pandas-rdf or pandas-sdmx, as has been suggested before. That said, I know nothing about the relationship between RDF and SDMX.

Writing pandas.io.eurostat.py, oecd.py, and bis.py modules along the lines of wb.py should not be too difficult, especially if one focuses on CSV-formatted data. Still, using SDMX could make the user's life much easier and richer.

To add both complexity and another link to the collection: some elements of the SDMX standard build on EDIFACT. Here,

https://pypi.python.org/pypi/bots-open-source-edi-translator/version%203.0.0

could come in handy.


westurner commented 11 years ago

TODO: re-work description toward something more actionable (research -> development)

westurner commented 11 years ago

@dr-leo The docs for cubicweb look outstanding.

westurner commented 11 years ago

@benjello

It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas !

Do you have a few links to API specs and/or Python implementations? AFAIK there is not yet an extension API for pandas IO and remote_data providers.

[[re: try/except imports / setuptools]]

benjello commented 11 years ago

@westurner: I am not an expert, but I think what is needed is a Python library to access SDMX content on the OECD, ECB, IMF, etc. data servers. I didn't find any yet, but there is plenty of documentation on SDMX.

dr-leo commented 11 years ago

The reference implementation of the SDMX framework, freely available on the Eurostat website, is written in Java. I am unaware of any other implementation.

The SDMX specification at sdmx.org is not rocket science, but covers several hundred pages.

You may want to set up a separate project, say, PySDMX, and spend some time understanding the reference implementation, dividing it into tractable chunks, and porting these to Python. PySDMX could then use pandas as a storage backend. It could also be designed so as to easily interface with CubicWeb and friends.

Maybe there are mailing lists on SDMX and its implementations where one could ask related questions and reach out to potential contributors.

Leo


jtratner commented 11 years ago

2 comments here:

  1. I'd encourage anyone interested in connecting SDMX to pandas to either submit a PR to pandas or work on a Python reader and then submit a PR to hook the package into pandas. That's the best way to get support for SDMX into pandas, especially if the spec is hundreds of pages and there's an option to load from CSV (and you'd end up with the same thing from loading SDMX into pandas vs. loading CSV into pandas).
  2. Special support for RDF is not within scope for pandas right now both because of what @y-p said and because it's not clear how users would want to use it (particularly if this is complex enough to require query languages). I'd imagine that you could keep the RDF descriptors in a column (or whatever you need to use for comparison) and then use those descriptors to traverse after you're finished transforming the data.

If you need help hooking an already-written SDMX reader into pandas, feel free to ask.

westurner commented 11 years ago

SDMX

RDF Data Cube Vocabulary

[EDIT] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/vocab/

jtratner commented 11 years ago

@westurner why do you post many links with no summary of them? Not really helpful for narrowing things down.

westurner commented 11 years ago

@jtratner Sorry about the noise: I find it easier to get the research together first (without Markdown or reStructuredText formatting).

I am working on a more implementation-focused description for this issue. This appears to be a strong candidate for a meta-ticket, which I understand is not usually especially helpful. This may very well belong outside of core (import rdflib, most likely), but this seems to be a good place to coordinate efforts. To be clear, I have no working generalized implementation of this: I have one-offs for specific datasets, and that seems wasteful. A read_rdf and a to_html5_rdfa could be very helpful.

Storing columnar RDF dataset metadata out-of-band from Series.meta and DataFrame.meta is the easiest thing to do right now.

For the meantime, for reference, above are links to SDMX and (newer, more comprehensive) RDF Data Cube Vocabulary standards.

jtratner commented 11 years ago

@westurner okay, that's helpful :) (and it's much more understandable if that's your process for working towards something). By the way, metadata does (kind of) have some support now: if you add a property to a subclass and add the property name to _metadata (e.g., _metadata = ['rdf']), it will generally be moved over to new objects.
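The `_metadata` mechanism described here is real and still documented for pandas subclasses. A minimal sketch; the `RDFFrame` name and the attached JSON-LD context are invented for the example:

```python
import pandas as pd

class RDFFrame(pd.DataFrame):
    # Attribute names listed in _metadata are propagated by __finalize__
    # to the results of most operations on the subclass.
    _metadata = ["rdf"]

    @property
    def _constructor(self):
        # Ensure operations return RDFFrame instead of plain DataFrame.
        return RDFFrame

df = RDFFrame({"a": [1, 2, 3]})
df.rdf = {"@context": "http://www.w3.org/ns/csvw"}

sub = df[df["a"] > 1]  # the rdf attribute generally rides along
```

As the comment says, propagation is "generally" rather than universal: some operations still drop `_metadata` attributes, so it is best treated as annotation, not as load-bearing state.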

What I don't understand from everything you've laid out is what you're looking for with read_rdf (to_html5_rdfa actually seems pretty straightforward once you know where data is stored). Are you looking to get data + the associated RDF triple with it? Or keep all of the RDF data from the file you've read in? Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.

westurner commented 11 years ago

By the way, metadata does (kind of) have some support now: if you add a property to a subclass and add the property name to _metadata (e.g., _metadata = ['rdf']), it will generally be moved over to new objects.

What I don't understand from everything you've laid out is what you're looking for with read_rdf

A read_rdf may have to be a bit more schema- and query-opinionated (i.e., read_sdmx_rdf, read_which_datacube_rdf).

(to_html5_rdfa actually seems pretty straightforward once you know where data is stored).

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L557 (HTMLFormatter)
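Rather than patching HTMLFormatter, a to_html5_rdfa could also be prototyped outside pandas by walking the frame directly. A hedged sketch: the function name, the vocabulary URI, and the choice of RDFa attributes are all illustrative, and real output would need HTML escaping.

```python
import pandas as pd

def to_html5_rdfa(df, vocab="http://example.com/vocab#"):
    """Illustrative (not a pandas API): render a DataFrame as an HTML
    table with RDFa property attributes on each cell."""
    rows = []
    for idx, row in df.iterrows():
        cells = "".join(
            '<td property="%s%s">%s</td>' % (vocab, col, val)
            for col, val in row.items()
        )
        rows.append('<tr about="#row-%s">%s</tr>' % (idx, cells))
    return '<table vocab="%s">%s</table>' % (vocab, "".join(rows))

html = to_html5_rdfa(pd.DataFrame({"gdp": [1.5]}, index=["us"]))
```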

Are you looking to get data + the associated RDF triple with it?

Like more granular than to_triples? I can't think of a specific use case ATM, but that might also be helpful.

Or keep all of the RDF data from the file you've read in?

More so this, I think. ETL [+ documentation] -> Publish

Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.

:+1:

westurner commented 10 years ago

I have:

westurner commented 10 years ago

https://github.com/mhausenblas/web.instata

Turn your plain old tabular data (POTD) into Web data with web.instata: it takes CSV as input and generates an HTML document with the data items marked up with Schema.org terms.

westurner commented 10 years ago

CSV on the Web Working Group Charter http://www.w3.org/2013/05/lcsv-charter.html

Data on the Web Best Practices Working Group Charter http://www.w3.org/2013/05/odbp-charter.html

5 ★ Open Data http://5stardata.info/ http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

ghost commented 10 years ago

@westurner , you haven't posted any new links in a while. is everything ok?

westurner commented 10 years ago

Stayin' alive. I'll close this for now?

mmalter commented 10 years ago

I am working on exactly @dr-leo's proposal for a pysdmx module. I will release something on GitHub in a few days.

We need to figure out a way to expose the keys of a time series. Should I subclass DataFrame to provide additional properties? I know pandas makes heavy use of new.

dr-leo commented 10 years ago

Hi,

I am very pleased to read this and will certainly test it asap.

I am afraid I don't understand the background of your question on exposing time series keys.

Leo


westurner commented 10 years ago

I am now thinking that the easiest approach here, for columnar metadata in pandas (this is an open problem with CSV and most tabular/spreadsheet formats), would be dataframe.meta['columns'][column_id].
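Kept out-of-band, such a structure could double as a JSON-LD @context for the published dataset. A sketch under those assumptions: `meta` here is a plain dict rather than a pandas attribute, and the SDMX measure URI is only an example term.

```python
import json

import pandas as pd

df = pd.DataFrame({"gdp": [1.5, 2.0]})

# Out-of-band columnar metadata, keyed the way the comment suggests:
# dataset -> 'columns' -> column id.
meta = {
    "columns": {
        "gdp": {
            "@id": "http://purl.org/linked-data/sdmx/2009/measure#obsValue",
            "dtype": str(df["gdp"].dtype),
        }
    }
}

# Derive a JSON-LD @context mapping column names to vocabulary URIs.
context = json.dumps(
    {"@context": {c: m["@id"] for c, m in meta["columns"].items()}}
)
```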

As mentioned earlier, this is probably not a job for pandas; but for an external "pandas-rdf".

westurner commented 10 years ago

See: @dr-leo

westurner commented 8 years ago

Added: