Closed: westurner closed this issue 10 years ago
Hi,
Thanks for the thoroughly-researched idea-issue. Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge) to take over the world. Perhaps controversially, I'm including RDF in that statement, even though it has certainly gotten a lot of attention and there are real services built on top of it (Freebase, OpenCalais, semantic search engines, and so on).
DataFrame metadata has come up again and again; please read through the (long) metadata discussion in #2495 to catch up on some of the issues already discussed.
It's intended to answer a different use case. However, users would be free to embed their own JSON schemas under `.meta`, so it's somewhat open-ended.
The next step after that, embedding metadata in axis labels, is interesting, but right now isn't planned for a specific release, although I'm sure `quantities` users would find it useful.
IMO, it's premature to bake these specs into pandas at this point in the life of the semantic web. Is there a fundamental reason why all this can't be done in an auxiliary package, on top of pandas?
That's my opinion, other devs may feel differently.
Bringing over comments made by @westurner in GH3297 :
https://www.google.com/search?q=sdmx+json http://json-stat.org
Thx.
From https://news.ycombinator.com/item?id=5657935 :
In terms of http://en.wikipedia.org/wiki/Linked_data , there are a number of standard (overlapping) URI-based schema for describing data with structured attributes:
@y-p
Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge) to take over the world. Perhaps controversially, I'm including RDF in that statement, even though it has certainly gotten a lot of attention and there are real services built on top of it (Freebase, OpenCalais, semantic search engines, and so on).
I stumbled upon this proposal while looking for SDMX tools that might help read economic data from Eurostat, the OECD, the IMF, the BIS and the like. So a DataFrame.to_rdf method would need to be complemented by a read_sdmx function. Granted, the mentioned data providers offer CSV files as well, but the benefits of working with XML- and EDIFACT-based formats such as those described at http://sdmx.org/ are obvious.
I don't know what level of generality would be appropriate for IO of just SDMX. But it might be interesting to look at Eurostat's SDMX Reference Implementation and the other material available at
https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php
Starting "small" with SDMX might be appropriate to do within pandas. A more general semantic-web-focused approach can be studied at http://www.cubicweb.org.
It would definitely be great to be able to read data from Eurostat, the OECD, the IMF and the BIS using pandas!
If someone is interested, they could follow the paradigm of pandas.io.wb.py (the World Bank dataset): basically, wrap functions that get the data and return a frame.
`read_sdmx` would be great.
`write_rdf` (`to_triples`) would also be great.
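The wb.py-style pattern (wrap a data fetch, return a frame) can be sketched roughly as follows. The payload shape here is a made-up simplification for illustration, not actual SDMX-JSON, and `read_series_json` is a hypothetical name:

```python
import pandas as pd

def read_series_json(payload):
    """Toy sketch of the wb.py pattern: take a provider's JSON payload
    (here a hypothetical {series_id: {date: value}} mapping) and return
    a DataFrame indexed by date, one column per series."""
    frame = pd.DataFrame(payload)
    frame.index = pd.to_datetime(frame.index)
    return frame.sort_index()

# In a real reader, `payload` would come from an HTTP request to the provider.
payload = {
    "GDP": {"2011-01-01": 100.0, "2012-01-01": 103.0},
    "CPI": {"2011-01-01": 2.1, "2012-01-01": 2.4},
}
df = read_series_json(payload)
```

A real `read_sdmx` would additionally have to map SDMX dimensions and attributes onto the index and columns, which is where most of the work lies.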
TODO: re-cluster the globs of links in this thread by topic. Here are three more:
`schema:` (XSD, semantic web background)

It may well be easy enough to transform `.meta` to RDF. The more challenging part is, IMHO, storing the procedural metadata while applying transforms to `Series`, `DataFrame`s, and `Panel`s.
From a provenance and reproducibility standpoint: how do downstream users who are not reading the Python source which produced the calculations compare/review the applied data analysis methods (and findings) with RDF metadata?
[EDIT] There should be a link to the revision id and/or version of the code in the `.meta` information.
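A minimal sketch of recording the code revision in a metadata dict. `tag_provenance` and the `code_revision` key are hypothetical names, and the git call assumes the analysis runs from a git checkout (hence the fallback):

```python
import subprocess

def tag_provenance(meta, fallback="unknown"):
    """Hypothetical helper: record which code revision produced the data.

    Assumes we are inside a git checkout; falls back gracefully otherwise."""
    try:
        rev = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            stderr=subprocess.DEVNULL,
        ).strip().decode()
    except Exception:
        rev = fallback
    meta["code_revision"] = rev
    return meta

meta = tag_provenance({"title": "GDP by country"})
```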
General Ontology Resources:
All this looks very interesting.
Again, I recommend a deeper dive into CubicWeb, a web framework supporting RDF and other semantic web standards. It also implements a SPARQL-like query language called RQL. Apart from reusing some of its core components, it seems worth exploring whether, in the long term, CubicWeb could be used as a web front end for admin and presentation tasks relating to datasets.
There is no doubt a lot of speculation in these statements. But we should avoid reinventing wheels.
The pandas.io.wb.py module is child's play compared to teaching pandas RDF. The latter goal should probably be pursued in a separate project such as pandas-rdf or pandas-sdmx, as has been suggested before. That said, I know nothing about the relationship between RDF and SDMX.
Writing pandas.io.eurostat.py, oecd.py, and bis.py modules along the lines of wb.py should not be too difficult, especially if one focuses on CSV-formatted data. Still, using SDMX could make the user's life much easier and richer.
To add to both the complexity and the link collection: some elements of the SDMX standard build on EDIFACT. Here,
https://pypi.python.org/pypi/bots-open-source-edi-translator/version%203.0.0
could come in handy.
On 25.07.2013 22:28, Wes Turner wrote:
General Ontology Resources:
TODO: re-work description toward something more actionable (research -> development)
@dr-leo The docs for cubicweb look outstanding.
@benjello
It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas !
Do you have a few links to API specs and/or Python implementations? AFAIK there is not yet an extension API for pandas IO and remote_data providers.
@westurner: I am not an expert, but I think what is needed is a Python library to access SDMX content on the OECD, ECB, IMF, etc. data servers. I haven't found any yet, but there is plenty of documentation on SDMX.
The reference implementation of the SDMX framework, freely available on the Eurostat website, is written in Java. I am unaware of any other implementation.
The SDMX specification at sdmx.org is not rocket science, but covers several hundred pages.
You may want to set up a separate project, say, PySDMX, and spend some time to understand the reference implementation, divide it into tractable chunks and port these to Python. PySDMX could then use pandas as a storage backend. It could also be designed so as to easily interface with CubicWeb and friends.
Maybe there are mailing lists on SDMX and its implementations where one could ask related questions and reach out for potential contributors.
Leo
On 20.10.2013 12:01, Mahdi Ben Jelloul wrote:
@westurner https://github.com/westurner: I am not an expert, but I think what is needed is a Python library to access SDMX content on the OECD, ECB, IMF, etc. data servers. I haven't found any yet, but there is plenty of documentation on SDMX.
2 comments here:
If you need help hooking in an already-written SDMX reader into pandas, feel free to ask.
SDMX
RDF Data Cube Vocabulary
[EDIT] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/vocab/
@westurner why do you post many links with no summary of them? Not really helpful for narrowing things down.
@jtratner Sorry about the noise: I find it easier to get the research together first (without Markdown or reStructuredText formatting).
I am working on a more implementation-focused description for this issue. This appears to be a strong candidate for a meta-ticket, which I do understand is not usually specifically helpful. This may very well belong out of core (`import rdflib`, most likely), but this seems to be a good place to coordinate efforts. To be clear, I have no working generalized implementation of this: I have one-offs for specific datasets, and it seems wasteful. A `read_rdf` and a `to_html5_rdfa` could be so helpful.
Storing columnar RDF dataset metadata out-of-band from `Series.meta` and `DataFrame.meta` is the easiest thing to do right now.
For the meantime, for reference, above are links to SDMX and (newer, more comprehensive) RDF Data Cube Vocabulary standards.
@westurner okay - that's helpful :) [and it's much more understandable if that's your process for working towards something] - btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and add the property name to `_metadata`, e.g. `_metadata = ['rdf']`, it will generally be moved over to new objects).
What I don't understand from everything you've laid out is what you're looking for with read_rdf (to_html5_rdfa actually seems pretty straightforward once you know where data is stored). Are you looking to get data + the associated RDF triple with it? Or keep all of the RDF data from the file you've read in? Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.
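The `_metadata` mechanism described above can be sketched as a small subclass. The `rdf` attribute name here is just an example, and `_constructor` is overridden so that derived objects keep the subclass type:

```python
import pandas as pd

class RDFFrame(pd.DataFrame):
    # names listed in _metadata are copied onto new objects by __finalize__
    _metadata = ["rdf"]

    @property
    def _constructor(self):
        # so that slicing/filtering returns RDFFrame, not a plain DataFrame
        return RDFFrame

df = RDFFrame({"a": [1, 2, 3]})
df.rdf = {"unit": "USD"}      # custom metadata travels with the frame
subset = df[df["a"] > 1]      # metadata is carried over to the result
```

Coverage is not universal (concatenation and some binary ops may drop it), so tests like `Generic.test_metadata_propagation` are worth studying.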
btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and add the property name to `_metadata`, e.g. `_metadata = ['rdf']`, it will generally be moved over to new objects).
- https://github.com/pydata/pandas/blob/master/pandas/tests/test_generic.py#L235 (`Generic.test_metadata_propagation`)
- https://github.com/pydata/pandas/blob/master/pandas/tests/test_generic.py#L227 (`Generic.check_metadata`)

What I don't understand from everything you've laid out is what you're looking for with read_rdf
A `read_rdf` may have to be a bit more schema- and query-opinionated (i.e. `read_sdmx_rdf`, `read_which_datacube_rdf`).
(to_html5_rdfa actually seems pretty straightforward once you know where data is stored).
https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L557 (`HTMLFormatter`)
Are you looking to get data + the associated RDF triple with it?
Like, more granular than `to_triples`? I can't think of a specific use case ATM, but that might also be helpful.
Or keep all of the RDF data from the file you've read in?
More so this, I think. ETL [+ documentation] -> Publish
Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.
:+1:
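That naive approach might look like the sketch below; the example.org URIs and triple strings are made up for illustration:

```python
import pandas as pd

# Naive sketch: keep each row's raw RDF triples alongside the parsed values,
# as a list of strings in a separate column of the DataFrame.
df = pd.DataFrame({
    "population": [5327, 8917],
    "rdf": [
        ['<http://example.org/NO> <ex:population> "5327" .'],
        ['<http://example.org/SE> <ex:population> "8917" .'],
    ],
})
```

Nothing here interprets the triples; it just keeps provenance within reach of the tabular data.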
I have:
https://github.com/mhausenblas/web.instata
Turn your plain old tabular data (POTD) into Web data with web.instata: it takes CSV as input and generates a HTML document with the data items marked up with Schema.org terms.
CSV on the Web Working Group Charter http://www.w3.org/2013/05/lcsv-charter.html
Data on the Web Best Practices Working Group Charter http://www.w3.org/2013/05/odbp-charter.html
5 ★ Open Data http://5stardata.info/ http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
@westurner, you haven't posted any new links in a while. Is everything ok?
Stayin' alive. I'll close this for now?
I am currently working on dr-leo's proposal for a pysdmx module. I will release something on GitHub in a few days.
We need to figure out a way to expose the keys of a time series. Should I subclass DataFrame to provide additional properties? I know pandas makes heavy use of `__new__`.
Hi,
I am very pleased to read this and will certainly test it asap.
I am afraid I don't understand the background of your question on exposing Timeseries keys.
Leo
On 27.02.2014 14:32, Michaël Malter wrote:
I am currently working on dr-leo's proposal for a pysdmx module. I will release something on GitHub in a few days.
We need to figure out a way to expose the keys of a time series. Should I subclass DataFrame to provide additional properties? I know pandas makes heavy use of `__new__`.
`column.name`
`column.meta.unit`
`column.meta.precision`

I am now thinking that the easiest approach here -- for columnar metadata in pandas (this is an open problem with CSV and most tabular/spreadsheet formats) -- would be `dataframe.meta['columns'][column_id]`.
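Since `.meta` does not exist in pandas, here is a sketch of the same idea using `DataFrame.attrs`, a real (if experimental) per-object dict in modern pandas; the `columns`/`unit`/`precision` keys are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"gdp": [1.2, 3.4]})

# DataFrame.attrs plays the role of the proposed `.meta` in this sketch:
# a dict attached to the object that survives many (not all) operations.
df.attrs["columns"] = {
    "gdp": {"unit": "USD trillions", "precision": 1},
}

unit = df.attrs["columns"]["gdp"]["unit"]
```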
As mentioned earlier, this is probably not a job for pandas; but for an external "pandas-rdf".
Added:
ENH: Linked Datasets (RDF)
(UPDATE: see https://github.com/westurner/pandas-rdf/issues/1)
Use Case
So I:
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story

As a data analyst, I would like to share or publish `Series`, `DataFrame`s, `Panel`s, and `Panel4D`s as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

http://pandas.pydata.org/pandas-docs/dev/io.html
Read or parse a data format into a DataSet:
pandas.read_*
read_clipboard
read_csv
read_excel
read_fwf
read_gbq
read_hdf
read_html
read_json
read_msgpack
read_pickle
read_sql
read_stata
read_table
pandas.HDFStore
Add metadata:
Save or serialize a DataSet into a data format:
pandas.DataFrame.
to_csv
to_dict
to_excel
to_gbq
to_html
to_latex
to_panel
to_period
to_records
to_sparse
to_sql
to_stata
to_string
to_timestamp
to_wide
Share or publish a serialized DataSet on the internet:
- GET/POST to `/container/filename.csv` (`.json` | `.xml` | `.xls` | `.rdf` | `.html`)
- `python -m SimpleHTTPServer 8088`
Implementation

What changes would be needed for Pandas core to support this workflow?

- `.meta` schema
- `to_rdf` for Series, DataFrames, Panels, and Panel4Ds
- `read_rdf` for Series, DataFrames, Panels, and Panel4Ds
- `@datastep` process decorators
- `DataSet`
- `DataCatalog` of precomputed aggregations/views/slices (`.meta`?)

.meta schema

It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize with a common set of terms/symbols/structures/schema for describing the tabular, hierarchical data which pandas is designed to handle.
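A naive table-to-triples serializer might look like the sketch below; `to_triples` and the example.org URI layout are assumptions for illustration, not a proposed standard:

```python
import pandas as pd

def to_triples(df, base="http://example.org/dataset/"):
    """Naive sketch: one triple per cell, using hypothetical example.org
    URIs for row subjects and column predicates."""
    for row_id, row in df.iterrows():
        subject = f"<{base}row/{row_id}>"
        for col, value in row.items():
            yield (subject, f"<{base}column/{col}>", f'"{value}"')

df = pd.DataFrame({"gdp": [1.2], "cpi": [2.1]})
triples = list(to_triples(df))
```

This is exactly where a shared vocabulary (e.g. `qb:` for data cubes) would replace the made-up predicate URIs.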
There is currently no standard method for storing columnar metadata within Pandas (e.g. in `.meta['columns'][colname]['schema']`, or as a JSON-LD `@context`).

Ontology Resources
- RDFS (`rdfs:`)
- OWL (`owl:`)
- CSV2RDF (`csvw`)
- W3C PROV (`prov:`)
- schema.org (`schema:`)
- W3C RDF Data Cube (`qb:`)

to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments: `fmt`.
Series.meta
Series.to_rdf()
DataFrame.meta
DataFrame.to_rdf()
Panel.meta
Panel.to_rdf()
Panel4D.meta
Panel4D.to_rdf()
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
Series.read_rdf()
DataFrame.read_rdf()
Panel.read_rdf()
Panel4D.read_rdf()
Arguments to `read_rdf` would need to describe which dimensions of data to read into 1D/2D/3D/4D form.

@datastep / PROV
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
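A hypothetical `@datastep` decorator could record each transform for later PROV-style export. Everything here (`PROV_LOG`, the recorded keys) is a sketch, not an existing API:

```python
import datetime
import functools

PROV_LOG = []  # hypothetical in-memory provenance record

def datastep(func):
    """Hypothetical @datastep decorator: log which transform ran and when,
    so the trail can later be exported as PROV-style metadata."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROV_LOG.append({
            "activity": func.__name__,
            "ended_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        return result
    return wrapper

@datastep
def normalize(values):
    # an example "data step": scale values so they sum to 1
    total = sum(values)
    return [v / total for v in values]

shares = normalize([1, 1, 2])
```

A fuller version would also hash inputs and outputs and record the code revision, per the reproducibility rules cited above.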
DataCatalog
A collection of Datasets.
DataCatalog = dict(that=df1, this=df1.group().apply(), also_this=df2)
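Under that idea, a DataCatalog could start as nothing more than a mapping from names to precomputed views; the names and the grouping are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({"region": ["n", "n", "s"], "value": [1, 2, 3]})

# sketch: a DataCatalog as a plain mapping from name to dataset/view
catalog = {
    "raw": df1,
    "by_region": df1.groupby("region")["value"].sum(),
}
```

A real implementation would also carry per-entry metadata (provenance, schema) rather than bare objects.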
Units support
RDF Datatypes
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
JSON-LD (RDF in JSON)
Linked Data Primer
Linked Data Abstractions
graph.triples((None, None, None))
SELECT ?s ?p ?o WHERE { ?s ?p ?o }
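The `graph.triples((None, None, None))` pattern (rdflib's API, mirrored by `SELECT ?s ?p ?o` in SPARQL) can be illustrated without rdflib by a tiny wildcard matcher over tuples; the `ex:` triples are made up:

```python
def triples(store, pattern):
    """Minimal stand-in for rdflib's graph.triples((s, p, o)):
    None acts as a wildcard, mirroring ?s ?p ?o in SPARQL."""
    s, p, o = pattern
    for triple in store:
        if ((s is None or triple[0] == s) and
                (p is None or triple[1] == p) and
                (o is None or triple[2] == o)):
            yield triple

store = [
    ("ex:NO", "ex:population", "5327"),
    ("ex:SE", "ex:population", "8917"),
]
all_triples = list(triples(store, (None, None, None)))
norway = list(triples(store, ("ex:NO", None, None)))
```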
URIs and URLs
urn:
uuid:
SQL and Linked Data
Named Graphs
GRAPH ?g
Linked Data Formats
Choosing Schema
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
Shared or published DataSets have URLs:
DataSets are about certain things:
DataSets are derived from somewhere, somehow:
Datasets have structure:
5 ★ Open Data http://5stardata.info/ http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
https://en.wikipedia.org/wiki/Linked_Data