incf-nidash / nidm-specs

Neuroimaging Data Model (NIDM): describing neuroimaging data and provenance
nidm.nidash.org

Example of validation in python #341

Open vsoch opened 8 years ago

vsoch commented 8 years ago

Are there examples of reading a provn file and using it to first validate a ttl file, and then extract the fields for use? I found examples of using the provn file to get the ttl path, but it seems like most reading of the file is just opening it as text and running regular expressions.

vsoch commented 8 years ago

Just some follow-up on this - I'm writing functions to extract entities, agents, etc. in python, and for the most part I can use http://www.w3.org/2000/01/rdf-schema#label to get a human-readable name, but for a subset of the terms that describe specific values (e.g., p-value) there is nowhere that I can figure out what the value is, at least programmatically. This is an issue because I can make an association between a peak having a value, and the peak being associated with other things, but I don't actually know what that value means.

And some general feedback - I've been working for about 36 hours now doing the simple task of getting a .ttl file automatically into a table, and it's impossible. I have used rdf and sparql before, I catch on to computer-y things quickly, and working with this data structure is highly annoying. The ideas behind such a standard are ok in theory, but terrible in practice.

For example, we are good at searching the internet not because Google told everyone to build their sites with a particular format or structure, but because tools were developed that were flexible enough to work with the formats and structures that were most easily integrated into workflows, and thus actually used.

satra commented 8 years ago

@vsoch - could you point to which ttl file you are working with? let's first make sure that this is in good shape. (and please post before you spend 36 hours! - for folks in my lab i have a 30 min rule :) )

vsoch commented 8 years ago

Hi @satra! The files I am working with are from our NIDM Showcase, and they are well-formed given the version of NIDM that each represents. The huge amount of time was due to the fact that I first developed my entire application for an older version, and then started over. I should have been more careful to double-check, but it made me re-think my parsing strategy toward one that would be more flexible across different versions of the files.

My strategy isn't great, and my overall experience was that nidm in turtle is incredibly challenging to work with. I really just needed to get it into some kind of "table thing" in an organized fashion, linking relationships, and it did (embarrassingly) take me the entire weekend. The 30 minute rule is very funny! I would be terrible at it :O)

First, I did actually get it working: it runs as a command line tool, or via functions in python to produce the same web viewer, or it can output html code to embed into a server somewhere. It works with the examples that I linked above, one of which is provided in the repo examples/fsl folder. I have not tested beyond that, and probably won't in the short term (at least until after I have watched some cartoons!)

I'd like to talk about my general strategy, because it likely needs some improvement, and I want to figure out where there is room to improve all things.

My goals were as follows:
My strategy was as follows:

So, the complexity of the above - and where this will eventually break - is in parsing the file into a dataframe. What we really want is just the entire thing organized and embedded as json. If this data were provided in tabular form, or even as a graph database like neo4j, it would have been eons easier! I don't know what it is - I've been exposed to RDF and sparql for years, but I find it really hard to use. I expect I'll be dreaming of spreadsheets with neat rows and column names this evening!

satra commented 8 years ago

@vsoch - thank you very much for this detailed reply. but before i respond to different things may i ask, why not use a sparql query directly on the turtle file, like here, which essentially results in a table that i pass to slickgrid:

https://github.com/incf-nidash/nidmviewer/blob/master/index.html#L62

this could be done in python directly using rdflib.
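for instance, a minimal sketch (the file name and the query here are placeholder assumptions, not taken from nidmviewer):

```python
# minimal sketch: load a ttl file and run a sparql query with rdflib.
# the file name and the query itself are placeholder assumptions.
import rdflib

g = rdflib.Graph()
g.parse("nidm.ttl", format="turtle")

query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entity ?label
WHERE {
    ?entity a prov:Entity ;
            rdfs:label ?label .
}
"""

# each result row unpacks into one value per SELECT variable
for entity, label in g.query(query):
    print(entity, label)
```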

for me the first thing i generally think of is the set of sparql queries i need to get at the relevant pieces, i.e. a meta api on the data set.

also to me neo4j data and an rdf dataset are somewhat synonymous, in that they both represent graphs. the primary differences being query format and implementation details.

vsoch commented 8 years ago

Querying would be a good idea, but how am I supposed to write a query if I don't know what the fields are to ask for?

They do both represent graphs, but neo4j would be a lot more intuitive to use. I might give that a go with a NIDM result.

satra commented 8 years ago

> Querying would be a good idea, but how am I supposed to write a query if I don't know what the fields are to ask for?

the following queries for a set of fields: https://github.com/incf-nidash/nidmviewer/blob/master/index.html#L74

and the fields and their relationships are described in this model: http://nidm.nidash.org/specs/nidm-results_110.html#nidm-results-core-structures

perhaps i am not understanding your question. so i'll clarify my interpretation.

  1. a nidm result ttl file is a dataset/graph database built using the nidm-result object model (i.e. a "schema")
  2. some subset of information from this can be turned into a table/dataframe.
  3. a query or a set of queries can extract such information.
  4. the query should be formulated using the fields defined and related by the object model.

perhaps these queries are a good starting point in figuring out the model?

https://github.com/incf-nidash/nidm/tree/master/nidm/nidm-results/query
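and as a concrete version of points 2-4 above, the result set of such a query maps directly onto a dataframe - a rough sketch (assuming rdflib and pandas, with "peaks.rq" standing in for a local copy of one of the queries linked above):

```python
# rough sketch: run a stored sparql query and load the result set into
# a pandas dataframe ("nidm.ttl" and "peaks.rq" are assumed local files).
import pandas as pd
import rdflib

g = rdflib.Graph()
g.parse("nidm.ttl", format="turtle")

with open("peaks.rq") as fp:
    results = g.query(fp.read())

# one column per SELECT variable, one row per query solution
df = pd.DataFrame(
    [[str(value) for value in row] for row in results],
    columns=[str(var) for var in results.vars],
)
print(df.head())
```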

vsoch commented 8 years ago

Will do. Thanks for the detailed help, it will be good for me to figure out how to do this properly.

nicholst commented 8 years ago

If it makes you feel any better, I recall Chris tried to cast NIDM into a relational database a while ago, and we successfully convinced him it was a bad idea.

And one other thing: You say SPARQL queries will be a challenge since you don't know what you're looking for, but it appears you do indeed have a narrow focus, on peaks and images. I have written all of, say, 3 SPARQL queries in my life, but the main nuisance with SPARQL as I see it isn't finding the individual entities you want, but establishing how they are all interrelated (e.g. a bucket full of xyz coordinates isn't useful if they're a mix of 3 different contrasts).


cmaumet commented 8 years ago

Hi @vsoch! Thanks for your effort on developing for NIDM! Your feedback is very valuable to us and will definitely help improve the documentation. I am assuming you are working with NIDM-Results, is that correct?

About provn documents, the issue is that the python prov toolbox that we use can write provn but cannot read it (yet). That's why we have been working with both provn and ttl.

As far as I can see from your code, you are trying to retrieve the labels associated with a particular term. Is this correct? If so, the manual fields you are referring to could be found and accessed programmatically by using the NIDM-Results ontology file and querying for the label associated with an identifier (e.g. NIDM_0000070 -> "Significant cluster"). But maybe I did not correctly get what you are after? Please let me know if this is not what you are looking for or if this does not make sense :)
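For example, a quick sketch of that lookup (the local file name, serialization, and namespace are my assumptions - please double-check them against the ontology):

```python
# sketch: look up the human-readable label of a NIDM identifier in the
# NIDM-Results ontology (file path, format, and namespace are assumptions).
import rdflib
from rdflib.namespace import RDFS

NIDM = rdflib.Namespace("http://purl.org/nidash/nidm#")

g = rdflib.Graph()
g.parse("nidm-results.owl", format="xml")  # assuming RDF/XML serialization

# e.g. NIDM_0000070 -> "Significant cluster"
for label in g.objects(NIDM["NIDM_0000070"], RDFS.label):
    print(label)
```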

Could you send again the link to the NIDM document you are looking at? I get a 404 error when trying to open your NIDM showcase link. Thanks!

These examples of larger queries in python might also be of interest to you.

vsoch commented 8 years ago

Hi everyone! Thanks for the great feedback, and in fact I haven't had a chance yet to re-implement the nidmviewer with sparql queries. @cmaumet - the ontology file is the missing piece where I could retrieve the labels, thank you for pointing me to that. I also fixed the link to the neurovault NIDM showcase (it didn't have http and was using a relative github link).

I want to bring this into the discussion, because I strongly believe that if we want to support a data structure that is widely used, it has to be easy to implement stuffs with it. There are many barriers to getting from the raw RDF file to something that is usable: figuring out what it is, how to parse it, installing modules, installing plugins, knowing that sparql exists, figuring out how to write queries, figuring out where the names for the queries come from. I am worried because I usually catch on to these things quickly, but I've been hitting myself in the head with RDF and sparql (at least since 2011) and I consistently run in the other direction.

@satra - you make a good point that these things are just general graphs. We have information that is represented in a graph, and arguably we should be unbiased as to the data structure that we choose to represent the graph. The standards are important - for example, having the unique IDs, the ontology, the existence of something like NIDM-Results, period - but for the data structure itself? I think that it should be simplified down to the most obvious, intuitive thing that is humanly possible. It is challenging if not impossible to represent relationships and graphs in a relational database, I definitely agree! However, there are beautifully simple data structures, specifically graph databases, that would give the user the ability to query, visualize, and work with the data all within the same data structure - no RDF parsing headaches or sparql required.

As an example, I have converted a turtle file into a neo4j graph. The graph structure isn't perfect because my parsing of the RDF is (still) buggy; however, I was able to hack together a simple script that deploys this web interface instantaneously. A few important notes:

For example, the query to retrieve the peaks and coordinates looks like this:

[screenshot: the Cypher query retrieving peaks and coordinates]

A node or relationship can be defined in the graph with commands like this:

  // create a coordinate node with its iri, label, and coordinate vector
  create (_82:coordinate { id: 'http://iri.nidash.org/62f3071d-6fb9-479e-af8c-f4b95c3d0c3a', name: 'Coordinate 0001_5', coordinatevector: '[ -3.77, -80.4, -13.0 ]' })
  // link a peak node (_85) to its coordinate (_79), both defined elsewhere in the script
  create (_85)-[:`ATLOCATION`]->(_79)

Making a "gist" on github just means having graph commands in a text file on github, and it renders the graph in the neo4j-gist (akin to ipython notebooks), and lets you show queries, have interactive querying in the browser, and even silly social media stuffs:

[screenshot: the rendered neo4j gist]

When you run it locally to work with a graph, there is a much more detailed interface for querying, downloading in pretty much any format you could want, etc. For example, here I am messing around with the Cognitive Atlas:

[screenshot: the local neo4j browser interface querying the Cognitive Atlas]

A user could basically run a command to parse some input file (in this case it was RDF, and not done correctly), generate a folder, and push that folder to github to immediately have their data accessible in this way.

In summary, I really want to build tools for NIDM-Results. It's really hard to do that because I can't get the data into the formats that are standard for building tools.

nicholsn commented 8 years ago

@vsoch it appears that we have come full circle! My initial experiments with creating and querying provenance graphs were done using neo4j/orientdb and the tinkerpop stack, with a query language called Gremlin that is similar to cypher =)

Also thanks a bunch for digging through all this and providing feedback!

I agree that neo4j is great for working with graphs and developing applications - I also like the object-oriented query language, where you can access attributes using a dot notation. Another thing I like is that edges can have properties - in RDF these are a pain, requiring something called Qualified Relations.

The part where property graph databases break down is that you are still creating a data silo with these systems, and there is no conformance to naming conventions or standards that enable a dataset to be self-describing. By using RDF/OWL, each term in NIDM can be looked up, including definitions, schema, etc., that were agreed upon through discussion/consensus - so it provides a nice way to create standards-driven data exchange.

To someone developing tools and hit with a steep learning curve, I can see where we are in a bit of trouble with this approach, so your feedback is very helpful! I took a second to try and recreate your example, and here is what I came up with as a notebook. A bit more verbose than using cypher, but not too crazy.

With RDF etc. there are also some tools similar to what you're showing with neo4j: for example, the Marmotta RDF store has a query interface that uses YASGUI, and Virtuoso has similar utilities that allow for downloading a variety of formats.

You make a good point about distilling NIDM, and RDF more generally, into a data format that is standard for building tools. For building an application, I see no reason not to parse rdf into a graph database, document store, or what have you - but I see this as a view that is extracted from RDF, and it is up to us to help create those views while people learn the technology.
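As a rough illustration of creating such a view (every naming choice below is an assumption for illustration, not part of NIDM), you could walk the triples with rdflib and emit a cypher create script in the same style as the gist snippet above:

```python
# illustrative sketch: turn the triples in a ttl file into a cypher create
# script, meant to be run together as one script (same style as the gist
# snippet above). file name, node ids, and relationship names are all
# assumptions; literal values are skipped for brevity.
import rdflib

g = rdflib.Graph()
g.parse("nidm.ttl", format="turtle")

def node_id(term):
    # derive a cypher-safe variable name (stable within this run)
    return "n%d" % abs(hash(term))

seen = set()
for s, p, o in g:
    # declare each uri node once, before any relationship references it
    for term in (s, o):
        if isinstance(term, rdflib.URIRef) and term not in seen:
            seen.add(term)
            print("create (%s {uri: '%s'})" % (node_id(term), term))
    # uri-to-uri triples become relationships named after the predicate
    if isinstance(o, rdflib.URIRef):
        rel = str(p).split("#")[-1].split("/")[-1].upper()
        print("create (%s)-[:`%s`]->(%s)" % (node_id(s), rel, node_id(o)))
```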

The neo4j demo is pretty cool and I like the idea of posting data as a gist, but there is nothing stopping that with RDF other than the large amount of VC funding that neo4j received =)

vsoch commented 8 years ago

haha, cool!

If we are too far down the rabbit hole to stray away from RDF, I think we need to make it a priority to first develop reliable tools that can do very basic things like read and query these files, and spit out formats like json and data frames from them. Said tools would make it very easy to extend "the stuff inside the RDF" to any desired format, neo4j being one of those possibilities, and while sparql would be "under the hood," I don't think the user should have to see or know it. I certainly am developing an allergy to it :)

I'll save some time this weekend for looking carefully at the notebook, thanks so much for doing that, it really helps with my learning! I am confident we can come up with a good solution... we don't have lots of $$$, but we have everything else that we need :O)

vsoch commented 8 years ago

On the other hand, that strategy is like wrapping something complicated in something that looks simple instead of just creating something simple...

Whatever strategy is chosen, it must make these things simpler. I will think more about it as well.

vsoch commented 8 years ago

A quick update! I tried out some of Nolan's code, and it works really nicely to parse into a data frame! I am not technically supposed to be working on this during the week so I didn't take it from start to finish, but it should be a trivial amount of work to transform the nidmviewer to use these queries.

Moving forward - I think we need to figure out how to read the ontology file itself to give some kind of interface for a user to build queries. Sparql may be on the back-end, but the user should basically see all the node types and all the relationships, be able to "ask" to retrieve something, and then have queries dynamically built to retrieve the output. And I want functions to parse the data into pretty d3 graphs too :)
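Something like this rough sketch is what I have in mind for that first step (the ontology file path and serialization are assumed):

```python
# rough sketch: enumerate the classes ("node types") and relationships
# from the ontology file so a user can pick what to query
# (file path and serialization are assumptions).
import rdflib
from rdflib.namespace import OWL, RDF, RDFS

g = rdflib.Graph()
g.parse("nidm-results.owl", format="xml")

# node types a user could ask about
for cls in g.subjects(RDF.type, OWL.Class):
    for label in g.objects(cls, RDFS.label):
        print("class:", cls, "->", label)

# relationships between them
for prop in g.subjects(RDF.type, OWL.ObjectProperty):
    for label in g.objects(prop, RDFS.label):
        print("relationship:", prop, "->", label)
```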

nicholsn commented 8 years ago

Glad that worked, @vsoch! One thing you may take a look at is the NDAR Concept Query interface (https://ndar.nih.gov/query_concept.html) - fairly rudimentary, but it may give you ideas. 100% agree that SPARQL should be hidden from the user, just like SQL is, or at least reserved for "power users".

This leaves me thinking that there should be some RESTful or client API sitting on top of these rdf object models, like nidm results, that exposes an easier-to-use format... These could be implemented as sparql queries, but then exposed as what app developers are used to. I am thinking something like http://api.example.com/v1/peaks that returns:

[
  {
    "@context": "http://foo.json",
    "peak_uri": "http://foo#bar",
    "label": "bat",
    "coordinates": "[0,3,4]"
  },
  {
  ...
  }
]

Of course even that would be hidden from an end user, but perhaps we can come up with some common pieces of info that app devs would want from nidm results and then simplify working with it in this manner.
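A minimal sketch of what such an endpoint could look like (flask, the file name, and the term names in the query are all assumptions - the real NIDM-Results identifiers should be looked up in the ontology, like the NIDM_0000070 example earlier):

```python
# minimal sketch of the endpoint described above. framework choice (flask),
# file name, and the term names in the query are assumptions; NIDM-Results
# actually uses numeric identifiers, so look up the real terms first.
import rdflib
from flask import Flask, jsonify

app = Flask(__name__)

g = rdflib.Graph()
g.parse("nidm.ttl", format="turtle")

PEAKS_QUERY = """
PREFIX nidm: <http://purl.org/nidash/nidm#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?peak ?label ?coord
WHERE {
    ?peak a nidm:Peak ;
          rdfs:label ?label ;
          prov:atLocation [ nidm:coordinateVector ?coord ] .
}
"""

@app.route("/v1/peaks")
def peaks():
    # run the query on each request and serialize the rows as json
    return jsonify([
        {"peak_uri": str(row.peak),
         "label": str(row.label),
         "coordinates": str(row.coord)}
        for row in g.query(PEAKS_QUERY)
    ])

if __name__ == "__main__":
    app.run()
```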

vsoch commented 8 years ago

A RESTful API is a great idea! Oh, that would be so much better.

You know, I had a nutty idea the other day to use github pages to serve a static (json) API. It would be generated from a standard file via a script, and then pushed to github pages. This could be a fantastic use case for that! What do you think?
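A sketch of what I mean (the query, folder layout, and file names are all hypothetical):

```python
# sketch of the static api idea: pre-serialize query results as json files
# in a folder layout that mirrors the endpoint paths, then push the folder
# to github pages. query, paths, and file names are hypothetical.
import json
import os

import rdflib

g = rdflib.Graph()
g.parse("nidm.ttl", format="turtle")

QUERY = """
SELECT ?s ?label
WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label . }
"""

os.makedirs("api/v1", exist_ok=True)
rows = [{str(k): str(v) for k, v in row.asdict().items()}
        for row in g.query(QUERY)]

# served as GET /api/v1/labels.json once pushed to github pages
with open("api/v1/labels.json", "w") as fp:
    json.dump(rows, fp, indent=2)
```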

vsoch commented 8 years ago

Oh, you mean an API to actually do the query and return the result, I see. That is also a great idea, but couldn't be served statically. I was thinking a static API to serve what fields are expected in the file, but that is probably not so useful.

My question then is - where would it be hosted?

nicholsn commented 8 years ago

I don't know if it could be served statically, unless there is some way to do it using something like rdfstore-js. I am curious though what a static API would look like - wouldn't you just pre-serialize the results ahead of time, or were you thinking it could be dynamically generated?

For hosting the API, I think we could chat with INCF ... or just get some VC funding =)


vsoch commented 8 years ago

I think we should not invite JavaScript to the party, at least for the important parts :) I have quite a bit on my plate for the end of the week, but it sounds like we have a plan in order! Will send updates the next chance I get some time to play around. Thanks for all the resources! (This is super fun :O))