incf-nidash / nidm-query

A repository of NIDM queries

Human readable names for queries #13

Open chrisgorgo opened 8 years ago

chrisgorgo commented 8 years ago

I was wondering if we could switch to a more user-friendly naming scheme for queries. Currently long, uninformative hashes are used, which makes code that uses them (for example via nidm-api) difficult for humans to parse.

We can alternatively wait until singularity ;)

vsoch commented 8 years ago

I also like this idea. For example, for experiment factory experiments we use our experiment "tag" as the unique ID, and it is very intuitive and easy to find what you are looking for. It would also be easy to include some kind of version in the name, in case there are multiple versions of the same query.

nicholsn commented 8 years ago

I'm not sure it matters since it is just the filename; as soon as you open the file, there are all the human-readable titles and descriptions. I kinda like the consistency of using uuids and pushing all the readable information into the file metadata. It seems redundant to have that info duplicated.

vsoch commented 8 years ago

Let's say you are a developer and you want to update your query. You go to the repo and there are 100 of them...

chrisgorgo commented 8 years ago

Exactly, it makes reading code harder.


nicholsn commented 8 years ago

personally, after there are 100 of them I wouldn't remember the file name anyways and would end up parsing the files and printing out titles and descriptions to find what I was looking for.

nidm list queries --> prints table of filename, title, description
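A minimal sketch of what such a subcommand could do, assuming each query lives in a JSON file carrying `title` and `description` fields (the field names and layout here are illustrative, not the actual nidm-query schema):

```python
import json
import os

def list_queries(directory):
    """Collect (filename, title, description) for every JSON query file
    in `directory` and print them as a simple table."""
    rows = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(directory, name)) as f:
            meta = json.load(f)
        rows.append((name, meta.get("title", ""), meta.get("description", "")))
    for row in rows:
        print("%-45s %-30s %s" % row)
    return rows
```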


chrisgorgo commented 8 years ago

What is easier to understand:

results = do_query("7950f524-90e8-4d54-ad6d-7b22af2e895d")

or

results = do_query("get_peak_coordinates")

vsoch commented 8 years ago

I am in agreement with @chrisfilo. It is a detail that will make development much easier. And since the files all need to exist in the same folder, that alone ensures uniqueness of names.

nicholsn commented 8 years ago

well I don't understand either without looking at the query, but I see what you mean, insofar as the filenames in the query library are curated to be informative. I still think including relevant metadata is important to describe the query when it is not being accessed programmatically.

Initially, I had a metadata file that indexes all the queries, called meta.ttl (pardon the ttl, it could be json: https://github.com/nicholsn/niquery/blob/master/niquery/sparql/meta.ttl), and I was thinking that queries would be accessed more interactively so you could hide the uuid.


chrisgorgo commented 8 years ago

Yes, we should keep the metadata, but using human-readable names will make developers' lives easier.

nicholsn commented 8 years ago

sure, go for it. can you two decide on a recommended style?

for example all-lowercase-with-hyphens-and-version-1.0.0.json


vsoch commented 8 years ago

I would say use underscores, because you can't use hyphens in python function names.

  all_lowercase_with_version_1.0.0.json
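For what it's worth, a convention like that is easy to enforce mechanically. Here is a hypothetical checker for the underscore-plus-semver scheme (the exact rules are of course up for debate):

```python
import re

# Hypothetical pattern for the proposed convention:
# lowercase words joined by underscores, ending in _MAJOR.MINOR.PATCH.json
QUERY_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*_\d+\.\d+\.\d+\.json$")

def is_valid_query_name(filename):
    """Return True if `filename` follows the proposed naming scheme."""
    return bool(QUERY_NAME.match(filename))
```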

satra commented 8 years ago

this sounds just like our terms discussions. readable names simply don't scale. i would really like to see us build tools that query the metadata quickly or provide interactive interfaces for editing queries. perhaps we are not at the point yet, but there is a reason why issues on github, questions on stack overflow and google docs all don't have readable names. (stack overflow uses a slug for readability, but the id is what makes things unique)

so instead of punting on the tension between scalability and readability, i think we should put in the effort during the upcoming sprint to build tools that let us address this (independent of what the query url looks like): for example, a web service/api/command line tool for querying queries.

nicholsn commented 8 years ago

+1


chrisgorgo commented 8 years ago

Issues on github, questions on stack overflow, and google docs are all examples of instances. Indeed, that's where numeric identifiers make sense. However, we are talking about queries, which are considered methods. Those should have human-readable names, and all of the examples you gave opt for such a solution. For example, the path for editing a comment on stackoverflow is:

http://stackoverflow.com/posts/34729781/edit

it is NOT

http://stackoverflow.com/posts/34729781/39834-343-683

where 39834-343-683 would correspond to the edit function. That is just not practical. Similar examples can be given for programming languages, where functions and methods have human-readable names.

satra commented 8 years ago

isn't a query-id simply an instance of a query? if so, all i'm suggesting is that we provide something like:

nidm.nidash.org/query/query-id/edit

as a web service, or something equivalent for other things

i don't think we are talking of queries as methods here (i can see how they can be seen as such, but i don't think of them that way). anyone can create a query and we will have a collection of queries that an api/web service can call, but they are still instances (they have versions, they will apply to certain versions of the model, they will only work on certain versions of data, etc.).

vsoch commented 8 years ago

The nidm-api by default serves a REST API, and the current format to view a query is:

  http://localhost:8088/api/7950f524-90e8-4d54-ad6d-7b22af2e895d

and this generates:

[screenshot of the API response omitted]

The issue still comes up about how the developer finds the query_id. To have to do that extra step every time, and to have to provide more methods to look up / search with the API does not make sense when we can just use strings with underscores that a human can remember.

There are two use cases right now for the API. Either someone uses the REST API and must make a call like the above to retrieve the query and do something with it, or the developer uses our python tool to do the query. The second looks like this:

First we retrieve all queries in a dictionary, with lookup key the unique id

  from nidm.query import Queries, do_query
  all_queries = Queries()
  results = Queries(components="results")  # only queries for the "results" component

Then we would just need to know the qid; figuring it out adds an extra annoying step every single time.

  # Select a query from results that we like
  qid = "7950f524-90e8-4d54-ad6d-7b22af2e895d"

  # Here is a ttl file that I want to query, nidm-results
  ttl_file = "nidm.ttl"

  result = do_query(ttl_file=ttl_file,query=results.query_dict[qid]["sparql"])

The result is a pandas data frame. I would even suggest we simplify the above further to be more like what @chrisfilo suggested:

  results = do_query("get_peak_coordinates", ttl_file=ttl_file)

In the eyes of the developer, the query is a method. It is run to retrieve a particular result object. The purpose of the nidm-api, period, is to extend NIDM to developers. This means making it as easy as possible for them to use. Insisting on a long string of letters and numbers, with the only justification being that it scales better, is not logical; in fact it makes life a lot harder for the exact audience we are intending this tool for. It also makes it harder for the people writing the query objects. If I go to the github repo now to find the "get_peak_coordinates" query, where is it? It's not intuitive. Scalability might be an issue if these things were made en masse in an automated way, but they aren't. We are going to have a limited set because they are made by humans. This means they can be given names that make sense. I do not see any benefit in having such cryptic names when the entire purpose is to make this more user friendly.
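One way to get the readable-name ergonomics without touching the uuid scheme at all would be a small lookup shim over the metadata the queries already carry. The `resolve_query` helper and the `name` field below are hypothetical, and `query_dict` mirrors the structure used in the snippet above:

```python
def resolve_query(query_dict, name):
    """Map a human-readable name to its uuid by scanning query metadata.

    `query_dict` is assumed to look like {qid: {"name": ..., "sparql": ...}};
    the "name" field is hypothetical, not part of the current schema.
    """
    matches = [qid for qid, meta in query_dict.items()
               if meta.get("name") == name]
    if len(matches) != 1:
        raise KeyError("expected exactly one query named %r, found %d"
                       % (name, len(matches)))
    return matches[0]

# Usage sketch, following the do_query example above:
#   qid = resolve_query(results.query_dict, "get_peak_coordinates")
#   result = do_query(ttl_file=ttl_file, query=results.query_dict[qid]["sparql"])
```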

satra commented 8 years ago

given that the nidm-api is not just about nidm-results, the set of possible queries one can make is immense, especially as we allow people to fork/modify queries (through whatever interface, not necessarily a script).

The issue still comes up about how the developer finds the query_id.

in any scenario where the number of queries exceeds a handful of known ones, a developer will have to look into the metadata of a query to find the query-id, or run a query to find a query-id using some matching criteria. get_peak_coordinates only works when there is one. as an example, what if i simply wanted to constrain this to FSL results, or to FSL results processed with FSL > 5.0.5, or any other set of constraints? in any such event, the number of queries increases at a rapid rate (a la jsfiddle or gists). and this is just speaking about get_peak_coordinates; just the number of queries that i have run around nidm results for freesurfer, coupled with other phenotypic data, would go beyond a handful.

i completely agree that if the goal of nidm-api is to only expose a finite set of specific queries, those should simply be methods of the API. but if the goal is to run a generic method like do_query and expect a set of different datatypes (arrays, dataframes, graphs) depending on the query, then we really have to think further. in fact, in the former scenario do_query itself should be called something else, like get_peak_coordinates; in the latter scenario, do_query needs to be able to return different datatypes. i personally think that nidm-api can only be as generic as the gdata api; anything more specific (such as get_peak_coordinates) becomes modules on top of the base api.

if a developer has to use a query the developer needs to understand the nuances of the query, and no amount of human-readable naming is going to help the developer. that is why i prefaced my previous post by saying, independent of how the query-id looks we really need to have tools to search through the set of queries and for forking/editing said queries.

i'm completely for the api being easy for developers. what i'm speaking against is the notion that naming a few queries to be human readable is the solution to the problem.

vsoch commented 8 years ago

in any scenario where the number of queries exceeds a handful of known ones, a developer will have to look into the metadata of a query to find the query-id, or run a query to find a query-id using some matching criteria. get_peak_coordinates only works when there is one. as an example, what if i simply wanted to constrain this to FSL results, or to FSL results processed with FSL > 5.0.5, or any other set of constraints?

Isn't that what variables are for? The queries can have specific variables.
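To make that concrete, here is one minimal way a query could expose variables, sketched with a stdlib string template; the SPARQL fragment and the `nidm:softwareName` predicate are illustrative only, not taken from the actual nidm-query files:

```python
from string import Template

# Illustrative parameterized query: the constraint on the analysis
# software is left as a $software variable to be filled in at call time.
PEAKS_BY_SOFTWARE = Template("""
SELECT ?x ?y ?z WHERE {
  ?peak a nidm:Peak ;
        prov:wasGeneratedBy ?activity .
  ?activity nidm:softwareName "$software" .
}
""")

# Fill in the variable to produce a concrete query string.
query_text = PEAKS_BY_SOFTWARE.substitute(software="FSL")
```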

in the latter scenario, do_query needs to be able to return different datatypes.

The datatype returned is not integrated into the query; the user selects the datatype to be returned as an argument of the do_query function in the nidm-api. The API will always retrieve the output of the query in some format and parse it into what the user wants.

if a developer has to use a query the developer needs to understand the nuances of the query,

I disagree. If I am a developer all I need is to know the data that I want to retrieve from the input file (such as turtle nidm-results) and the arguments that I can give.

we really need to have tools to search through the set of queries and for forking/editing said queries.

I think that is why we have them on github - to implement our own version of forking / editing seems like re-inventing the wheel. I agree a search function added to the nidm-api to search through the query data structures would be neat.

i'm completely for the api being easy for developers. what i'm speaking against is the notion that naming a few queries to be human readable is the solution to the problem.

I don't think I am suggesting it is a "solution"; it's just that the uuids make it a little bit harder for people who just want to query some nidm-object to retrieve the data they need.

nicholsn commented 8 years ago

Isn't that what variables are for? The queries can have specific variables.

true, but you may want to lock in a query to fixed parameters rather than making it a template.

The datatype returned is not integrated into the query; the user selects the datatype to be returned as an argument of the do_query function in the nidm-api. The API will always retrieve the output of the query in some format and parse it into what the user wants.

If you write a CONSTRUCT query, a graph is returned; ASK returns a boolean; and SELECT returns a table... I think that might be what @satra is referring to. Also, the format of the output should/can be handled using content negotiation on the server side. The client shouldn't necessarily be required to handle this (e.g., csv, ttl, json-ld), but in some cases it makes total sense (e.g., dataframes).

I disagree. If I am a developer all I need is to know the data that I want to retrieve from the input file (such as turtle nidm-results) and the arguments that I can give.

This is true for an API endpoint, but queries feel a bit more malleable than that.

... It's kind of interesting: should a query really be thought of as a method/function or something else? What I had in mind for nidm-api is something much more flexible and dynamic, but it sounds like what you are after is a traditional API that has a very limited scope for specific functionality.

I think that is why we have them on github - to implement our own version of forking / editing seems like re-inventing the wheel. I agree a search function added to the nidm-api to search through the query data structures would be neat.

right, github could be the backend, but what about a frontend for forking and editing queries, like: http://xiphoid.biostr.washington.edu:8080/QueryManager/QueryManager.html#qid=71

I don't think I am suggesting it is a "solution"; it's just that the uuids make it a little bit harder for people who just want to query some nidm-object to retrieve the data they need.

i guess it's a tradeoff. I would suspect anyone who 'just wants to query some nidm-object' wouldn't be happy with either uuids or filenames and would want a tool to help them sift through, but for now we have like 6 queries... so...

cmaumet commented 8 years ago

Very interesting discussion!

in any scenario where the number of queries exceeds a handful of known ones, a developer will have to look into the metadata of a query to find the query-id, or run a query to find a query-id using some matching criteria. get_peak_coordinates only works when there is one. as an example, what if i simply wanted to constrain this to FSL results, or to FSL results processed with FSL > 5.0.5, or any other set of constraints?

This relates to one point that is not entirely clear to me right now: how do we handle variants of the same query within nidm-query? For example, the get_peak_coordinates query has already existed in several "flavours", e.g. also returning optional peak FWER, also returning statistic type, also returning contrast name... To be extreme, we could go all the way to the top of the tree and include the type of HRF that was used. The question is where do we stop, and how do we decide which of those variants is the one we want in nidm-query? Or do we want all of them?

Isn't that what variables are for? The queries can have specific variables.

@vsoch: this could be part of the solution, but I am not clear how specific variables could be defined for a given query. Could you give me more details or, even better, a small example of what you had in mind?