isetbio / RemoteDataToolbox

Matlab utilities for managing reading and writing of data files stored on a remote web server.

RemoteDataToolbox architecture #2

Closed DavidBrainard closed 9 years ago

DavidBrainard commented 9 years ago

Ben has written up a little document with a proposal about how to engineer the RemoteDataToolbox to take advantage of existing tools that are out there. Apparently the rubric for what we want to create an interface with is an "artifact repository" (as distinct from a "code repository").

Ben's document is a google doc so you can add comments. https://docs.google.com/document/d/1qONDMs8fUn7Qqd2Nk4F0j0k3hzjb9CxO-zt-05MKo44/edit#

As you'll see at the end of the document, Ben got a prototype version of his proposed solution working with a server on the other side of his office, so it should be quite workable.

Ben can charge away on this, but the key question first is whether we all agree that it is the right proposal. In particular, if we go with the archiva system on the server side, we will want to be sure that we can set that up on the scarlet/crimson server. Similarly if we go with some of the alternate server side solutions.

I have pinged the UPenn server people to see if they have opinions on this topic, but I think we may be out in front of what they have thought through very deeply. Probably Stanford has thought more.

In the meantime, Ben will make sure that he can write data to the server as well as read it, which seems like a key piece of functionality in the long run.

chichilnisky commented 9 years ago

This is really cool and seems well-considered. I much appreciate the architecture diagram at the bottom.

Questions:

- does this setup work well with other languages and development environments?
- have we decided which actual server(s) to host things on? seems like we should use just one.
- how do we annotate/document items on the Maven server in a clean way for browsing?
- related, is there a system for metadata? this will be important for users to find stuff.

ej


benjamin-heasly commented 9 years ago

Thanks!

> does this setup work well with other languages and development environments?

This setup was born in the Java world, so it's most optimized for Java projects and dev tools. But I had no trouble adapting it for our Matlab world. I think we could similarly adapt to other worlds, perhaps Python.

I think it will be easy to adapt as long as we're focused on publishing and fetching data. If we find ourselves trying to use other features, like managing complex builds, then we will run into language-specific issues and probably need language-specific tools.

> have we decided which actual server(s) to host things on? seems like we should use just one.

The Gradle tool is happy to point at any Maven repository we give it. I think this is a nice separation between client and server.

So if we end up moving the server around, we won't have to reengineer the client.
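To make that concrete, the only server-specific part of a Gradle client config is the repository URL, so switching servers would be a one-line change. A rough sketch (the URL below is made up, not our actual server):

```gradle
repositories {
    maven {
        // Hypothetical repository URL; swap in scarlet/crimson or
        // wherever we end up hosting the Archiva instance.
        url 'http://some-server.example.edu/repository/our-repository'
    }
}
```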

> how do we annotate/document items on the Maven server in a clean way for browsing?

Archiva would give us some basic browsing with artifact descriptions. See here.

If we needed deeper documentation we could put a wiki link in the description. We could also include the documentation as a separate artifact.

> related, is there a system for metadata? this will be important for users to find stuff.

Maven repositories can include a .pom file with each artifact. This is XML metadata.

One interesting kind of metadata we can put in the .pom is "this artifact depends on that artifact". The tools can do cool things like return a requested artifact along with its transitive dependencies.

One way we could use this feature would be to associate two artifacts where one is data and one is documentation about the data.
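For example, the .pom for a data artifact might declare its companion documentation artifact as a dependency, roughly like this (all coordinates here are made up for illustration):

```xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <!-- Coordinates of the data artifact itself (hypothetical) -->
  <groupId>test-group</groupId>
  <artifactId>my-data</artifactId>
  <version>1</version>
  <dependencies>
    <!-- Hypothetical companion artifact holding documentation -->
    <dependency>
      <groupId>test-group</groupId>
      <artifactId>my-data-docs</artifactId>
      <version>1</version>
    </dependency>
  </dependencies>
</project>
```

Fetching my-data with transitive dependencies would then bring along my-data-docs automatically.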

benjamin-heasly commented 9 years ago

A quick update. I also did a proof of concept for publishing an artifact from client to server. Here is the code: https://github.com/benjamin-heasly/gradle-fetch-poc/blob/master/README.md#publish.

fmrieke commented 9 years ago

Ben -

This looks great and we will certainly need it. A related issue is that we will want to be able to query the remote data and then grab the results. This way we can search for experimental results of a particular type, then run models to compare or fit to those.

Fred


benjamin-heasly commented 9 years ago

I did some query investigation.

Archiva supports queries through its interactive web UI, so curious users would be able to poke around. It also supports queries through a programmatic REST API.

Queries are outside the scope of what Gradle does, so queries from a Matlab client would be a RemoteDataToolbox feature that we write ourselves.

We can do targeted searches like "does an artifact exist with these exact coordinates?"

We can also do fuzzy searches based on free text matching. I ran an example query using free text "42". I got back two hits from my test repository. One of the hits matched on the artifact version number. The other hit matched on part of the version number. So it looks like this kind of search is pretty inclusive, which seems handy.

For the curious, and for future reference, here is the query I ran and the JSON response.

curl -v -u admin:password "http://localhost:8080/restServices/archivaServices/searchService/quickSearch?queryString=42"
[
  {
    "context": "test-repository",
    "url": "http:\/\/localhost:8080\/repository\/test-repository\/pringles\/ohno\/4.2.42\/ohno-4.2.42.md",
    "groupId": "pringles",
    "artifactId": "ohno",
    "repositoryId": "test-repository",
    "version": "4.2.42",
    "prefix": null,
    "goals": null,
    "bundleVersion": null,
    "bundleSymbolicName": null,
    "bundleExportPackage": null,
    "bundleExportService": null,
    "bundleDescription": null,
    "bundleName": null,
    "bundleLicense": null,
    "bundleDocUrl": null,
    "bundleImportPackage": null,
    "bundleRequireBundle": null,
    "classifier": null,
    "packaging": "md",
    "fileExtension": "md",
    "size": null,
    "type": "md",
    "path": null,
    "id": null,
    "scope": null
  },
  {
    "context": "test-repository",
    "url": "http:\/\/localhost:8080\/repository\/test-repository\/test-group\/test-id\/42\/test-id-42.txt",
    "groupId": "test-group",
    "artifactId": "test-id",
    "repositoryId": "test-repository",
    "version": "42",
    "prefix": null,
    "goals": null,
    "bundleVersion": null,
    "bundleSymbolicName": null,
    "bundleExportPackage": null,
    "bundleExportService": null,
    "bundleDescription": null,
    "bundleName": null,
    "bundleLicense": null,
    "bundleDocUrl": null,
    "bundleImportPackage": null,
    "bundleRequireBundle": null,
    "classifier": null,
    "packaging": "txt",
    "fileExtension": "txt",
    "size": null,
    "type": "txt",
    "path": null,
    "id": null,
    "scope": null
  }
]
DavidBrainard commented 9 years ago

I had done a little reading recently about what Matlab supports and I think it has some built-in stuff for RESTful queries. This may or may not be useful, just passing it on as something to check before we write our own code.

benjamin-heasly commented 9 years ago

I found Matlab's webread() and webwrite(). These look really handy. I was able to do some Archiva queries with them just now.

One nice feature: automatic conversion of Matlab structs to and from Json.
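For reference, the quickSearch query from the curl example above might look roughly like this from Matlab (same local test server and admin credentials assumed; this is a sketch, not tested against a real deployment):

```matlab
% Hypothetical sketch: quickSearch against a local Archiva instance.
options = weboptions('Username', 'admin', 'Password', 'password');
results = webread( ...
    'http://localhost:8080/restServices/archivaServices/searchService/quickSearch', ...
    'queryString', '42', ...
    options);
% webread converts the JSON response to a struct array with fields
% like groupId, artifactId, and version.
```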

benjamin-heasly commented 9 years ago

In addition, it looks like we can use the Json converter for our own purposes, independent of webread() and webwrite():

>> data = matlab.internal.webservices.fromJSON('{"foo":"bar"}')
data = 
    foo: 'bar'
>> json = matlab.internal.webservices.toJSON(data)
json =
{"foo":"bar"}
wandell commented 9 years ago

Yes, these are very good. Though there was a transition from earlier versions in which the functions were called urlread and urlwrite. The web<> versions are more recent.

The conversion to and from JSON is also very convenient. We were using libraries for this that were external (but provided by Mathworks). So, again, it is a question of version number and how far back we want to stay compatible.

We have a west coast JSON bias. I think there is an east coast XML bias. I find culture interesting.

Brian

benjamin-heasly commented 9 years ago

Gotcha.

Looks like web* go back to R2014b. This seems awfully recent if we hope to share the toolbox with lots of people.

Looks like url* go back to R2006a and are still present in 2015b. This seems nicer for sharing.

I guess same goes for Json conversions. JSONlab would be nicer for sharing than the matlab.internal functions. I see JSONlab is already included with the RemoteDataToolbox.
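For comparison with the matlab.internal example above, the JSONlab equivalents would be roughly:

```matlab
% JSONlab is already bundled with RemoteDataToolbox.
data = loadjson('{"foo":"bar"}');  % struct with field foo = 'bar'
json = savejson('', data);         % serialize back to a JSON string
```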

I am one east-coaster who prefers Json. I think Xml is only better when there's a complex structure to the data, because Xml comes with Xsd for describing schemas.

benjamin-heasly commented 9 years ago

Blergh, according to doc urlread, "urlread is not recommended. Use webread or webwrite instead."

That's somewhat annoying. Now it's unclear to me which functions we should use. Here is what I propose:

- Prefer webread()/webwrite() when they exist, and fall back to urlread()/urlwrite() on releases before R2014b.
- Use JSONlab for Json conversions instead of the matlab.internal functions.

wandell commented 9 years ago

I think your plan on the fallback from web<> to url<> is very sensible. Maybe it should be implemented as ieWebread and ieWebwrite, with an internal check on whether webread/webwrite exist.
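A minimal sketch of that check, using the ieWebread name Brian suggests (the name and behavior here are illustrative, not final):

```matlab
function data = ieWebread(url, varargin)
% Prefer webread (R2014b+); fall back to urlread on older releases.
% Note: urlread returns the raw response as a string and does not
% accept name-value query arguments, so on old Matlab versions the
% caller may need to build the full query URL by hand.
if exist('webread', 'file') > 0
    data = webread(url, varargin{:});
else
    data = urlread(url);
end
end
```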

I also agree that JSONlab is simple to understand compared to the code in your example.

So, agreed on all points.

Thanks, Brian

benjamin-heasly commented 9 years ago

Discussion has moved to #4 .