Open Workflows (Software/process demo): Reproducible Research Objects with DataLad #25

Open jsheunis opened 4 years ago

Reproducible Research Objects with DataLad

By Adina Wagner, Institute for Neuroscience and Medicine, Brain and Behavior (INM-7), Juelich Research Centre

Theme: Open Workflows
Format: Software/process demo

Abstract

DataLad makes it easy to link code, arbitrary amounts of data, software environments, procedures used for computations, and the results in a lightweight and easily shareable format, provenance-tracked and version-controlled. This makes it possible to create reproducible research objects of any level of elaborateness: from “only” joining data and code, up to completely executable “reproducible paper”-type publications, hosted openly as public repositories on hosting services such as GitHub, GitLab, or Gin. In this demonstration, I will walk through a DataLad-centric analysis workflow using the Human Connectome Project (HCP) data, featuring

consuming HCP data with DataLad,
reproducible, re-executable, and provenance-tracked data analyses with DataLad,
and open dissemination of data, workflows, and results in a public repository.
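
The three steps above could look roughly like this with DataLad's Python API. This is only a minimal sketch, not the code shown in the demo: the function name, its parameters, and the "inputs" directory are hypothetical, and it assumes DataLad is installed, that the source URL points to a cloneable dataset, and that a sibling to publish to has already been configured.

```python
from pathlib import Path


def analysis_workflow(workdir, source_url, input_path, command, output_path, sibling=None):
    """Sketch: consume data, run a provenance-tracked command, optionally publish."""
    # Imported lazily so this sketch can be loaded without DataLad installed.
    import datalad.api as dl

    # 1. Create the analysis dataset (a Git/git-annex repository).
    ds = dl.create(path=str(workdir))

    # 2. Register the input data as a subdataset under "inputs" (hypothetical
    #    location). File content is not downloaded at this point.
    ds.clone(source=source_url, path="inputs")

    # 3. Run the command. DataLad retrieves the listed inputs first, then
    #    records the command, its inputs, and its outputs in the Git history.
    ds.run(
        cmd=command,
        inputs=[str(Path("inputs") / input_path)],
        outputs=[output_path],
        message="Run analysis on retrieved inputs",
    )

    # 4. Publish to an already configured sibling (assumption), if one is given.
    if sibling is not None:
        ds.push(to=sibling)
    return ds
```

A call might pass the URL of the HCP dataset as source_url and the name of a GitHub, GitLab, or Gin sibling as sibling. Note that access to the HCP data itself requires accepting the HCP data usage terms and obtaining credentials.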

Useful Links

http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html

Tagging @adswa

This is to confirm the talk :) (sorry, I missed the follow-up e-mail...)

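The slides are available here (https://docs.google.com/presentation/d/1KzSJv9j-NwGOZv3dwuM4bgaDQmDbjIQvp8eQ3cOfysw/edit#slide=id.gc6f980f91_0_29), for anyone who is interested, and a write-up with further pointers is here (http://handbook.datalad.org/en/latest/code_from_chapters/OHBM_OSR.html#ohbmosr2020).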

Adina, thanks for sharing, this looks like such great work

Being a data modeller, I wonder: where is the data structure? Can the dataset be viewed in a table format, or is this an impossible thing to ask? Is the data stored in a relational database in any way? Can the data/program/model be visualized in any other way than by using docker/code?

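Hi @Starborn, no, there is no relational database involved. It all builds on Git and git-annex (https://git-annex.branchable.com/). There is a hands-on introduction in the chapter on datasets in the DataLad handbook (http://handbook.datalad.org/en/latest/basics/basics-datasets.html).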

As I'm not very familiar with relational databases I may be misunderstanding your questions, so please bear with me and re-ask if necessary ;-)

"where is the data structure?"

In the case of the HCP data, the original data comes from the Amazon S3 buckets of the HCP project. Once locally available, the data is stored in each dataset's "object tree", a key-value store of git-annex within .git/annex/objects of each dataset. There are details on this in this handbook section (http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html) and a technical overview in the git-annex documentation (https://git-annex.branchable.com/internals/).
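
To make the "object tree" idea a bit more concrete, here is a toy sketch in plain Python. It is not git-annex itself: the function names are made up, the key only imitates the style of git-annex's MD5E backend (backend, size, checksum, extension), and the flat directory layout is a simplification, since real git-annex nests objects in hashed subdirectories.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def annex_like_key(content: bytes, suffix: str) -> str:
    """Build a key in the style of the MD5E backend: MD5E-s<size>--<md5><ext>."""
    digest = hashlib.md5(content).hexdigest()
    return f"MD5E-s{len(content)}--{digest}{suffix}"


def add_file(dataset: Path, relpath: str, content: bytes) -> str:
    """Store content under its key and leave a symlink in the working tree."""
    key = annex_like_key(content, Path(relpath).suffix)
    # Simplification: one flat directory instead of hashed subdirectories.
    obj = dataset / ".git" / "annex" / "objects" / key
    obj.parent.mkdir(parents=True, exist_ok=True)
    obj.write_bytes(content)
    link = dataset / relpath
    link.parent.mkdir(parents=True, exist_ok=True)
    # Relative symlink from the file name to the stored object.
    os.symlink(os.path.relpath(obj, link.parent), link)
    return key


with tempfile.TemporaryDirectory() as tmp:
    ds = Path(tmp)
    key = add_file(ds, "sub-01/anat.txt", b"voxel data")
    link = ds / "sub-01" / "anat.txt"
    # The file in the working tree is only a pointer; the content lives
    # in the object store and is reached through the symlink.
    assert link.is_symlink()
    assert link.read_bytes() == b"voxel data"
    print(key)
```

The point of the design is that the file name in the dataset and the file content are decoupled: the name is tracked in Git, while the content is looked up by its key.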

"Can the data/program/model be visualized in any other way than by using docker/code?"

I'm a bit unsure what exactly you are referring to. :) Docker/Singularity is only required to attach a software environment to the data and code in the dataset, and for the execution of commands inside of this software environment. It's not necessary to do this, and by no means necessary for visualizing any dataset contents. If your question is whether there is a GUI, then no, not for DataLad. Everything happens as command line calls or via the Python API.

I think it would be a bit easier for me to answer if I understood what you are interested in (the "data" aspect of it, i.e., getting HCP data? The reproducible execution aspect? The version control aspect? ...). In any case, the user documentation http://handbook.datalad.org/en/latest/index.html and the technical docs http://docs.datalad.org/en/stable/ may be a good resource to browse.

Thanks a lot for the extensive reply. You answer some of the questions, and I need to study the links you point to. My perspective is from a knowledge/problem domain modelling point of view: assume I want to build knowledge models and processes based on the data/workflows available, so that I can try to query the data to answer different questions. I guess in the first instance I need to understand where is the data, but you say your container workflow is not about that. So does a Docker container automatically infer the structure of the data? How/where does your workflow port to the data? (I haven't gotten my head around containers yet, and I seem to come across a lot of Python, which I don't use yet, but cannot easily find tables/data in any way that I can relate to.) So, does your system use each data file individually rather than a set of files/records in a database? Sorry, I am repeating myself. Thank you for bearing with me while I find my way through this. A GUI would be nice for command-line-averse folks like me :-)

I think I understand a bit better where you are coming from, thanks for clarifying :)

First of all, leave containers out of the equation for now. They're certainly useful to understand, but not the starting point.
"I guess in the first instance I need to understand where is the data": I would recommend reading the chapter http://handbook.datalad.org/en/latest/basics/basics-datasets.html to get a general idea, and taking a look at git-annex's documentation for more technical stuff. To put it simply: The data can be anywhere (a webtorrent, an S3 bucket, a Dropbox account, a private webserver, ...) but its location is registered in a dataset. On demand, it can be retrieved in precise versions from this location and is then locally available on your machine.
"So, does your system use each data file individually rather than a set of files/records in a database?": I guess it isn't wrong to phrase it like this. DataLad only knows about files and folders; everything happens at the level of individual files in a dataset. It is completely unrelated to any database-related approach.
"I want to build knowledge models and processes based on the data/workflows available, so that I can try to query the data to answer different questions": Everything that is done to the data in a dataset is stored in the Git history. Maybe this is a useful starting point for queries. And the Git history can be visualized with many existing GUIs. The development of a GUI for DataLad is not actively in progress at the moment. But the DataLad handbook is written in a way that you do not need to be familiar with Python or the command line to read it, so I'm hopeful that this resource can give a comprehensive understanding of the tool. :)

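As a rough illustration of using the Git history as a starting point for queries: commits created by `datalad run` carry a small JSON record in their commit message, between two marker lines. The sketch below pulls that record out of a message with plain Python. The example commit message, its file names, and the function name are made up for illustration; only the marker lines and the general shape of the record follow what `datalad run` writes.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def parse_run_record(commit_message: str):
    """Return the JSON run record embedded in a commit message, or None."""
    if START not in commit_message or END not in commit_message:
        return None
    body = commit_message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


# Hypothetical commit message, shaped like one written by `datalad run`.
example = """[DATALAD RUNCMD] Compute summary statistics

=== Do not change lines below ===
{
 "cmd": "python code/summary.py",
 "exit": 0,
 "inputs": ["inputs/sub-01/T1w.nii.gz"],
 "outputs": ["results/summary.csv"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = parse_run_record(example)
print(record["cmd"], "->", record["outputs"])
```

In a real dataset, the commit messages could be read with `git log --format=%B` and each record collected into a table of commands, inputs, and outputs.
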
Thanks a lot for the considerate reply.

I am going to be online and will try to get my mind around this.

This is not only in relation to your work, but generally to much of the brain data/apps which I am seeing coming up. The starting point for me is resources like these ontologies:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2695392/

https://bioportal.bioontology.org/ontologies/CNO

Can/should the brain data be mapped to an ontology of some sort? Is it done or not? If not, why not? If yes, why can I not see it?

Thank you

Ah! Maybe http://nidm.nidash.org/ is what you are looking for?

That helps. It would help if you could explain where your workflow sits in that particular NIDM stack, and also kindly show where in your code the pointer to the file or range of files is. Thanks a lot!!! Working towards understanding the universe.

I'm not too knowledgeable in this domain, so I'd suggest you contact the team around NIDM (links and pointers on the website). This particular workflow of mine isn't concerned at all with NIDM. All the best!

Thanks, will do

Thanks for the talk, Adina. I guess it's OK if Meghan and I start thinking about a GUI for DataLad?

hihi, sure Paola!

ohbm / osr2020