ohbm / osr2020

Website for the Open Science Room at the OHBM 2020 meeting
https://ohbm.github.io/osr2020

Open Workflows (Software/process demo): Reproducible Research Objects with DataLad #25

Open jsheunis opened 4 years ago

jsheunis commented 4 years ago

Reproducible Research Objects with DataLad

By Adina Wagner, Institute for Neuroscience and Medicine, Brain and Behavior (INM-7), Juelich Research Centre

Abstract

DataLad makes it easy to link code, arbitrary amounts of data, software environments, procedures used for computations, and the results in a lightweight and easily shareable format, provenance-tracked and version controlled. This makes it possible to create reproducible research objects of any level of elaborateness: from “only” joining data and code, up to completely executable “reproducible paper”-type publications, hosted openly as public repositories on hosting services such as GitHub, GitLab, or Gin. In this demonstration, I will walk through a DataLad-centric analysis workflow using the Human Connectome Project data, featuring

Useful Links

http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html

Tagging @adswa

adswa commented 4 years ago

this is to confirm the talk :) (sorry, I missed the follow up e-mail...)

adswa commented 4 years ago

The slides are available here, for anyone who is interested: https://docs.google.com/presentation/d/1KzSJv9j-NwGOZv3dwuM4bgaDQmDbjIQvp8eQ3cOfysw/edit#slide=id.gc6f980f91_0_29

A write-up with further pointers is here: http://handbook.datalad.org/en/latest/code_from_chapters/OHBM_OSR.html#ohbmosr2020

Starborn commented 4 years ago

Adina, thanks for sharing, this looks like such great work.

Being a data modeller, I wonder: where is the data structure? Can the dataset be viewed in a table format, or is this an impossible thing to ask? Is the data stored in a relational database in any way? Can the data/program/model be visualized in any other way than by using docker/code?

adswa commented 4 years ago

Hi @Starborn, no, there is no relational database involved. It all builds on Git and git-annex (https://git-annex.branchable.com/). There is a hands-on introduction in the chapter on datasets in the DataLad handbook (http://handbook.datalad.org/en/latest/basics/basics-datasets.html).

As I'm not very familiar with relational databases, I may be misunderstanding your questions, so please bear with me and re-ask if necessary ;-)

where is the data structure?

In the case of the HCP data, the original data comes from the Amazon S3 buckets of the HCP project. Once locally available, the data is stored in each dataset's "object tree", a key-value store of git-annex within .git/annex/objects of each dataset. There are details on this in this handbook section (http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html) and a technical overview in the git-annex documentation (https://git-annex.branchable.com/internals/).

Can the data/program/model be visualized in any other way than by using docker/code?

I'm a bit unsure what exactly you are referring to. :) Docker/Singularity is only required to attach a software environment to the data and code in the dataset, and to execute commands inside of this software environment. It's not necessary to do this, and it is by no means necessary for visualizing any dataset contents. If your question is whether there is a GUI, then no, not for DataLad. Everything happens as command line calls or via the Python API.

I think it would be a bit easier for me to answer if I understood what you are interested in (the "data" aspect of it, i.e., getting HCP data? The reproducible execution aspect? The version control aspect? ...). In any case, the user documentation (http://handbook.datalad.org/en/latest/index.html) and the technical docs (http://docs.datalad.org/en/stable/) may be good resources to browse.
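
To make the "object tree" idea a bit more concrete, here is a minimal Python sketch; it is not taken from the talk materials. It assumes the default git-annex setup in which an annexed file is a symlink pointing into .git/annex/objects, and it assumes the key layout described in the git-annex internals documentation (BACKEND, optional fields such as -sSIZE, then "--" and a name). The function names `parse_annex_key` and `key_of_annexed_file` are hypothetical helpers, not part of DataLad or git-annex.

```python
import os
import re
from pathlib import Path

# Assumed key layout: BACKEND[-sSIZE][-<letter><digits>...]--NAME
# e.g. "SHA256E-s31390--<checksum>.nii.gz" (see the git-annex internals docs).
KEY_PATTERN = re.compile(
    r"^(?P<backend>[A-Z0-9_]+)"   # hashing backend, e.g. SHA256E or MD5E
    r"(?:-s(?P<size>\d+))?"       # optional file size in bytes
    r"(?:-[a-zA-Z]\d+)*"          # other optional numeric fields (e.g. mtime)
    r"--(?P<name>.+)$"            # checksum (plus extension) or other name
)


def parse_annex_key(key: str) -> dict:
    """Split a git-annex key into backend, size, and name."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a recognizable git-annex key: {key!r}")
    size = match.group("size")
    return {
        "backend": match.group("backend"),
        "size": int(size) if size is not None else None,
        "name": match.group("name"),
    }


def key_of_annexed_file(path) -> dict:
    """Read the symlink of an annexed file and parse the key it points to.

    Raises OSError if `path` is not a symlink (for example, an unlocked file)
    and ValueError if the link does not point into an annex object tree.
    """
    target = Path(os.readlink(path))
    if "annex" not in target.parts or "objects" not in target.parts:
        raise ValueError(f"{path} does not point into an annex object tree")
    # The last path component of the link target is the key itself.
    return parse_annex_key(target.name)
```

Calling `key_of_annexed_file` on a file inside a cloned dataset would, under these assumptions, show which key in the object tree the file name refers to, whether or not the content has been retrieved yet.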

adswa commented 4 years ago

I think I understand a bit better where you are coming from, thanks for clarifying :)

Starborn commented 4 years ago

Thanks a lot for the considerate reply.

I am going to be online and will try to get my mind around this.

Not only in relation to your work, but generally to much of the brain data/apps which I am seeing coming up. The starting point for me is resources like these ontologies:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2695392/

https://bioportal.bioontology.org/ontologies/CNO

Can/should the brain data be mapped to an ontology of some sort? Is this done or not? If not, why not? If yes, why can I not see it?

Thank you

On Wed, Jun 17, 2020 at 2:07 PM Adina Wagner notifications@github.com wrote:

I think I understand a bit better where you are coming from, thanks for clarifying :)

  • First of all, leave containers out of the equation for now. They're certainly useful to understand, but not the starting point.
  • "I guess in the first instance I need to understand where is the data": I would recommend reading the chapter http://handbook.datalad.org/en/latest/basics/basics-datasets.html to get a general idea, and taking a look at git-annex's documentation for the more technical details. To put it simply: the data can be anywhere (a webtorrent, an S3 bucket, a Dropbox account, a private webserver, ...), but its location is registered in a dataset. On demand, it can be retrieved in precise versions from this location and is then locally available on your machine.
  • "So, does your system use each data file individually rather than a set of files/records in a database?": I guess it isn't wrong to phrase it like this. DataLad only knows about files and folders; everything happens at the level of individual files in a dataset. It is completely unrelated to any database-related approach.
  • "I want to build knowledge models and processes based on the data/workflows available, so that I can try to query the data to answer different questions": Everything that is done to the data in a dataset is stored in the Git history. Maybe this is a useful starting point for queries. The Git history can also be visualized with many existing GUIs. The development of a GUI for DataLad is not actively in progress at the moment. But the DataLad handbook is written in a way that you do not need to be familiar with Python or the command line to read it, so I'm hopeful that this resource can give a comprehensive understanding of the tool. :)
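
As a rough illustration of using the Git history as a starting point for queries, the sketch below reads commit messages with `git log` and turns them into table-like rows. It is an assumption-laden example, not DataLad's own API: it assumes that commits created by `datalad run` embed a JSON record between the two marker lines shown in the code, and the helper names (`extract_run_record`, `iter_commit_messages`, `run_table`) are hypothetical.

```python
import json
import subprocess

# Assumed markers that delimit the JSON record in `datalad run` commit messages.
RECORD_START = "=== Do not change lines below ==="
RECORD_END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str):
    """Return the embedded run record as a dict, or None if there is none."""
    start = commit_message.find(RECORD_START)
    end = commit_message.find(RECORD_END)
    if start == -1 or end == -1 or end < start:
        return None
    payload = commit_message[start + len(RECORD_START):end]
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


def iter_commit_messages(repo_path: str):
    """Yield (sha, message) pairs for every commit reachable from HEAD."""
    # %H = commit hash, %B = raw message; NUL and RS bytes act as separators.
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H%x00%B%x1e"],
        check=True, capture_output=True, text=True,
    )
    for entry in result.stdout.split("\x1e"):
        entry = entry.strip("\n")
        if entry:
            sha, _, message = entry.partition("\x00")
            yield sha, message


def run_table(commits):
    """Build one row per recorded command from (sha, message) pairs."""
    rows = []
    for sha, message in commits:
        record = extract_run_record(message)
        if record is None:
            continue  # an ordinary commit without a run record
        rows.append({
            "commit": sha[:8],
            "cmd": record.get("cmd"),
            "inputs": record.get("inputs", []),
            "outputs": record.get("outputs", []),
        })
    return rows
```

With a dataset at a hypothetical path `my-analysis`, `run_table(iter_commit_messages("my-analysis"))` would list which recorded command produced which outputs. Rows like these could then be loaded into a table or a database for further querying.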

adswa commented 4 years ago

Ah! Maybe http://nidm.nidash.org/ is what you are looking for?

Starborn commented 4 years ago

That helps. It would help if you could explain where in that particular NIDM stack, and also kindly show in your code where the pointer to the file or range of files is. Thanks a lot!!! Working towards understanding the universe

adswa commented 4 years ago

I'm not too knowledgeable in this domain, so I'd suggest you contact the team around NIDM (links and pointers on the website). This particular workflow of mine isn't concerned at all with NIDM. All the best!

Starborn commented 4 years ago

Thanks, will do

Starborn commented 4 years ago

Thanks for the talk, Adina. I guess it's OK if Meghan and I start thinking about a GUI for DataLad?

adswa commented 4 years ago

hihi, sure Paola!