o2r-project / containerit

Package an R workspace and all dependencies as a Docker container
https://o2r.info/containerit/
GNU General Public License v3.0
290 stars 29 forks

Add a metadata extraction script to higher level containers (session extraction) #47

Open nuest opened 7 years ago

nuest commented 7 years ago

When packaging research into higher level containers, e.g. ERC, we most probably need some meta information. While the user can be asked for this, see #13, it would be better to extract this automagically from the session.

For this, we would need a feature that appends a script to the "main script file" of the container, so that it has access to the R session _after_ the analysis has completed.
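As a rough sketch of that idea (the file name, snippet contents, and helper function are assumptions for illustration, not containerit's actual API), appending a post-analysis extraction snippet could look like this:

```python
# Hypothetical sketch: append a session-extraction snippet to the packaged
# "main script" so that it runs in the same R session, after the analysis.
EXTRACTION_SNIPPET = """
## --- appended by containerit: session metadata extraction ---
sink("session_metadata.txt")
print(sessionInfo())
print(ls.str())
sink()
"""

def append_extraction(main_script_path):
    # Appending (mode "a") keeps the user's analysis code untouched and
    # ensures the snippet only executes after the analysis has finished.
    with open(main_script_path, "a") as f:
        f.write(EXTRACTION_SNIPPET)
```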

Some ideas for information that could be extracted here:

This feature is complementary to the file analysis conducted by @7048730 in https://github.com/o2r-project/o2r-meta

ghost commented 7 years ago

When I first tried to do a session extraction, I thought of something like this to write out the data objects and environment information:

    sink("session_info.txt")  # divert console output to a file
    ls.str()                  # objects in the workspace with their structure
    sessionInfo()             # R version, platform, attached packages
    sink()                    # restore normal console output
nuest commented 6 years ago

@7048730 @MarkusKonk I thought about this a little more and would like to work on it in the next weeks, if we find a solution to the following issue: a circular dependency between "session metadata extraction" and "metadata extraction + checking".

The current process of creating a compendium (leaving out steps not important for the problem) roughly is

upload > metadata extraction > metadata check > publish compendium > start job: manifest generation

The manifest generation includes running the analysis, so that we know what must be part of the manifest. It must happen after the metadata extraction, because it relies on the extracted metadata. The session after the analysis has completed is also the point where this session extraction would take place. So the problem is: how do we integrate the extractions? o2r-meta already has a mechanism to merge several metadata sources.

Can meta already run only the "integration" part using a set of input files? We could re-run that when a job has started, or we could make the first job run part of the upload, which means the user would have to wait for the metadata check until after all extractions.
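For illustration, the "integration" step could boil down to a precedence merge over the available metadata sources; this is a hypothetical sketch of the idea, not o2r-meta's actual merge logic, and all keys and values here are made up:

```python
# Hypothetical sketch of merging several metadata sources: later sources
# take precedence over earlier ones, key by key. Re-running this once the
# session-based extraction has produced its output would integrate it
# with the earlier file-based extraction.
def merge_metadata(*sources):
    merged = {}
    for source in sources:
        # Skip None values so an incomplete source cannot erase a known value.
        merged.update({k: v for k, v in source.items() if v is not None})
    return merged

file_based = {"title": "My analysis", "r_version": None}
session_based = {"r_version": "3.4.3", "main_file": "main.R"}
combined = merge_metadata(file_based, session_based)
# combined keeps "title" from the file-based pass and fills in the
# session-based "r_version" and "main_file".
```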

A) integrate metadata after job run

upload > metadata extraction (file-based) > metadata check > publish compendium > start job: manifest generation & metadata extraction (session-based) > notify author > integrate o2r metadata and session metadata

B) run job before real publish

upload > metadata extraction (file-based) > metadata check (partial? only what is needed for manifest generation?) > start job: manifest generation & metadata extraction (session-based) > integrate o2r metadata and session metadata > metadata check ("full" = as it is now) > publish compendium

We cannot get rid of the first metadata check because we cannot reliably detect the "main file"...