omnideconv / deconvBench

Comparison of 2nd generation deconvolution methods implemented in omnideconv

Validation datasets #20

Closed FFinotello closed 9 months ago

FFinotello commented 2 years ago

Hi, @mlist and @grst!

We should decide how to save and (at some point) publish the validation datasets we have put together, as they can be very helpful for the deconvolution community. These include real bulk datasets (RNA-seq TPM and counts + FACS cell fractions) from human and mouse; the latter would also be relevant for the mouse immunedeconv vignette. All except one dataset are ready, and the authors agreed on their publication. We will also have several simulated pseudobulk datasets that we might want to save online but publish only at a later stage.

What would be the best scheme for storing such data while retaining some flexibility over what is integrated or made public at a later stage? Whatever scheme we choose, it would be nice to reference it in a "validation data" section of the omnideconv website.

Tagging also @LorenzoMerotto and @alex-d13

Cheers, Francesca

federicomarini commented 2 years ago

For ease of use and to streamline other things, I would advocate going the ExperimentHub way. Programmatic access in R is quite a cool thing and would simplify the development and benchmarking of new methods.
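
For context, programmatic access would look roughly like this. A minimal sketch, assuming the datasets had been published to the hub; the query term "omnideconv" and the record ID "EH0000" are placeholders, not real hub entries:

```r
# Minimal sketch of ExperimentHub access, assuming the validation data
# were published to the hub. "omnideconv" and "EH0000" are placeholders.
library(ExperimentHub)

eh <- ExperimentHub()            # connect to the hub (results are cached locally)
hits <- query(eh, "omnideconv")  # find all resources matching the search term
hits                             # lists titles, species, and descriptions

bulk <- eh[["EH0000"]]           # download and load one resource by its ID
```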

FFinotello commented 2 years ago

> For ease of use and to streamline other things, I would advocate going the ExperimentHub way. Programmatic access in R is quite a cool thing and would simplify the development and benchmarking of new methods.

And what are the cons (e.g. data size constraints, implementation)? ;)

grst commented 2 years ago

In general, I think this makes a lot of sense! But since I don't have prior experience with it, I'd also love to hear more from @federicomarini about how this works and possible limitations.

For instance, is there also a way of downloading the datasets without going through the R API, i.e. just by clicking a download link on a website?

federicomarini commented 2 years ago

- Data size constraints: none.
- Implementation: very little overhead, "some documentation enforced" - but that is a huge plus in the long run.
- Access: via R straight away. It could be coupled with a link to the same object on Zenodo.

The "findability" in the Hub is IMO too nice an aspect to ignore. All in all: I am very happy to assist you in this, if this is the way we choose 😉

grst commented 2 years ago

Where are the files hosted? By Bioconductor themselves, or do you just provide a link to anywhere (e.g. Zenodo) that's accessible via the API?


federicomarini commented 2 years ago

You upload them once to an AWS bucket or something similar, and then the hosting is on their side. There could also be other configurations, but I have never tried them out.

See here for more details: https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html
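
To make the vignette's workflow concrete: a hub package mainly needs a metadata.csv describing each resource, plus the data files uploaded to the bucket. A minimal sketch of one such row, using the standard ExperimentHub metadata columns; every value below is a placeholder, not an actual omnideconv entry:

```r
# Sketch of the metadata.csv an ExperimentHub package needs (see the
# HubPub vignette linked above). All values are placeholders for one
# hypothetical validation dataset.
meta <- data.frame(
  Title = "Example bulk RNA-seq validation dataset",          # placeholder
  Description = "Bulk RNA-seq (TPM + counts) with matched FACS fractions",
  BiocVersion = "3.16",
  Genome = "GRCh38",
  SourceType = "TXT",
  SourceUrl = "https://example.org/placeholder",              # placeholder
  SourceVersion = "1.0.0",
  Species = "Homo sapiens",
  TaxonomyId = 9606,
  Coordinate_1_based = NA,
  DataProvider = "omnideconv",
  Maintainer = "Jane Doe <jane.doe@example.org>",             # placeholder
  RDataClass = "SummarizedExperiment",
  DispatchClass = "Rds",
  RDataPath = "deconvData/example_bulk.rds"  # path inside the hub's bucket
)
write.csv(meta, "inst/extdata/metadata.csv", row.names = FALSE)
```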

mlist commented 2 years ago

I think this is an important task for the benchmark. @federicomarini, given your experience here, it would be great if you could support us with the Bioconductor technicalities, e.g. contacting the Bioconductor hub team, obtaining an access token (it looks like they use Microsoft Azure for this), etc., while I'd ask @alex-d13 to prepare and upload the datasets. How does that sound?