saezlab / dorothea

R package to access DoRothEA's regulons
https://saezlab.github.io/dorothea/
GNU General Public License v3.0
132 stars 26 forks

Resume Distributing Tabular Data #16

Closed: cthoyt closed this issue 2 years ago

cthoyt commented 4 years ago

An older version of this repository (it appears the git history has been purged) hosted tabular versions of the DoRothEA database from 20180915. More specifically, I was relying on data persisting at the following URL:

https://github.com/saezlab/DoRothEA/blob/master/data/TFregulons/consensus/table/database_normal_20180915.csv.zip?raw=true

My use case was to convert this data to BEL for reuse in larger biological networks (code at https://github.com/bio2bel/bio2bel/blob/master/src/bio2bel/sources/tfregulons.py) as part of the Bio2BEL project, in which @deeenes and @Nic-Nic have participated.

Would you be willing to resume distributing the database as a CSV to enable users who aren't using R to access the data? Or maybe there's a link somewhere to a Zenodo archive that I missed, since distributing data through GitHub isn't optimal?
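For context, this is roughly how little a consumer needs when the table ships as a zipped CSV. It is a minimal sketch using only the Python standard library, not the actual Bio2BEL code; the function name `read_zipped_csv` is hypothetical, and it assumes the archive holds a single UTF-8 CSV with a header row.

```python
import csv
import io
import zipfile


def read_zipped_csv(payload: bytes):
    """Return the rows of the first .csv member of a ZIP archive as dicts.

    `payload` is the raw content of the archive, e.g. the body of an
    HTTP response. Assumes UTF-8 text with a header row.
    """
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        # Pick the first member whose name ends in .csv
        member = next(n for n in archive.namelist() if n.endswith(".csv"))
        with archive.open(member) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            # Materialize the rows before the archive is closed
            return list(csv.DictReader(text))
```

The payload could come from `urllib.request.urlopen(url).read()` with the URL above, which is why a stable location for the file matters.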

deeenes commented 4 years ago

Yes, these files have been removed. pypath also relied on them, and I am now planning to move to the Rda format. Of course, CSV would be somewhat more convenient.

christianholland commented 4 years ago

Hi @cthoyt,

the file you are looking for is still available in the deprecated branch: https://github.com/saezlab/dorothea/tree/deprecated/data/TFregulons/consensus/table

Please note that this file differs considerably from the dorothea regulons we provide in the R package. The "package regulons" are a subset of this file (plus some additional minor changes).

You can find the most recent file comparable to the one you requested here: https://github.com/saezlab/dorothea/blob/master/data/entire_database.rda

cthoyt commented 4 years ago

Do you think it would be possible to also provide a CSV version of entire_database.rda? I was looking into it and it seems to be a simple table.

christianholland commented 4 years ago

You are right, in the end it's just a table, but as far as I know there cannot be .csv files in the data folder of Bioconductor packages. The only two ways I can think of to deposit the CSV file are on Zenodo or in the inst/extdata folder.

Do you need to parse this file only once, or do you plan to refer to the CSV file regularly?

cthoyt commented 4 years ago

I was unaware of that restriction... If I were making conspiracy theories, I'd say this was to lock people into continued usage of R

I will regularly refer to this file at its source, especially because I want to benefit from any updates you make! If I were to just download a file and start working on it locally, I wouldn't be doing reproducible science.

Hosting on either GitHub or Zenodo is good. If you want to go down the GitHub route, you can also automatically back up the entire repo on Zenodo.

deeenes commented 2 years ago

I've just seen this issue is still open. This Python module can read RDA files with absolutely no problem: https://github.com/ofajardo/pyreadr

We also use it in pypath: https://github.com/saezlab/pypath/blob/c665bd93b4cc4067e796b055a08dd0e673eaa0ea/src/pypath/inputs/dorothea.py#L309
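A minimal sketch of that approach, assuming pyreadr is installed and a local copy of entire_database.rda has been downloaded. This is not the pypath code linked above; the function name `rda_to_tsv` and the `reader` parameter are hypothetical, added only so the conversion step can be tried without pyreadr.

```python
from pathlib import Path


def rda_to_tsv(rda_path, out_dir, reader=None):
    """Write every data frame stored in an .rda file to its own TSV file.

    `reader` is any callable mapping a path to a dict of
    {object_name: pandas.DataFrame}; it defaults to pyreadr.read_r.
    Returns the list of paths that were written.
    """
    if reader is None:
        import pyreadr  # third-party: pip install pyreadr
        reader = pyreadr.read_r
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in reader(str(rda_path)).items():
        # One tab-separated file per R object, named after the object
        target = out_dir / f"{name}.tsv"
        frame.to_csv(target, sep="\t", index=False)
        written.append(target)
    return written
```

Calling `rda_to_tsv("entire_database.rda", "export")` would then leave one TSV per R object in the `export` directory.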

cthoyt commented 2 years ago

That's great. I had specific problems using pyreadr before, but what if someone working in a different language wants to use this? I still think distributing only R data creates an unnecessary lock-in to R or to languages that can wrap it, whereas a TSV is usable from any language or workflow.

deeenes commented 2 years ago

You are right about other languages, @cthoyt. So can the CSV go into extdata, as you suggested, @christianholland?