nf-core / pgdb

The ProteoGenomics database generation workflow creates different protein databases for ProteoGenomics data analysis.
https://nf-co.re/pgdb
MIT License

Pass the databases as parameters, skipping downloading #50

Open ypriverol opened 2 years ago

ypriverol commented 2 years ago

@DongdongdongW has reported that the COSMIC download sometimes fails. From the email:

During this process, I encountered some problems. For some reason, the COSMIC database cannot be downloaded. At the same time, the VCF file from ENSEMBL is missing in the pipeline. So I chose to download the files from these databases myself and generate the proteogenomics database via pypgatk. When selecting the COSMIC and cBioPortal databases, I only selected data for the cell line A549 and the lung cancer type. The decoy-containing database generated by pypgatk is 3.21 GB.

We can add the download logic using wget, and also add an option so that when the user provides the COSMIC file as a parameter, the pipeline does not need to download it.
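A minimal DSL2 sketch of that conditional (`cosmic_file`, `cosmic_url`, and `DOWNLOAD_COSMIC` are placeholder names, not the pipeline's actual parameters or modules):

```nextflow
nextflow.enable.dsl = 2

// Placeholder parameters; the real pipeline would define its own names.
params.cosmic_file = null
params.cosmic_url  = 'https://example.org/CosmicMutantExport.tsv.gz'

process DOWNLOAD_COSMIC {
    output:
    path 'CosmicMutantExport.tsv.gz'

    script:
    """
    wget -O CosmicMutantExport.tsv.gz '${params.cosmic_url}'
    """
}

workflow {
    if (params.cosmic_file) {
        // The user supplied the file, so skip the download entirely.
        cosmic_ch = Channel.fromPath(params.cosmic_file, checkIfExists: true)
    } else {
        DOWNLOAD_COSMIC()
        cosmic_ch = DOWNLOAD_COSMIC.out
    }
    // cosmic_ch then feeds the pypgatk database-generation step downstream.
}
```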

DongdongdongW commented 2 years ago

I have set the parameters to supply the COSMIC files myself.

husensofteng commented 2 years ago

I think it would actually be good to add a parameter, e.g. `downloaded_data_dir` or similar, where the user can put pre-downloaded files to be used by the pipeline.

At each download step in the pipeline we can then skip downloading any files that already exist in the given directory. I don't know, though, whether there is a clean way to implement this in DSL2. Something like the sketch below might work:
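A rough sketch, where `downloaded_data_dir`, the `localOrNull` helper, and the file name are all hypothetical, and `DOWNLOAD_COSMIC` is reused from the sketch in the first comment:

```nextflow
nextflow.enable.dsl = 2

params.downloaded_data_dir = null

// Return the pre-downloaded file if it exists in the given directory,
// otherwise null (placeholder helper, not part of the pipeline today).
def localOrNull(String name) {
    if (!params.downloaded_data_dir) return null
    def f = file("${params.downloaded_data_dir}/${name}")
    return f.exists() ? f : null
}

workflow {
    def local_cosmic = localOrNull('CosmicMutantExport.tsv.gz')
    if (local_cosmic) {
        // File already present: wrap it in a channel and skip the download.
        cosmic_ch = Channel.fromPath(local_cosmic)
    } else {
        DOWNLOAD_COSMIC()   // existing download process, as sketched above
        cosmic_ch = DOWNLOAD_COSMIC.out
    }
}
```

The same helper could be called once per database (COSMIC, ENSEMBL VCFs, cBioPortal), so each download step falls back to wget only when the file is absent from the directory.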