nf-core/references is a bioinformatics pipeline that builds references.
## How to hack on it

Have Docker and Nextflow installed, then run:

```bash
nextflow run main.nf
```
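Since this is an nf-core-style pipeline, a run will typically select a container profile. A minimal sketch (the `-profile docker` and `--outdir` flags follow general nf-core conventions; treat the exact parameters as assumptions rather than this pipeline's documented interface):

```bash
# Run locally with the Docker profile; --outdir is the conventional nf-core output parameter
nextflow run main.nf -profile docker --outdir results
```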
## Some thoughts on reference building
We could use a glob: if you just drop a FASTA into the S3 bucket, it gets picked up and new resources are built.
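In Nextflow terms, this could be a glob-based channel over the bucket. A minimal sketch, assuming a hypothetical bucket name and file extensions:

```nextflow
// Pick up any FASTA dropped under an S3 prefix (bucket name is an assumption)
// and feed it into the index-building processes
ch_fasta = Channel.fromPath('s3://my-references-bucket/**.{fa,fasta}')
```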
We could take this a step further and make it a little config file that holds the FASTA, GTF, genome size, etc.
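Such a config file might look like this (a sketch only; the field names echo the values mentioned elsewhere in this README, but the layout, paths, and values are all illustrative assumptions):

```yaml
# Hypothetical per-reference config; all paths and values are illustrative
genome: GRCh38
fasta: s3://my-references-bucket/GRCh38/genome.fasta
gtf: s3://my-references-bucket/GRCh38/genes.gtf
genome_size: 3.1e9
mito_name: chrM
```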
How do we avoid rebuilding? Ideally we should build once per new minor release of an aligner or reference. IMO this is fairly low priority, because the main cost is going to be egress, not compute.
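The "build once per minor release" idea boils down to keying each build on the tool, its minor version series, and the reference. A sketch of that decision logic (the function and key shape are assumptions, not pipeline code; semver-like version strings are assumed):

```python
def needs_rebuild(tool: str, version: str, reference: str, built: set) -> bool:
    """Return True only if this (tool, minor-version series, reference)
    combination has not been built before; record it if so.

    A sketch of the dedup idea; `built` stands in for whatever
    persistent record of existing indices the pipeline would keep.
    """
    major, minor = version.split(".")[:2]
    key = (tool, f"{major}.{minor}", reference)
    if key in built:
        return False
    built.add(key)
    return True
```

With this scheme a patch release (e.g. 2.7.10 to 2.7.11) reuses the existing index, while a new minor series (2.8.x) triggers a rebuild.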
How much effort is too much effort? Should it be as easy as adding a file on S3? No, that shouldn't be a requirement: you should be able to link to a reference externally (a "source of truth", i.e. an FTP link), and the workflow will build the references.
So, like mulled Biocontainers: just make a PR to the samplesheet and, boom, a new reference lands in the S3 bucket once it's approved?
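A samplesheet entry for that PR-driven flow might look like this (the column names and URLs are purely illustrative assumptions):

```csv
genome,source,fasta,gtf
GRCh38,Ensembl,ftp://example.org/GRCh38/genome.fa.gz,ftp://example.org/GRCh38/genes.gtf.gz
```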
## Roadmap

PoC: replace aws-igenomes

- Indices to build: bwa, bowtie2, star, bismark
- Files to copy over: fasta, gtf, bed12, mito_name, macs_gsize, blacklist
## Other nice things to have

- Building our test-datasets
- Downsampling for unified genomics test-dataset creation (thinking about viralintegration/rnaseq/wgs) and spiking in test cases of interest (specific variants, for example)
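As a sketch of the downsampling idea: reproducible subsampling can be as simple as keeping each record with a fixed probability under a seeded RNG. This illustrates the concept only and is not the pipeline's implementation; real FASTQ/BAM subsampling would typically go through dedicated tools such as seqtk or samtools:

```python
import random

def downsample(records, fraction, seed=42):
    """Keep roughly `fraction` of the input records, reproducibly.

    `records` can be any sequence (e.g. parsed FASTQ records);
    the fixed seed makes the selected subset deterministic across runs,
    which matters for test datasets.
    """
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]
```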
## Credits
nf-core/references was originally written by @maxulysse.
We thank the following people for their extensive assistance in the development of this pipeline:
## Citations

An extensive list of references for the tools used by the pipeline can be found in the [CITATIONS.md](CITATIONS.md) file.

You can cite the nf-core publication as follows:

> The nf-core framework for community-curated bioinformatics pipelines.
>
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
>
> Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.