My current best and quickest solution for retrieval of repeat annotations is to make the RepeatMasker download link configurable via the config.yaml. However, this has several restrictions:
It only works for the species and genome builds available on the RepeatMasker website, either through the species tree view or the species list view.
It easily gets out of sync with the Ensembl reference species, build and release. These are also specified in the config.yaml file, right before the link spec. But who reads through those things in detail...
However, the download links for RepeatMasker do not seem systematic: species names are sometimes abbreviated (mm for Mus musculus, hg for Homo sapiens) and sometimes not (for example bosTau), and only certain species are available for certain releases of RepeatMasker and DFAM. So a somewhat systematic download rule with only meta-information provided in the config.yaml (and partly drawing on the Ensembl reference definitions) will not work.
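To at least catch the out-of-sync case, a small check could compare the configured link against the Ensembl reference settings. The following is only a sketch under assumptions: the config keys (ref, species, build, repeat_masker_link), the function name and the example URL are hypothetical and not taken from the workflow, and because the naming is not systematic, the lookup table has to be maintained by hand.

```python
# Hand-maintained table: (Ensembl species, Ensembl build) -> genome prefix
# expected at the start of the RepeatMasker file name.
# Only a few well-known pairs are listed; extend as needed.
KNOWN_PREFIXES = {
    ("mus_musculus", "GRCm38"): "mm10",
    ("homo_sapiens", "GRCh38"): "hg38",
    ("bos_taurus", "ARS-UCD1.2"): "bosTau9",
}


def link_matches_reference(config):
    """Return True/False if the link can be checked, None if it cannot.

    `config` is assumed to be the parsed config.yaml as a dict with the
    hypothetical keys used below.
    """
    ref = config["ref"]
    prefix = KNOWN_PREFIXES.get((ref["species"], ref["build"]))
    if prefix is None:
        # Species/build not in the table: nothing to compare against.
        return None
    # Take the file name at the end of the link, e.g. "mm10.fa.out.gz".
    file_name = config["repeat_masker_link"].rsplit("/", 1)[-1]
    return file_name.startswith(prefix + ".")


# Hypothetical config, shaped like a parsed config.yaml (placeholder URL).
example_config = {
    "ref": {"species": "mus_musculus", "build": "GRCm38", "release": "100"},
    "repeat_masker_link": "https://example.org/genomes/mm10/mm10.fa.out.gz",
}
```

Calling link_matches_reference(example_config) at the start of the workflow and failing on False would flag a link that was left behind after the reference species or build changed. It cannot help for species that are missing from the table.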
An alternative would be to have a little RepeatMasker workflow with rules that:
Download a specified version (for example 3.7) of the necessary DFAM transposable element specification (for example Dfam_curatedonly.h5.gz).
Run RepeatMasker on the workflow's Ensembl reference genome using this DFAM resource and generate the species.fa.out.gz files.
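The two steps could look roughly like the sketch below, which only builds the download URL and the shell commands and does not run anything. Several things here are assumptions to verify: the URL layout of the Dfam release server, the file name Dfam.h5.gz for the full set, the function names, and an uncompressed genome FASTA as input. The unpacked Dfam file would also still need to be placed where the local RepeatMasker installation expects its libraries, which is not shown.

```python
import os


def dfam_url(version="3.7", curated_only=True):
    """Build the download URL for a Dfam release (assumed URL layout)."""
    name = "Dfam_curatedonly.h5.gz" if curated_only else "Dfam.h5.gz"
    return f"https://www.dfam.org/releases/Dfam_{version}/families/{name}"


def repeatmasker_commands(genome_fasta, species, threads=4, out_dir="repeats"):
    """Return the commands to annotate a genome and compress the result.

    -species, -pa (parallel jobs) and -dir (output directory) are standard
    RepeatMasker options; RepeatMasker names its annotation table
    <input file name>.out inside the output directory.
    """
    run = [
        "RepeatMasker",
        "-species", species,
        "-pa", str(threads),
        "-dir", out_dir,
        genome_fasta,
    ]
    out_file = f"{out_dir}/{os.path.basename(genome_fasta)}.out"
    # gzip turns <name>.fa.out into <name>.fa.out.gz
    compress = ["gzip", out_file]
    return [run, compress]
```

In a Snakemake setting, each of these would become its own rule (download, run, compress), with the Dfam version and the curated-only switch read from the config.yaml.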
However, this seems like slightly excessive downloads and work, especially if one does not want to restrict the annotation to the curated set (the full dfam.h5.gz of version 3.7 is almost 90 GB), and it would probably warrant something like a Snakemake meta-wrapper. So I'll leave this as possible future work, in case this workflow really gets applied more often and to non-standard species.