snakemake-workflows / cyrcular-calling

A Snakemake workflow for ecDNA detection in Nanopore or Illumina sequencing reads derived from DNA samples enriched for circular DNA.
https://snakemake.github.io/snakemake-workflow-catalog/?usage=snakemake-workflows/cyrcular-calling
MIT License

more generic repeat annotation retrieval #9

Open dlaehnemann opened 1 year ago

dlaehnemann commented 1 year ago

My current best and quickest solution for retrieval of repeat annotations is to make the RepeatMasker download link configurable via the config.yaml. However, this has several restrictions:

  1. It only works for the species and genome builds available on the RepeatMasker website, either through the species tree view or the species list view.
  2. It easily gets out of sync with the Ensembl reference species, build and release. These are also specified in the config.yaml file, right before the link spec. But who reads through those things in detail...

However, the download links for RepeatMasker do not seem systematic: species names are sometimes abbreviated (mm for Mus musculus, hg for Homo sapiens) and sometimes not (for example bosTau), and only certain species are available for certain releases of RepeatMasker and DFAM. So a somewhat systematic download rule with only meta-information provided in the config.yaml (and partly drawing on the Ensembl reference definitions) will not work.
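
To catch the out-of-sync problem from restriction 2 at least for the common cases, a small sanity check could compare the configured Ensembl species against the file name in the configured link. This is only a sketch: the function name, the prefix table (limited to the three prefixes mentioned above) and the example URL are assumptions for illustration, not something the workflow currently contains.

```python
from typing import Optional

# Assumption: file names in RepeatMasker download links start with these
# prefixes. Only the three prefixes mentioned in this issue are listed.
ASSUMED_PREFIXES = {
    "homo_sapiens": "hg",
    "mus_musculus": "mm",
    "bos_taurus": "bosTau",
}


def link_matches_species(species: str, link: str) -> Optional[bool]:
    """Check whether the file name of `link` starts with the prefix expected
    for the Ensembl `species`. Returns None if the species is not in the table."""
    prefix = ASSUMED_PREFIXES.get(species.lower())
    if prefix is None:
        return None  # nothing known about this species, cannot check
    file_name = link.rstrip("/").rsplit("/", 1)[-1]
    return file_name.startswith(prefix)


if __name__ == "__main__":
    # Hypothetical link, only for demonstration.
    print(link_matches_species("mus_musculus", "https://example.org/mm10.fa.out.gz"))
```

Such a check would only warn about obvious species mismatches. It says nothing about the genome build or release, and it shares the underlying problem that the prefixes have to be maintained by hand.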

An alternative would be to have a little RepeatMasker workflow with rules that:

  1. Download a specified version (for example 3.7) of the necessary DFAM transposable element specification (for example the Dfam_curatedonly.h5.gz).
  2. Run RepeatMasker on the workflow's Ensembl reference genome using this DFAM resource and generate the species.fa.out.gz files.

However, this seems like slightly excessive downloading and work, especially if one does not want to restrict the annotation to the curated set (the full dfam.h5.gz of version 3.7 is almost 90 GB), and it would probably warrant something like a Snakemake meta-wrapper. So I'll leave this as possible future work, in case this workflow really gets applied more often and on non-standard species.
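
For reference, the two rules of the alternative outlined above could boil down to roughly the following commands, written here as a plain Python helper rather than actual Snakemake rules. This is a sketch under several assumptions: the function and parameter names are hypothetical, the URL layout below `dfam_base_url` is assumed, and where the DFAM file has to live (and under which name) for RepeatMasker to pick it up depends on the RepeatMasker installation, so `library_dir` is just a placeholder.

```python
from pathlib import Path
from typing import List


def plan_repeatmasker_steps(
    dfam_base_url: str,   # hypothetical: base URL of the DFAM releases
    dfam_version: str,    # for example "3.7"
    library_dir: str,     # placeholder: where RepeatMasker expects its libraries
    genome_fasta: str,    # the workflow's Ensembl reference genome
    species: str,         # species name as understood by RepeatMasker
    out_dir: str,
    threads: int = 4,
) -> List[List[str]]:
    """Return the shell commands (as argument lists) for both steps."""
    dfam_file = "Dfam_curatedonly.h5.gz"
    # Assumed URL layout; adjust to wherever the release is actually hosted.
    url = f"{dfam_base_url.rstrip('/')}/Dfam_{dfam_version}/families/{dfam_file}"
    local_gz = str(Path(library_dir) / dfam_file)
    # RepeatMasker writes <input file name>.out into the directory given by -dir.
    rm_out = str(Path(out_dir) / (Path(genome_fasta).name + ".out"))
    return [
        # Step 1: download and decompress the DFAM resource.
        ["curl", "--fail", "-L", "-o", local_gz, url],
        ["gzip", "-d", local_gz],
        # Step 2: annotate the reference genome, then compress the .out table.
        ["RepeatMasker", "-species", species, "-pa", str(threads),
         "-dir", out_dir, genome_fasta],
        ["gzip", rm_out],
    ]


if __name__ == "__main__":
    for cmd in plan_repeatmasker_steps(
        "https://example.org/releases", "3.7", "resources/dfam",
        "resources/genome.fa", "mouse", "results/repeats",
    ):
        print(" ".join(cmd))
```

The helper only builds the commands and does not run them, so it can be inspected without triggering the large download.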

dlaehnemann commented 1 year ago

The current "best solution" is in:

https://github.com/snakemake-workflows/cyrcular-calling/pull/8/commits/4c441639cb50cd12f1fcf99354e126c0045c094c