How to prep indexes, annotations, etc

lcdb / lcdb-workflows

DEPRECATED. Please see https://github.com/lcdb/lcdb-wf

MIT License


How to prep indexes, annotations, etc #6

Open daler opened 8 years ago

daler commented 8 years ago

ggd will be very useful, but I'm not sure it's stable enough at the moment to build a pipeline around.

For now, I propose setting up a snakefile for each genome that prepares sequence, indexes, and annotations according to some naming convention. To minimize future effort in transitioning to ggd, we could follow their convention of {assembly}/{assembly}-{name} directories. Furthermore, each rule should be more or less standalone such that it can be converted into a shell script and bundled into a ggd recipe.

So we'd need rules to create:

dm6-gtf
dm6-refflat
dm6-sequence
dm6-gffutils-db
dm6-star
dm6-bowtie2
dm6-intergenic
dm6-rrna
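
As a rough illustration of the naming convention proposed above, a small Python helper could map an assembly and a resource name to its directory. Only the `{assembly}/{assembly}-{name}` layout and the resource names come from this proposal; the function name `ref_dir` and the `references` root directory are hypothetical.

```python
from pathlib import Path

# Resource names taken from the list above.
NAMES = [
    "gtf", "refflat", "sequence", "gffutils-db",
    "star", "bowtie2", "intergenic", "rrna",
]


def ref_dir(assembly, name, root="references"):
    """Return the {assembly}/{assembly}-{name} directory.

    `root` is a hypothetical top-level directory, not part of the proposal.
    """
    return Path(root) / assembly / f"{assembly}-{name}"


# One target directory per resource for the dm6 assembly.
targets = [ref_dir("dm6", name) for name in NAMES]
```

Each rule in a per-genome Snakefile could then declare one of these directories (or a file inside it) as its output, which keeps the rules independent of each other.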

What are some other options for managing this?

jfear commented 8 years ago

I like this as a way to organize the rules.

What about tools like STAR, whose index also uses annotation information? I guess the name could be something like:

{assembly}/{assembly}-{name}-{FB-release}
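
A minimal sketch of how that optional suffix might be handled, assuming the release is simply appended when a resource depends on the annotation. The function name `ref_name` and the release string `r6.xx` are placeholders, not real values.

```python
def ref_name(assembly, name, release=None):
    """Build {assembly}/{assembly}-{name}, adding -{release} only when given."""
    parts = [assembly, name]
    if release:
        # Only annotation-dependent resources (e.g. a STAR index) get a release tag.
        parts.append(release)
    return f"{assembly}/" + "-".join(parts)


star_index = ref_name("dm6", "star", release="r6.xx")  # "r6.xx" is a placeholder
bowtie2_index = ref_name("dm6", "bowtie2")  # no annotation, so no release tag
```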

The other thing I am thinking about is how to organize references on disk. I typically use a central repository and symlink references in, which saves space and skips the whole reference-building process. On the downside, this could lead to collisions between versions, for example if a new release of bowtie2 changes how its indexes are built. I don't know whether this is a real problem in practice. Would there be any need to include the software version in the name?

For the symlinking aspect, I guess we should create a references workflow that can build the central refs, but also include the rules in the other workflows to build the reference if needed?
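
To make the "symlink if available, otherwise build" idea concrete, here is a hedged sketch in plain Python. The function name `ensure_reference` and the `build` callback are hypothetical; in a real workflow the build step would be the Snakemake rule itself.

```python
import os
from pathlib import Path


def ensure_reference(central, local, build):
    """Make `local` available, preferring a symlink to the central copy.

    Returns "present" if `local` already exists, "linked" if a symlink to
    `central` was created, or "built" if `build(local)` had to be called.
    """
    central, local = Path(central), Path(local)
    if local.exists() or local.is_symlink():
        return "present"
    local.parent.mkdir(parents=True, exist_ok=True)
    if central.exists():
        # Reuse the central reference instead of rebuilding it.
        os.symlink(central.resolve(), local)
        return "linked"
    # No central copy: fall back to building in place.
    build(local)
    return "built"
```

This sketch does nothing about the version collisions mentioned above; a symlink to a central index built with a different bowtie2 release would be accepted silently unless the version is encoded in the path.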

daler commented 8 years ago

If we were using ggd, versioning and dependencies would be built in and handled by conda, so filenames wouldn't have to include version or dependency info. However, I think that means that if we want to compare two different FlyBase releases in the same workflow, we'd have to switch ggd environments. Otherwise, in the standalone Snakefile, we'd have to track versions as you mention.

I like the idea of a separate references workflow that can be run once per genome on a machine to set everything up, but is also included in other workflows for building if needed.

I think the next step will be figuring out what the configuration should look like for this and what the output directory structure should look like. From there it's straightforward to build the workflow.
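
As a starting point for that discussion, one possible shape for the configuration is sketched below as a plain Python dict, along with a function that expands it into the output directories. Every key name here (`references_dir`, `assemblies`, `release`, `names`) is an assumption for illustration, and nothing in this thread settles the actual layout.

```python
# Hypothetical configuration: one entry per assembly, listing what to build.
config = {
    "references_dir": "references",  # hypothetical root directory
    "assemblies": {
        "dm6": {
            "release": None,  # set to a FlyBase release string to tag outputs
            "names": ["gtf", "sequence", "bowtie2"],
        },
    },
}


def expected_dirs(cfg):
    """Expand the config into {root}/{assembly}/{assembly}-{name}[-{release}]."""
    dirs = []
    for assembly, spec in cfg["assemblies"].items():
        for name in spec["names"]:
            label = f"{assembly}-{name}"
            if spec.get("release"):
                label += f"-{spec['release']}"
            dirs.append(f"{cfg['references_dir']}/{assembly}/{label}")
    return dirs
```

A references workflow could pass the result of `expected_dirs(config)` to its top-level rule as the list of targets, so running it once per machine would set up everything listed in the config.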