refgenie / refgenconf

A Python object for standardized reference genome assets.
http://refgenie.databio.org
BSD 2-Clause "Simplified" License

add a refgenie asset programmatically? #75

Status: Closed (jpsmith5 closed this issue 4 years ago)

jpsmith5 commented 4 years ago

Currently, you can get_asset(), but there's no add_asset() function. Does it make sense for refgenconf to have this ability? For pipeline building, it could be useful to add a custom asset through refgenconf when the pipeline generates an item that is unique to a genome and usable beyond the scope of a single pipeline run. It may also be useful for adding custom recipes in general in the future.

stolarczyk commented 4 years ago

As discussed, this was a design decision: refgenconf only takes care of config manipulations. All file-related changes (removing, adding, and so on) are performed by refgenie.

I like the division, but maybe we should change that?

nsheff commented 4 years ago

Why would you want to add an asset from within a pipeline? What's the advantage of doing this over just building the asset (even from within the pipeline), and then having access to it? If you can programmatically build it, then it has a recipe and should go via build. To me, add is useful for manual stuff you can't build.

jpsmith5 commented 4 years ago

In this case, the asset is unique to a particular kmer length, but reusable for that genome for future analyses of the same read lengths. So it only needs to be constructed once, but is dependent on that source's read length. So I would construct it the first time that read length is encountered, but then it's not needed for future runs. Because it lives in the genome's folder, it made sense for refgenie to know about it, particularly for looking for its presence in future runs using the same approach as for other assets. Therefore, the same genome would have multiple components to the parent asset for varying kmers. So it's not a static build recipe at that point. Does that make sense?

nsheff commented 4 years ago

> Does that make sense?

No... It still seems like a build command for a given read length, run if the asset doesn't exist yet, in which case refgenie builds and manages it.

I would not create a scripted refgenie asset outside of the build system and then add it. It doesn't make sense to me.

nsheff commented 4 years ago

Alternatively, if it's not needed in the future, it should not live in the genome folder managed by refgenie.

The only things put into the refgenie genomes folder should be the result of a build process (or a pull). This sounds like it's either a refgenie-managed asset, in which case it should be built, or it's not a refgenie-managed asset, in which case it should not live in the refgenie-managed folder hierarchy.

nsheff commented 4 years ago

@jpsmith5 did you ever resolve this?

jpsmith5 commented 4 years ago

I just defaulted to requiring it to be pre-built. If you're running the pipeline through for the first time and you don't know the read lengths, and therefore have not built the requisite index, it will stop the pipeline and warn you it needs that asset. So then the user would be prompted to go build a new index at that length.
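For illustration, the check described above might look roughly like the sketch below. It is not the actual pipeline code. The `lookup` argument stands in for whatever asset lookup the pipeline uses (something that behaves like get_asset(): it returns a path, or raises if the asset is unknown). The asset naming scheme, the error class, and both function names are hypothetical.

```python
from typing import Callable


class MissingPrebuiltAssetError(RuntimeError):
    """Raised when a required asset has not been built yet (hypothetical class)."""


def read_length_asset_name(base: str, read_length: int) -> str:
    """Hypothetical naming scheme: one asset per read length."""
    return f"{base}_{read_length}"


def require_prebuilt_asset(
    lookup: Callable[[str, str], str],
    genome: str,
    asset: str,
) -> str:
    """Return the asset path, or stop with a message asking the user to build it.

    `lookup` is an assumption: any callable taking (genome, asset) that
    returns a path and raises when the asset is not available.
    """
    try:
        return lookup(genome, asset)
    except Exception as err:  # a real config object raises its own error types
        raise MissingPrebuiltAssetError(
            f"Asset '{genome}/{asset}' is not available. "
            f"Build it first (for example with `refgenie build {genome}/{asset}`), "
            "then rerun the pipeline."
        ) from err
```

A pipeline would call `require_prebuilt_asset(...)` once the read length is known, with the asset name derived from it, and let the error halt the run. The pipeline only reads the config and never writes assets into the genome folder itself.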

nsheff commented 4 years ago

But the building itself is actually scripted now as a recipe, right? So it's not an add thing, right?

jpsmith5 commented 4 years ago

Correct. It's just a refgenie build procedure using a recipe.

nsheff commented 4 years ago

ok perfect.