refgenie / refgenconf

A Python object for standardized reference genome assets.
http://refgenie.databio.org
BSD 2-Clause "Simplified" License
3 stars 6 forks source link

First pass at populator and looper plugin #113

Closed nsheff closed 3 years ago

nsheff commented 3 years ago

Wait for:

This PR introduces a new concept called the refgenie populator. It is used to scan input for refgenie:// registry paths and replace them with their refgenie seek local paths.

Use cases:

It can be used to pre-populate input for a workflow run. This way, even a refgenie-unaware workflow can benefit from refgenie because you just run this populator on your input data just before running the workflow on it. This is the way CWL works; CWL workflows in best practices require knowledge of all input files before the workflow run begins. So, it doesn't make sense to pass a registry path, which is then resolved by refgenie inside the workflow.

It's also nice that this makes the workflow itself independent of refgenie.

I've implemented a looper plugin that makes this automatic. Use like:

pre_submit:
  python_functions:
  - refgenconf.looper_refgenie_plugin
  - looper.write_sample_yaml_cwl

I would suggest we also implement a refgenie populate command-line command that would populate any refgenie:// paths and give the result to stdout, which could be easily done using functions included in this PR.

See it in action here: https://github.com/pepkit/pep-cwl/tree/32944212e109fae09119c3196c90596e7101b61d/bioinformatics_demo

piface using the plugin: https://github.com/pepkit/pep-cwl/blob/dev/bioinformatics_demo/bt2_cwl_interface.yaml

nsheff commented 3 years ago

I'm just wondering if populate_refgenie_refs is not just a special case of replace_str_in_obj, which I recently implemented.

well, probably related -- but it's more complicated here, because I have to use rgc.seek to look up what to replace with, since the str to replace is a registry path which is dynamic... so I doubt it would be straightforward to reuse in this case...

nsheff commented 3 years ago

@stolarczyk would you mind taking a quick look at this merge conflict? I don't want to mess anything up. I'm planning to merge this PR now

nsheff commented 3 years ago

For example, I'm not sure why __all__ was removed in dev

nsheff commented 3 years ago

@stolarczyk can you explain why you changed the branch from dev to master?

codecov[bot] commented 3 years ago

Codecov Report

Merging #113 (5cdfa13) into dev (7681851) will increase coverage by 0.36%. The diff coverage is 86.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #113      +/-   ##
==========================================
+ Coverage   77.54%   77.91%   +0.36%     
==========================================
  Files          31       33       +2     
  Lines        2601     2694      +93     
==========================================
+ Hits         2017     2099      +82     
- Misses        584      595      +11     
Impacted Files Coverage Δ
refgenconf/populator.py 66.66% <66.66%> (ø)
refgenconf/refgenconf.py 66.34% <67.74%> (+0.19%) :arrow_up:
refgenconf/__init__.py 100.00% <100.00%> (ø)
tests/test_populate.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 7681851...5cdfa13. Read the comment docs.

stolarczyk commented 3 years ago

I changed the branch so that we can release it sooner, you said it's ready. I didn't want it to get stuck for a couple of months in dev before we make another huge release with the new recipe system.

stolarczyk commented 3 years ago

Just realized that there will be a sooner release, introducing the "cloud links" so maybe that's all for nothing..

stolarczyk commented 3 years ago

how about making it a RefGenConf method? RefGenConf.populate or RefGenConf.populate_ref

nsheff commented 3 years ago

I had some reason for not doing it that way at first... I think it had something to do with the way looper plugins work.

but I also wrote a method that calls this function as well, I guess I haven't pushed it yet

nsheff commented 3 years ago

actually I think at this point it can just all become a method.

nsheff commented 3 years ago

great!

You can also add a test that considers a structured input. you can give .populate either a raw string, which will search for any refgenie paths, or you can give it a nested dict, and it will search all values of the dict, following nesting.

stolarczyk commented 3 years ago

that's exactly what I'm doing now :)

nsheff commented 3 years ago

is this ready to merge?

stolarczyk commented 3 years ago

I'm not planning any updates here, so it's ready if you are done