snakemake-workflows / dna-seq-gatk-variant-calling

This Snakemake pipeline implements the GATK best-practices workflow
MIT License
239 stars 145 forks source link

Peppy #8

Open vreuter opened 5 years ago

vreuter commented 5 years ago

@johanneskoester @nsheff

Note that units_peppy.tsv is included not for necessity, but to show how a peppy user may encode a subsample/units table, and have it still function with the workflow.

vreuter commented 5 years ago

https://github.com/pepkit/peppy/issues/286

nsheff commented 5 years ago

that is a thing of beauty. :icecream:

nsheff commented 5 years ago

hey @johanneskoester what do you think? What's the next step?

johanneskoester commented 5 years ago

So, as I have seen from your repo, there is now a SnakeProject subclass? Can you tell me what is the difference to the main Project class? If possible I would love to not have a special case, but maybe there is some reason I do not see.

vreuter commented 5 years ago

So, as I have seen from your repo, there is now a SnakeProject subclass? Can you tell me what is the difference to the main Project class? If possible I would love to not have a special case, but maybe there is some reason I do not see.

It handles naming differences you requested removed from the workflow.

nsheff commented 5 years ago

Hey @johanneskoester -- it would be possible to not use the SnakeProject class if you instead subscribed to the original PEP names (like subsample instead of unit). All the rest of the functionality is in the main Project class.

What do you think? Since we used different terminology to refer to these things, we have to either have an adapter, or we have to change one of the standards -- in this case we implemented an adapter.

johanneskoester commented 5 years ago

Changing from unit to subsample and the other colname changes would be fine for me. Regarding the rest:

vreuter commented 5 years ago

Changing from unit to subsample and the other colname changes would be fine for me. Regarding the rest:

* [this](https://github.com/pepkit/peppy/blob/d2c23497077809b25cd37d70cbf2d4f25e4aca25/peppy/snake_project.py#L64) is not needed for Snakemake. In best practice workflows we ensure presence of certain columns via JSON schema validation.

OK

* [this](https://github.com/pepkit/peppy/blob/d2c23497077809b25cd37d70cbf2d4f25e4aca25/peppy/snake_project.py#L69) I would love to see in the original peppy project. Are there arguments against it? Could be also configurable, in the sense of `Project(..., subsample_index_cols=[...])`.

I'm fine with setting the index. Identifier column name(s) shouldn't be configurable, though. The solution in place bubbles knowledge of the schema with which standard data (like sample name and subsample name/index) may be referenced up to the level of the types themselves (Project, SnakeProject). Furthermore, we'd like to couple the column name used to denote something like sample identifier to an actual attribute on Sample objects that are created and used from the metadata sheets, so parameterizing the sheet column would break that coupling.

nsheff commented 5 years ago

For your first point: we are also planning to adopt the snakemake (or similar) validation for PEPs as well. So, sounds good. we should address that soon.

For your second point: I agree we should set the index in peppy and also I think that @vreuter is right that making the column name parameterizable could lead to some downstream trouble... one solution, maybe: could we take whatever column name the user provides and internally just standardize it to 'sample_name'? would solve your concern I think

johanneskoester commented 5 years ago

I think we have a misunderstanding here. With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).

vreuter commented 5 years ago

I think we have a misunderstanding here. With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).

Indeed! Can't speak for @nsheff , but I'd misinterpreted your suggestion. This sounds fine, but to minimally burden Snakemake workflow authors, would you want the default for the second component (besides name) of indexing for subsample_table indexing to be unit rather than subsample_name? (Is column naming in this workflow's units.tsv typical?)

nsheff commented 5 years ago

@johanneskoester, @vreuter -- where do we stand on this? I agree we can easily make this parameterizable. I think we should use subsample_name and not unit as the default. but feel free to speak out if you disagree

vreuter commented 5 years ago

The selection of columns to index, or certain names? Indexing flexibility is fine, but naming flexibility of particularly meaningful columns isn’t, at least in terms of how they’re referenced once the object is built. So yeah we could accept, for example, unit and subsample_name but have them referenced the same

nsheff commented 5 years ago

indexes are parameterize, not names.

With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).

I guess you can make it unit in the snakeproject object if you want. @johanneskoester can update if he wants otherwise I guess

johanneskoester commented 5 years ago

I am fine with renaming unit to subsample_name in our pipelines here. I would like to not have a snakeproject at all. If I am not wrong, once the indexing can be configured upon instantiating Project, there should be no need for a snakeproject.

johanneskoester commented 5 years ago

With the new release, can you give me a quick pointer how to adapt this PR to the new capabilities? Or do you want to do it yourself?

vreuter commented 5 years ago

@johanneskoester it looks like the latest release of Snakemake on PyPI targets Python 3.5 or at least supports it, but that this workflow uses string formatting introduced in PEP 498 that only targets Python 3.6?

johanneskoester commented 5 years ago

@johanneskoester it looks like the latest release of Snakemake on PyPI targets Python 3.5 or at least supports it, but that this workflow uses string formatting introduced in PEP 498 that only targets Python 3.6?

Good catch, that was not intended. Fixed in master.