Open vreuter opened 5 years ago
that is a thing of beauty. :icecream:
hey @johanneskoester what do you think? What's the next step?
So, as I have seen from your repo, there is now a SnakeProject subclass? Can you tell me what is the difference to the main Project class? If possible I would love to not have a special case, but maybe there is some reason I do not see.
So, as I have seen from your repo, there is now a SnakeProject subclass? Can you tell me what is the difference to the main Project class? If possible I would love to not have a special case, but maybe there is some reason I do not see.
It handles naming differences you requested removed from the workflow.
Hey @johanneskoester -- it would be possible to not use the SnakeProject
class if you instead subscribed to the original PEP names (like subsample
instead of unit
). All the rest of the functionality is in the main Project
class.
What do you think? Since we used different terminology to refer to these things, we have to either have an adapter, or we have to change one of the standards -- in this case we implemented an adapter.
Changing from unit to subsample and the other colname changes would be fine for me. Regarding the rest:
Project(..., subsample_index_cols=[...])
.Changing from unit to subsample and the other colname changes would be fine for me. Regarding the rest:
* [this](https://github.com/pepkit/peppy/blob/d2c23497077809b25cd37d70cbf2d4f25e4aca25/peppy/snake_project.py#L64) is not needed for Snakemake. In best practice workflows we ensure presence of certain columns via JSON schema validation.
OK
* [this](https://github.com/pepkit/peppy/blob/d2c23497077809b25cd37d70cbf2d4f25e4aca25/peppy/snake_project.py#L69) I would love to see in the original peppy project. Are there arguments against it? Could be also configurable, in the sense of `Project(..., subsample_index_cols=[...])`.
I'm fine with setting the index. Identifier column name(s) shouldn't be configurable, though. The solution in place bubbles knowledge of the schema with which standard data (like sample name and subsample name/index) may be referenced up to the level of the types themselves (Project
, SnakeProject
). Furthermore, we'd like to couple the column name used to denote something like sample identifier to an actual attribute on Sample
objects that are created and used from the metadata sheets, so parameterizing the sheet column would break that coupling.
For your first point: we are also planning to adopt the snakemake (or similar) validation for PEPs as well. So, sounds good. we should address that soon.
For your second point: I agree we should set the index in peppy and also I think that @vreuter is right that making the column name parameterizable could lead to some downstream trouble... one solution, maybe: could we take whatever column name the user provides and internally just standardize it to 'sample_name'? would solve your concern I think
I think we have a misunderstanding here. With the parameter to Project
I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).
I think we have a misunderstanding here. With the parameter to
Project
I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).
Indeed! Can't speak for @nsheff , but I'd misinterpreted your suggestion. This sounds fine, but to minimally burden Snakemake workflow authors, would you want the default for the second component (besides name) of indexing for subsample_table
indexing to be unit
rather than subsample_name
? (Is column naming in this workflow's units.tsv
typical?)
@johanneskoester, @vreuter -- where do we stand on this? I agree we can easily make this parameterizable. I think we should use subsample_name
and not unit
as the default. but feel free to speak out if you disagree
The selection of columns to index, or certain names? Indexing flexibility is fine, but naming flexibility of particularly meaningful columns isn’t, at least in terms of how they’re referenced once the object is built. So yeah we could accept, for example, unit and subsample_name but have them referenced the same
indexes are parameterize, not names.
With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).
I guess you can make it unit
in the snakeproject object if you want. @johanneskoester can update if he wants otherwise I guess
I am fine with renaming unit to subsample_name in our pipelines here. I would like to not have a snakeproject at all. If I am not wrong, once the indexing can be configured upon instantiating Project
, there should be no need for a snakeproject.
With the new release, can you give me a quick pointer how to adapt this PR to the new capabilities? Or do you want to do it yourself?
@johanneskoester it looks like the latest release of Snakemake on PyPI targets Python 3.5 or at least supports it, but that this workflow uses string formatting introduced in PEP 498 that only targets Python 3.6?
@johanneskoester it looks like the latest release of Snakemake on PyPI targets Python 3.5 or at least supports it, but that this workflow uses string formatting introduced in PEP 498 that only targets Python 3.6?
Good catch, that was not intended. Fixed in master.
@johanneskoester @nsheff
Note that
units_peppy.tsv
is included not for necessity, but to show how apeppy
user may encode a subsample/units table, and have it still function with the workflow.