pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License
37 stars 13 forks source link

Proposed config version update #334

Closed nsheff closed 4 years ago

nsheff commented 4 years ago

PEP config format version 2:

config_version: 2.0
sample_table: blahblahblah/blah.csv
subsample_table: blahblah/subsamples.csv
compare_table: .../compare.csv
sample_modifiers:
  constants:
    attribute1: value
    attr2: val2 
  implied:
    organism:
      human:
        imp_attr: value
      mouse:
        imp_arr: val2
  synonyms:
    oldattr: newattr
  derived: [read1, read2, other_attr]
  derived_sources:
    key1: "path/to/derived/value/{sample.attribute}/{project.attribute}"
    key2: "path/to/derived/value/{sample.attribute}/{project.attribute}"

subprojects:
  etc.

imports:
  - external_pep.yaml
  - http://url.com/pep.yaml

# project-specific attributes
# attributes specifically required by your project
genome: abcd

# tool-specific connections?
# settings read by other tools
divvy:
  divvy settings

looper:
  pipeline_config: ...
  pipeline_args: ...
  pipeline_interfaces: ...
  output_dir:...
  results_subdir: ...
  submission_subdir: ...

Related to:

afrendeiro commented 4 years ago

Here are a few of my thoughts:

I don't have strong feelings about any particular terminology (the current proposal seems nice), as long as the PEP format is clearly versioned.

I assume config_version will be mandatory (maybe pep_format_version would more explicit, and while we're at it why not use SemVer)? Will a new peppy version not be backward-compatible or will it support the previous PEP format and just assume if config_version is not present it is the old format? Sounds like this could cause a few headaches. For me I wouldn't mind if say peppy=1.0.0 would refuse to parse PEPs that are <2.0, but I think that could potentially annoy users. I guess in the end it will boil down to some parser implementation - if that can handle everything fool-proof, than I'd say supporting several PEP versions onwards would be worthwhile.

More generally, which fields are mandatory/required in a PEP? Which leads me to think whether there should be something close to a formal description of the PEP format, maybe as close as possible to a file format specification.

If flexible enough to support additional user-specified sections, I wonder if the PEP could also serve to annotate the nature of the sample variables specified in the sample table. E.g. I've been using sections that list sample_attributes, group_attributes and sometimes biological_attributes and technical_attributes to annotate my variables with groups of attributes that I might want to consider downstream for some analysis of the data. My ngs_toolkit for example makes extensive use of the first two for example. I guess you could say that falls under # tool-specific connections?, but not really because those variables are intrinsic to the (meta)data and not the tool.

+1 for Python3 only support

johanneskoester commented 4 years ago
nsheff commented 4 years ago

Thanks Johannes. A few responses:

I want to vote for excluding tool specific settings and pipeline specific configuration...

You are right about the tool-specific settings. The point was that these would be 'add-ons' defined by tools themselves and not part of the actual PEP spec, exactly as you suggest. In other words, a tool developer could define these "bonus sections" as extensions to PEPs for their tool. Sorry that wasn't clear.

There must be an easy way to join multiple PEPs into one.

Do you think this is solved by subprojects?

What is the purpose of compare_table?

This is just an idea which I've never really finalized, but it would be a table that encoded comparisons. Basically the point is that comparing groups is so frequent that it might be worth coming up with a standardized way to structure which sample group comparisons are of interest within the PEP structure.

Why is derived_sources not a sample modifier?

I debated where to put it, but I think you're right that it should go under that section. I have moved derived_sources section under sample modifiers

johanneskoester commented 4 years ago

There must be an easy way to join multiple PEPs into one.

Do you think this is solved by subprojects?

Probably, how are the samples accessed then? My problem is more of semantic nature. Subprojects imply a hierarchy, which might feel counterintuitive if there is no...

You are right about the tool-specific settings. The point was that these would be 'add-ons' defined by tools themselves and not part of the actual PEP spec, exactly as you suggest. In other words, a tool developer could define these "bonus sections" as extensions to PEPs for their tool. Sorry that wasn't clear.

Ok, also sounds reasonable. Just a thought: wouldn't these make sense to be in a separate file that links to the PEP, and is not part of the spec? This would "by design" avoid that people pollute their PEP with their analysis specific settings.

nsheff commented 4 years ago

wouldn't these make sense to be in a separate file that links to the PEP, and is not part of the spec?

Yes; in fact, here's a proposal that would allow this and maybe get at the 'merging' question... What if there was an imports section that could import other PEPs? All that would happen is we would first read those other PEPs, and then read the current PEP. So, they would be in priority order.

Then you could have a PEP like this:

import: main_pep.yaml
looper:
  settings...

and in main_pep.yaml you'd say:

sample_table: samples.csv
subsample_table: ...

Now the main_pep.yaml is not "polluted" with anything tool specific. Users could put them right in the main PEP, but this woudl discourage it and allow them to keep things totally separate. Further, it would allow merging of PEPs (with the most proximal settings overriding other settings).

I think this is good, but -- keep in mind the differences in perspective we have about the directionality of the connection between sample metadata and workflows... while the looper system has always been this way:

samples -> workflow

the snakemake system is:

workflow -> samples

So I would actually raise the same point about snakemake workflows; analogously, there should be a way to not pollute the workflow with the connection to the sample table. This would encourage the workflow to be more portable across different PEPs.

One approach is to allow all possible connections. We would encourage workflows and PEPs to be independent by allowing the tools that process them (eg looper and snakemake) to accept each peice individually; but, we would allow either to include a pointer to the other, such that you could embed them if you wanted to, at the cost of some perceived portability.

what do you think?

nsheff commented 4 years ago

For your subprojects question:

Probably, how are the samples accessed then? My problem is more of semantic nature. Subprojects imply a hierarchy, which might feel counterintuitive if there is no...

Here's how subprojects work: https://pepkit.github.io/docs/subprojects/

They basically let you embed multiple projects within one project. Maybe embedded_peps is a better term? In thinking about this a bit more, I think this is really not the same as merging PEPs -- that would be solved by the imports idea. But subprojects (or embedded) is a different idea that is also very useful. It's kind of the opposite...it lets you really easy pick and choose slightly different flavors of a project with just a keyword, so you don't have to duplicate them. it can even be used to put completely different PEPs into the same file, if you wanted (but that's not the intent).

Maybe these two just need to have some kind of parallel naming, like imports and exports or external_peps and embedded_peps or something?

nsheff commented 4 years ago

About @afrendeiro's comment:

Which leads me to think whether there should be something close to a formal description of the PEP format, maybe as close as possible to a file format specification.

Indeed, we have a schema for this, which is basically what you're suggesting -- a computer-understandable file format specification. Here's the jsonschema spec for the current PEP format:

http://schema.databio.org/PEP/pep.yaml

This specifies which sections are required, etc. This will already work with eido to validate generic PEPs, and the idea is that people will extend it for a specific workflow that requires specific attributes in the PEP. Here's an example of an extension that we've made for the PRO-seq pipeline:

https://github.com/databio/schema.databio.org/blob/master/pipelines/ProseqPEP.yaml

And, here's my draft for PEP config format version 2 (not currently active yet): https://github.com/databio/schema.databio.org/blob/master/PEP/pep_v2.yaml

johanneskoester commented 4 years ago

One approach is to allow all possible connections. We would encourage workflows and PEPs to be independent by allowing the tools that process them (eg looper and snakemake) to accept each peice individually; but, we would allow either to include a pointer to the other, such that you could embed them if you wanted to, at the cost of some perceived portability.

what do you think?

Yes, that sounds reasonable to me.

So I would actually raise the same point about snakemake workflows; analogously, there should be a way to not pollute the workflow with the connection to the sample table. This would encourage the workflow to be more portable across different PEPs.

That is possible in Snakemake, while indeed best practices encourage to keep the workflow together with the samples, but manage this connection via github. The sample independent workflow is a github repo. For connecting it to concrete samples, people use that repo as a template (via the github template functionality) to generate a new repo where everything is in one place. But that is just a side node. I agree that there are different approaches and your solution sounds reasonable to me.

johanneskoester commented 4 years ago

Maybe these two just need to have some kind of parallel naming, like imports and exports or external_peps and embedded_peps or something?

Not sure about the naming, but I understand the argument about the orthoginality, and agree that import functionality solves my question.

nsheff commented 4 years ago

For naming on the sample modifiers, here's what we've come up with:

This way they are all parallel (adjectives). Another possibility is to use verbs:

which do you prefer?

nsheff commented 4 years ago

And for the subprojects, @stolarczyk and I have a new suggestion and a new feature: amendments.

I realized that 'subprojects' is really how I use them, not a description of what they do. you can use them for lots of other possible things. They are kind of like "project modifiers" (to be parallel with sample modifiers), except they don't run automatically, they can be activated optionally.

Right now, you can select and activate only a single subproject when you load a PEP. But in the new version, we will call these "amendments" and you can provide a priority list of however many you want, and each one will be sequentially activated to amend the project attributes. This would allow you to easily embed different settings within the pep, and then activate them modularly as you need them.

does "amendments" seem more descriptive than "subprojects" ?

nsheff commented 4 years ago

Hi all,

I've now re-written the PEP specification. This summarizes and extends the discussion in this thread. Please review this and provide me with any comments.

~http://pep.databio.org/en/latest/specification/~

Please see the docs here: http://pep.databio.org/en/2.0.0/

nsheff commented 4 years ago

I'm also now using readthedocs for this, so we will version the pep spec on different branches, like this:

http://pep.databio.org/en/2.0.0/

nsheff commented 4 years ago

I think we are nearing the end of this proposal. We have really nice docs outlining the new spec at http://pep.databio.org/en/2.0.0/. If you haven't had a chance to glance through those, please do so.

There's one final proposal in #344 I'd like to get everyone's feedback on. But after that, I think we'll be ready to release this within a week or two and this will be set in (relative) stone.

afrendeiro commented 4 years ago

Had a look at the whole final PEP2.0 structure and I must say I really like it because it is much more clear/explicit and does seem very powerful at the same time with imports and ammendments. Also, yes for #344

vreuter commented 4 years ago

@nsheff I think the docs site is down at the moment? I can hit the home page that you linked in above comment but when I click "Learn more" I get a DNS resolution error then a "bad gateway"

stolarczyk commented 4 years ago

interesting, it works for me. What if you go there directly? http://pep.databio.org/en/2.0.0/specification/

vreuter commented 4 years ago

interesting, it works for me. What if you go there directly? http://pep.databio.org/en/2.0.0/specification/

Yep, that works! Though also the route through the main page does now, too.

vreuter commented 4 years ago

I read through the docs and they look great--sounds like PEPs are only getting more powerful! Below are a few notes:

  1. The graphics are incredible (amendments, modifiers, imports)!
  2. In the "subsample table specification" section, the example sample_table uses the old library key rather than protocol.
  3. The titles in the modifiers subsections look a bit strange, like sample_modifiers.append. I'm not sure if it's the italics, dot notation, or both, but I think it's just the novelty to me. Is the dot standard for JSON-related things, and the italics from something like lucidoc?
  4. Will the docs change now to acct for changes in #344 or will that come later?
  5. This is minor, but it'd feel more intuitive to me to have the arrows directed out from the main sample table and to the subsample table in the figure just above "Validating a PEP"
nsheff commented 4 years ago

Alright, the PEP v2 release is done. thanks all for your input.