pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License
20 stars 7 forks source link

Intended behavior for sample validation against schemas #354

Closed vreuter closed 1 year ago

vreuter commented 1 year ago

As currently stands, failure of validation against any one of the schemas available will cause an early return False from the loop over samples and over schemas. However, any schema that cannot be read will simply be skipped. This seems internally contradictory (i.e., the first requires all(success) while the second permits sample passage if nothing succeeds). I guess there are multiple questions.... a) Do we want to keep the short-circuit behavior, returning False when a single sample fails? b) Do we want to continue allow passage of samples that fail to validate at all (e.g., if every schema fails to be found/read)? c) Should the validation be all or any like? That is, should the sample be allowed to proceed if it validates against a single schema, or only if it validates against all schemas?

@nsheff what do you think?

vreuter commented 1 year ago

Currently needed to optimally resolve #350

nsheff commented 1 year ago

I guess the big picture is this:

Looper can submit multiple pipelines for a sample. each one has a piface, and a schema. Ideally, if a sample fails for one schema, it should only not submit for that pipeline -- the others still work.

However. 99.9% of use cases will be for one pipeline per sample. Given that introducing further logic to looper will increase code complexity, and that I'm looking to decrease complexity, I would advocate that this is a case where it would be fine to just say "validation error" and choke altogether, because that's simple, and it's possible that if we programmed this feature, it may never actually be used in real life. So I would only solve the problem the "right way" if it was:

  1. super easy to do
  2. introduced no more than a few lines of code

I think this is unlikely to be the case so I'd just do the easier thing and say "you must be entirely valid to run at all".

Answers to questions

In light of these, here's how I'd currently answer the above 3 questions:

a) Yes, on grounds of simplicity (unless it is very simple to do it the other way). b) I'd be fine with changing it so that if schema is not found, we should fail. c) "any"-like is fine for this use case, again on grounds of simplicity.

This approach fails completely on any validation failure. This is the simplest, easiest implementation. Given that we do it this way, maybe we could do the validation on the whole Project level, instead of on individual samples. This would make things simpler, and also speed up processing.

vreuter commented 1 year ago

This was closed with #352