pepkit / eido

Validator for PEP objects
http://eido.databio.org
BSD 2-Clause "Simplified" License

Processed PEP filter #39

Closed nsheff closed 2 years ago

nsheff commented 2 years ago

We need to have a built-in filter that returns a processed PEP.

So, the input is, of course, a PEP. The output is a PEP that has run through sample modifiers and project modifiers -- that is, a processed PEP.

So, if the user who calls the filter wants the output written to files, they need to provide several output file paths -- in fact, paths for all the possible files that could go into a PEP:

See PR #38.

All the filter does is load the PEP with peppy, and then return the objects (as strings), which will have been processed.
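
A minimal sketch of that idea, with a hypothetical function name and assuming the filter is simply handed a config path (the actual filter interface is worked out later in this thread):

import peppy

def processed_pep_filter(pep_cfg_path):
    # Constructing the Project runs the sample and project modifiers,
    # so what comes back is the processed PEP.
    proj = peppy.Project(pep_cfg_path)
    # Return a string representation that a caller could write to a file.
    return str(proj)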

@nleroy917 does this make sense to you?

nsheff commented 2 years ago

In addition to the file path output (the paths parameter), this filter should also accept a dict of environment variables to be used in the processing step. This way, users can provide environment variables that will be used to expand paths.

The value there is that we can use this to construct a proper processed PEP, with correct local paths, on whatever computing environment, even remotely.
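
One way this might work, sketched below with a hypothetical helper (load_processed_pep and its arguments are not part of eido; the sketch only assumes that peppy's path expansion reads from the process environment): apply the supplied variables before constructing the Project, then restore the environment afterwards.

import os
import peppy

def load_processed_pep(pep_cfg_path, envs):
    # Make the caller-supplied variables visible to peppy's path expansion,
    # then build the processed project.
    old_env = dict(os.environ)
    os.environ.update(envs)
    try:
        return peppy.Project(pep_cfg_path)
    finally:
        # Restore the environment so the filter has no lasting side effects.
        os.environ.clear()
        os.environ.update(old_env)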

This will be useful for implementing https://github.com/pepkit/pephub/issues/3

nleroy917 commented 2 years ago

I think so. The input to a conversion filter function then needs to be:

{
  "paths": {
    "cfg": "/path/to/config,
    "samples": "/path/to/samples"
  },
  "envs": {
    "env1": ...
  }
}

And if paths is specified, then the files are written to those paths utilizing those env vars?

nsheff commented 2 years ago

> I think so. The input to a conversion filter function then needs to be:
>
> {
>   "paths": {
>     "cfg": "/path/to/config",
>     "samples": "/path/to/samples"
>   },
>   "envs": {
>     "env1": ...
>   }
> }
>
> And if paths is specified, then the files are written to those paths utilizing those env vars?

Yep! you got it.

Specifically, the paths may be something like "cfg": "${env1}/subpath/to/config" -- hence the need for expandpath for the env vars. Oops, I wasn't thinking about this correctly: the env vars would be used in populating the paths in the sample table, not necessarily in the paths passed in by the user (it shouldn't be necessary there).

Anyway, what you said is still right -- just keep in mind the env vars aren't for these paths, they're for use by peppy in populating the derived columns, which is even easier.
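
For illustration, this expansion behaves much like os.path.expandvars, so a derived source such as "${env1}/reads/{sample_name}.fastq" would resolve against whatever is in the environment when the sample table is populated (the values below are hypothetical):

import os

# Hypothetical value; in the filter this would come from the "envs" dict.
os.environ["env1"] = "/project/data"
print(os.path.expandvars("${env1}/reads/sample1.fastq"))
# prints: /project/data/reads/sample1.fastq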

nleroy917 commented 2 years ago

> Anyway, what you said is still right -- just keep in mind the env vars aren't for these paths, they're for use by peppy in populating the derived columns, which is even easier.

Does a filter then need to set these env variables when it runs? E.g.

import os
for env in cfg["envs"]:
    os.environ[env] = cfg["envs"][env]

Also for the proposed filter. Does something like this make sense:

import peppy

def processed_pep_filter(cfg: dict):
    # building the Project applies the sample/project modifiers
    proj = peppy.Project(cfg=cfg['cfg'])
    return {
        'proj': str(proj),
        'samples': str(proj.samples)
    }

Maybe the calling function wrapper for these filters can set env variables?

nsheff commented 2 years ago

> Does a filter then need to set these env variables when it runs?

Something like this... or else some other way to just make those available to the expandpath functions that get used within peppy. I'd have to dive into peppy and do some trial and error to say exactly how to do it. Your solution may work.

> Does something like this make sense?

Well, you'll need to convert these into files, ultimately, so they'll have to be string representations. Remember, the output of a filter is one or more files. Oops, I see now you are actually using string representations. But it might have to be more complicated than the generic representation -- if we want the sample table to come out as a processed CSV, for example, I don't think str(proj.samples) does it.

So you're on the right lines, but it also has to include the subsamples file, I guess (unless you're using the new one-file sample table approach, which may be better).
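
A sketch of that last point, assuming the filter receives an already-built peppy.Project whose sample_table attribute is a pandas DataFrame (true in recent peppy versions; the function and parameter names are otherwise illustrative):

def processed_pep_filter(p, **plugin_kwargs):
    # Hypothetical sketch: p is an already-constructed peppy.Project,
    # so its sample table already reflects the sample/project modifiers.
    paths = plugin_kwargs["paths"]
    # In recent peppy versions sample_table is a pandas DataFrame, so
    # to_csv writes the processed table out as a real CSV file.
    p.sample_table.to_csv(paths["samples"], index=False)
    # The config (and a subsample table, if any) would be written the
    # same way, one output path per entry in "paths".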

nleroy917 commented 2 years ago

Presently, the wrapper around the conversion filters passes a peppy.Project instance and a plugin_kwargs dict. Does it make sense to just utilize that to pass the file paths?

plugin_kwargs = {
  "paths": {
    "cfg": "/path/to/config,
    "samples": "/path/to/samples"
  },
  "envs": {
    "env1": ...
  }
}

It currently looks for "paths" inside that dictionary, and if it finds it, it spits the result out to that path. Maybe we need both a "paths" and a "path" key, one to direct the output to a directory and one to direct the names of the files... I'd be afraid of moving away from the path keyword for fear of breaking anything else inside eido, since it feels like that is used heavily.

nsheff commented 2 years ago

> I'd be afraid of moving away from the path keyword for fear of breaking anything else inside eido, since it feels like that is used heavily.

I don't think it is... I wouldn't worry about this; I think this is all new.

This is too complicated. Don't use the folder method at all. It should be only paths.

nsheff commented 2 years ago

> Presently, the wrapper around the conversion filters passes a peppy.Project instance and a plugin_kwargs dict. Does it make sense to just utilize that to pass the file paths?

I'm not sure I really understand the question... To call any filter, you'd give a paths parameter, where you'd specify where you want it to spit things out. This is the same for every filter. This is the only way to do it, isn't it?
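
Concretely, a call through that interface might look like the following, where all paths and values are hypothetical and processed_pep_filter stands for any filter with the (project, **plugin_kwargs) signature sketched above:

import peppy

plugin_kwargs = {
    "paths": {
        "cfg": "/tmp/out/project_config.yaml",
        "samples": "/tmp/out/samples.csv",
    },
    "envs": {"env1": "/project/data"},
}

# The eido wrapper would build the Project and hand it to the filter along
# with plugin_kwargs; calling the filter directly looks the same.
proj = peppy.Project("/path/to/project_config.yaml")
processed_pep_filter(proj, **plugin_kwargs)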

nsheff commented 2 years ago

See comment here, I think this explained it well: https://github.com/pepkit/eido/pull/38/files/2815ecb1b2abac6cbee0cdeea62b939e4c311209#r864395751

nleroy917 commented 2 years ago

Yeah, I think that makes sense, and I'm trying to switch over to that implementation.