pepkit / eido

Validator for PEP objects
http://eido.databio.org
BSD 2-Clause "Simplified" License

Filters should return conversion result, not just print them out (#37) #38

Closed nleroy917 closed 2 years ago

nleroy917 commented 2 years ago

I just changed the code to return the result in addition to logging it. I think the filters inside the conversion_plugins.py file will still need to return the data as well for this to work (otherwise it will return None). I am not sure if this "breaks" the idea of the CLI, so we can perhaps find another solution for pephub if necessary -- like capturing stdout instead of directly getting a returned value.

nleroy917 commented 2 years ago

Meeting Notes 01/24/2022

- make it so that filters return outputs instead of printing them
- keep in mind that outputs may contain multiple things

pep -> filter -> converted pep

filter function is independent of filepaths
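The meeting notes above can be sketched as a filter that returns its conversion result rather than printing it. This is a minimal, hypothetical sketch: real eido filters receive a peppy.Project and may emit YAML or CSV; a plain dict and json stand in here only to keep the example self-contained.

```python
# Hypothetical sketch: a filter that *returns* its output instead of
# printing it. A plain dict and json stand in for the real peppy.Project
# and YAML/CSV serialization.
import json

def basic_pep_filter(pep, **kwargs):
    """Convert the PEP and return the result so the caller can decide
    whether to print it, write it to a file, or pass it downstream."""
    return json.dumps(pep, indent=2)
```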

nleroy917 commented 2 years ago

@nsheff if you want to take a look at the commits I made... I tried to capture whatever bytes-like object the filter plugin was generating and dumping to a file, and either coerce it to a string or just return the generated string to the calling function, run_filter. Inside the calling function, I check for the file path and dump to a file if one is provided; otherwise I write the string to stdout and then also return that string (or bytes object).

I have yet to test the functions, but I thought I'd stop here before attempting multiple paths.
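The control flow described above might look like the following. The run_filter name comes from the thread; everything else is an assumed sketch, not the actual eido code.

```python
# Assumed sketch of the run_filter control flow described in the thread:
# the filter returns a string (or bytes), and run_filter either dumps it
# to a file or echoes it to stdout, always returning it to the caller.
import sys

def run_filter(pep, filter_func, path=None, **kwargs):
    """Run a conversion filter and route its output."""
    result = filter_func(pep, **kwargs)
    if isinstance(result, bytes):       # coerce bytes-like output to str
        result = result.decode("utf-8")
    if path:
        with open(path, "w") as fh:     # dump to a file if a path is given
            fh.write(result)
    else:
        sys.stdout.write(result)        # otherwise echo to stdout
    return result                       # always return for programmatic use
```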

nsheff commented 2 years ago

test your functions. I think you're on the right track.

this will only work with 1-path plugins, though.

nleroy917 commented 2 years ago

I've tested all my functions with the available pep in tests/data/peps and for one path it seems to function well. How do we envision passing multiple paths to the modules when used inside another python program? (pephub for example).

nsheff commented 2 years ago

This is what we were talking about on Monday. I thought an object like this might work:

paths = {
  "cfg": "/path/to/config.yaml",
  "samples" : "/path/to/samples.csv"
}

Or, we can just not accommodate multiple paths.

nleroy917 commented 2 years ago

Right. I guess I am just thinking of how that'd work from either the command line or inside another package... it might help if I had an example use case. For what filter(s) would one need to supply multiple paths?

Presently, all filters generate only one output -- the bytes-like output that the plugin creates.

nsheff commented 2 years ago

For what filter(s) would one need to supply multiple paths?

for example, one that outputs a "processed PEP" -- that is, a project config file, and a csv file, after amendments and modifiers have been applied.

nleroy917 commented 2 years ago

Notes from meeting 01/31/2022

Standard for filters

{
    "cfg": "/path/to/somewhere",
    "stdout": "/path/somewhere/else"
}

Leave the path designation to the filters:

nleroy917 commented 2 years ago

@nsheff Do these new filter functions and the controller look better? I forget if we discussed printing the names of the files to standard out if no path is specified; currently I am just printing out the contents of the file(s).

nleroy917 commented 2 years ago

I added a verbose flag to the run_filter function that defaults to True; pipelines (like pephub) can set it to False so we aren't polluting stdout.

also: I'm not sure why the linter test is failing... I ran black . on the entire directory

nleroy917 commented 2 years ago

Seems like this is ready -- we need to wait for peppy v0.32.0 and then change the reqs.

nsheff commented 2 years ago

OK -- peppy is now released, so can we revisit this?

nsheff commented 2 years ago

Does this need to be part of 0.1.6 release, which is imminent?

nleroy917 commented 2 years ago

Yes, it permits the programmatic conversions for pephub

nsheff commented 2 years ago

also: I'm not sure why the linter test is failing... I ran black . on the entire directory

Sometimes this is due to a version mismatch of black. try updating black?

nleroy917 commented 2 years ago

also: I'm not sure why the linter test is failing... I ran black . on the entire directory

Sometimes this is due to a version mismatch of black. try updating black?

That was it thanks.

nleroy917 commented 2 years ago

On my end, this is good to merge for 0.1.6!

nsheff commented 2 years ago

Sorry, just as I was looking at this in more depth I realize it's not doing what I was expecting. I tried to explain what I mean here, does it make sense?

nleroy917 commented 2 years ago

Yes. We are saying that there should be separate file paths for the converted config file and the converted sample table. The only confusion I have is that none of the present filters create separate output strings for the sample table and config. For example, csv_pep_filter would only use the samples key from the input dict when we require a cfg as well... if I am following correctly.

Is this a forward compatibility thing?

nsheff commented 2 years ago

Yes. We are saying that there should be separate file paths for the converted config file and the converted sample table. The only confusion I have is that none of the present filters create separate output strings for the sample table and config. For example, csv_pep_filter would only use the samples key from the input dict when we require a cfg as well... if I am following correctly.

Is this a forward compatibility thing?

I think you're missing something here. The filter itself will determine what paths it outputs (and therefore, what input paths need to be provided). We want to allow for filters that can handle more than 1 file path, even if none of the current filters require more than 1. Scroll up for an example:

For what filter(s) would one need to supply multiple paths?

for example, one that outputs a "processed PEP" -- that is, a project config file, and a csv file, after amendments and modifiers have been applied.

I'll create a new issue for that, because we should just go ahead and create that filter. Maybe it will help you see the vision here

nleroy917 commented 2 years ago

@nsheff can you look at the most recent commits? I also added some tests with the assumed programmatic use case.

nleroy917 commented 2 years ago

The paths keyword argument should match the keys of the dict returned from the specified plugin. The run_filter function should not expect any sort of return signature from a conversion plugin.

~If there is a key in the return object from the conversion plugin that doesn't exist in the paths dictionary, then we can just dump to standard out. Or we only return it.~

If verbose=True, then print out the result. Otherwise, if a path is missing from paths, then we can just skip it? If the conversion function was returning more than one item and they aren't mentioned in the paths argument, then we can warn the user.
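A hedged sketch of the behavior proposed above (assumed names and logic, not the literal eido implementation): the plugin returns a dict, each key with a matching entry in paths is written to disk, verbose controls printing, and outputs not mentioned in paths trigger a warning.

```python
# Assumed sketch of multi-output routing: plugin outputs keyed the same
# way as the paths argument; unmatched keys warn, and optionally print.
import warnings

def run_filter(pep, filter_func, paths=None, verbose=True, **kwargs):
    """Run a multi-output conversion plugin and route each output."""
    outputs = filter_func(pep, **kwargs)  # e.g. {"cfg": "...", "samples": "..."}
    paths = paths or {}
    for key, text in outputs.items():
        if key in paths:
            with open(paths[key], "w") as fh:   # write matched outputs to disk
                fh.write(text)
        else:
            # output not mentioned in paths: warn the user, optionally print
            warnings.warn(f"No output path specified for '{key}'")
            if verbose:
                print(text)
    return outputs
```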

Also, we will need to update the cli parser to accept key-value pairs:

--paths cfg=/path/to/cfg samples=/another/to/samples

For this implementation see:
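Purely as an illustration, one hypothetical way to parse such key=value pairs with argparse; this is not the implementation referenced above, and the actual eido CLI flags may differ.

```python
# Hypothetical argparse sketch for --paths key=value pairs.
import argparse

def parse_paths(pairs):
    """Turn ['cfg=/a', 'samples=/b'] into {'cfg': '/a', 'samples': '/b'}."""
    return dict(pair.split("=", 1) for pair in pairs)

parser = argparse.ArgumentParser()
parser.add_argument("--paths", nargs="*", default=[], metavar="KEY=PATH",
                    help="output paths as key=value pairs")
args = parser.parse_args(["--paths", "cfg=/path/to/cfg", "samples=/another/to/samples"])
paths = parse_paths(args.paths)
# paths == {"cfg": "/path/to/cfg", "samples": "/another/to/samples"}
```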

nleroy917 commented 2 years ago

Ok. @nsheff your review would be appreciated. Three things:

  1. The new run_filter function in conversion.py
  2. The new cli parser
  3. The unit test I wrote

We also need to address the original issue which motivated this (#39)

nsheff commented 2 years ago

Ok. @nsheff your review would be appreciated. Three things:

1. The new `run_filter` function in `conversion.py`

2. The new `cli` parser

3. The unit test I wrote

We also need to address the original issue which motivated this (#39)

  1. looks good.
  2. looks good
  3. seems fine to me, probably more we can do in the future but this is fine for now

I don't see the actual processed PEP filter, though (#39)

nleroy917 commented 2 years ago

I don't see the actual processed PEP filter, though (https://github.com/pepkit/eido/issues/39)

I can add that too. Should it be named processed_pep_filter? And return something like:

return {
    'project': ...,
    'samples': ...,
    'subsamples': ...
}

peppy applies the sample modifiers on instantiation right?

Then paths variable needs to look like:

paths = {
    'project': '/path/to/project.yaml',
    'samples': '/path/to/samples.csv'
}

nsheff commented 2 years ago

I don't see the actual processed PEP filter, though (#39)

I can add that too. Should it be named processed_pep_filter? And return something like:

return {
    'project': ...,
    'samples': ...,
    'subsamples': ...
}

peppy applies the sample modifiers on instantiation right?

Yep! So it's super simple, I guess... read it in, and then return the items... just make sure you get the tabular versions of them, rather than the object versions. Actually, that could be a parameter for this filter: do you want the samples as a table, or as a document?

nleroy917 commented 2 years ago

There are two attributes on a peppy.Project instance:

prj.samples, which produces a list:

[Sample
sample_name: GSM1558746
protocol: GRO
genome: hg38, Sample
sample_name: GSM1480327
protocol: PRO
genome: hg38]

and prj.sample_table, which appears to produce tab-separated output:

sample_name                            
GSM1558746   GSM1558746      GRO   hg38
GSM1480327   GSM1480327      PRO   hg38

Looks like the second is more appropriate

nleroy917 commented 2 years ago

I've got this:

from typing import Dict

def processed_pep_filter(p, **kwargs) -> Dict[str, str]:
    """
    Processed PEP filter that returns the converted project config plus
    the sample and subsample tables. The tables can be returned as CSV
    text or as their default representations.

    :param peppy.Project p: a Project to run the filter on
    :param bool samples_as_table: flag to render the tables as CSV
    """
    # get params
    samples_as_table = kwargs.get("samples_as_table")

    prj_repr = p.config.to_dict()

    if samples_as_table:
        sample_repr = p.sample_table.to_csv()
        subsample_repr = p.subsample_table.to_csv()
    else:
        sample_repr = p.sample_table
        subsample_repr = p.subsample_table

    return {
        "project": str(prj_repr),
        "samples": str(sample_repr),
        "subsamples": str(subsample_repr),
    }

But the difference is minuscule... one is CSV output and the other is TSV output. They seem nearly identical to me.

nsheff commented 2 years ago

use csv.

I'd just do it like this:


    return {
        "project": str(prj_repr),
        "samples": str(p.sample_table.to_csv()) if samples_as_table else str(p.samples),
        "subsamples": str(p.subsample_table.to_csv()) if subsamples_as_table else str(p.subsamples)
    }

nsheff commented 2 years ago

or, actually I'd probably make table the default; so flip those and use samples_as_objects, etc.
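The flipped-default version suggested here might look like the following. The parameter names samples_as_objects / subsamples_as_objects come from the comment above; the rest is an assumed sketch, not the merged code.

```python
# Assumed sketch of the flipped-default processed PEP filter: tables are
# the default representation, object lists only on request.
def processed_pep_filter(p, **kwargs):
    """Return the processed project config plus sample/subsample tables."""
    samples_as_objects = kwargs.get("samples_as_objects", False)
    subsamples_as_objects = kwargs.get("subsamples_as_objects", False)
    return {
        "project": str(p.config.to_dict()),
        "samples": str(p.samples) if samples_as_objects
                   else str(p.sample_table.to_csv()),
        "subsamples": str(p.subsamples) if subsamples_as_objects
                      else str(p.subsample_table.to_csv()),
    }
```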

nsheff commented 2 years ago

ok looks good to me. maybe just test it for your purposes with pephub and then I think it's ready to merge.

nleroy917 commented 2 years ago

@nsheff I've tested this with pephub and it functions well for what we discussed, I believe.