pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License
37 stars 13 forks source link

Feature request: `peppy.Project.config.to_yaml()` doesn't accurately reproduce original config file #399

Closed nleroy917 closed 1 month ago

nleroy917 commented 2 years ago

Overview

Over at PEPhub, I am implementing a feature to download a PEP as a zipped folder. To do this, I am recapitulating the PEP using the internal state of the peppy.Project object returned from pepagent and the database. However, the peppy.Project.config object saves file path configurations as relative to where the peppy.Project object was instantiated, rather than the actual value in the file. This becomes problematic when downloading PEPs since the file path in the project_config.yaml file and the actual file location are now out of sync. Here is an example:

Example

The config file:

cat examples/demo/basic/project_config.yaml
pep_version: "2.0.0"
sample_table: sample_table.csv

At the root of pephub (/):

>>> import peppy
>>> p = peppy.Project("examples/demo/basic/project_config.yaml")
>>> p.config
pep_version: 2.0.0
sample_table: examples/demo/basic/sample_table.csv

Inside the PEP folder (/examples/demo/basic/project_config.yaml):

>>> import peppy
>>> p = peppy.Project("project_config.yaml")
>>> p.config
pep_version: 2.0.0
sample_table: sample_table.csv

Note how the sample_table attribute is changing depending on where the PEP was instantiated

I suspect peppy uses this state internally to initialize things and it was never intended that the original config file would need to be reproduced. This is less of a bug, I guess, and more of a feature request. I could manually update the internal state of peppy.Project and just strip the relative paths using os.path.basename, but it would be nice to have a method that can recreate the PEP assuming all files are next to one another.

Supplemental Material :D

Here is the zipping function where I build back up the PEP:

def zip_pep(project: peppy.Project) -> Response:
    """Zip a project up to download"""
    content_to_zip = {}

    if project.config:
        cfg_filename = basename(project.config_file)
        content_to_zip[cfg_filename] = project.config.to_yaml()
    if project.sample_table is not None:
        sample_table_filename = basename(project.to_dict().get('sample_table', "sample_table.csv"))
        content_to_zip[sample_table_filename] = project.sample_table.to_csv()
    if project.subsample_table is not None:
        # sometimes the subsample table is a list. So change behavior
        # based on this
        if not isinstance(project.subsample_table, list):
            subsample_table_filename = basename(project.to_dict().get('subsample_table', "subsample_table.csv"))
            content_to_zip[subsample_table_filename] = project.subsample_table.to_csv()
        else:
            subsample_table_filenames = project.to_dict().get('subsample_table', "subsample_table.csv")
            for sstable, sstable_filename in zip(project.subsample_table, subsample_table_filenames):
                subsample_table_filename = basename(sstable_filename)
                content_to_zip[subsample_table_filename] = **sstable.to_csv()**
nsheff commented 2 years ago

Hmm.

I'm a little surprised the peppy adjusts the content in the config file, to be honest.

I think instead, peppy should use an internal value where it keeps track of where it actually finds the sample tables, rather than updating the config. @stolarczyk can you comment on this?

@Khoroshevskyi with your experience with all the to_dict stuff, can you comment on this?

khoroshevskyi commented 2 years ago

Hmm.

I'm a little surprised the peppy adjusts the content in the config file, to be honest.

I think instead, peppy should use an internal value where it keeps track of where it actually finds the sample tables, rather than updating the config. @stolarczyk can you comment on this?

@Khoroshevskyi with your experience with all the to_dict stuff, can you comment on this?

to_dict function will take all the content of _config and _sample_df variables, it won't change anything. But I agree with @nleroy917 , that this variables are changing during peppy processing. It won't influence on methods that are initializing peppy from dict, but it will initial paths, that can cause errors in the future (e.g. in this Nathans issue).

stolarczyk commented 2 years ago

I think instead, peppy should use an internal value where it keeps track of where it actually finds the sample tables, rather than updating the config. @stolarczyk can you comment on this?

Yeah, this seems like a sensible approach. I'm not sure what was my motivation for implementing it the other way. I would have done what you suggest if this issue had manifested earlier.