[Feature Request] Support custom location of pypgx-bundle files

lexotero commented 1 year ago

Currently, pypgx defaults in multiple places to files present in pypgx-bundle. To locate said files pypgx assumes that pypgx-bundle has been cloned in the user's home directory. This is done with code similar to:

home = os.path.expanduser('~')
panel = f'{home}/pypgx-bundle/1kgp/{assembly}/{gene}.vcf.gz'

from https://github.com/sbslee/pypgx/blob/decb65a9ff7b906ad7cf2a031d646a87e701fcd3/pypgx/api/utils.py#L822

It would be useful to support different locations for the pypgx-bundle clone so that users can organise their filesystems in any way they like.

Although, in the future, the use of files from the pypgx-bundle could be handle differently, for now, an easy way of adding some flexibility would be to extract the fetching of files from the bundle to a utility function. This utility function then could have some logic to determine the path to the pypgx-bundle clone using user-defined environment variables, and defaulting to the user's home directory. Something like:

def estimate_phase_beagle(imported_variants, panel=None, impute=False):
    [...]
    gene = imported_variants.metadata["Gene"]
    assembly = imported_variants.metadata["Assembly"]
    [...]

    if panel is None:
        pypgx_bundle_panel_file = f"1kgp/{assembly}/{gene}.vcf.gz"
        panel = resolve_pypgx_bundle_path(relative_file_path=pypgx_bundle_panel_file)
    [...]

def resolve_pypgx_bundle_path(relative_file_path: str) -> str:
    """
    Given a relative file path for a file inside the pypgx-bundle, this
    function returns an absolute path to the file.

    This function uses the PYPGX_BUNDLE_LOCATION environment variable to
    determine where pypgx-bundle is in the filesystem. If the environment
    variable is not defined, it defaults to the user's home directory (i.e. ~).

    @param relative_file_path: the path to a file inside pypgx-bundle relative
    to the top level of the project's directory tree.
    """
    # Ensure file path is indeed a relative path
    relative_file_path = relative_file_path.lstrip("/")
    pypgx_bundle_location = os.environ.get("PYPGX_BUNDLE_LOCATION", "~")
    pypgx_bundle_path = pathlib.Path(pypgx_bundle_location).resolve()
    pypgx_file_path = pypgx_bundle_path / relative_file_path
    return str(pypgx_file_path)

The resolve_pypgx_bundle_path() function could also add some validation to ensure, for example, that the returned path exists.

Do you think this would be a useful addition to the library?

sbslee commented 1 year ago

@lexotero,

Thank you so much for this feature request and also for providing sample code! I really appreciate your contribution. I have been meaning to add an option to manually specify the file path to pypgx-bundle, and your solution using an environment variable is a clever one! I was originally thinking in the line of adding a command line option because I think this makes it very explicit as for which pypgx-bundle was used for a given run. I'd like to take some time to think about the pros and cons for each approach. But please do feel free to give your take on these two approaches.

BTW, I'm curious to hear more about your motivation for changing the path for pypgx-bundle. I understand that you'd like to organize your filesystems, but were there any specific reasons you could not or did not want to put pypgx-bundle to your home directory?

lexotero commented 1 year ago

I agree an optional CLI argument would be quite explicit. It could be a top-level CLI option that applies to all subcommands.

The main advantage I see of using the environment variable solution, is that you don't need to pass the value of the CLI argument around to all functions that need the location of pypgx-bundle.

You could, perhaps, combine both solutions and set the environment variable with the value of the CLI argument, this way a user can either define the environment variable or be explicit and pass the path to pypgx-bundle as a CLI argument. The code then internally uses the CLI argument once to set the environment variable, and when getting paths relative to the pypgx-bundle installation, you simply inspect the env variable (like in my example implementation). I don't know, this might turn out to be ugly and confusing...

Regarding your question about the motivation. Mainly, I'm thinking about flexibility in different environments. Some situations that come to mind:

Running pypgx with a service user without home directory.
Running multiple versions of pypgx in the same system simultaneously.
Some companies backup full home directories and might have limits on available space for each staff member.
Running on AWS Lambda (probably other limitations before you even have to deal with the location of pypgx-bundle).

I'd say there are workarounds, like setting the HOME environment variable to point to a different place in the filesystem, but adding the flexibility to pypgx sounds like a better solution long term.

sbslee commented 1 year ago

@lexotero,

Thank you so much for providing such a detailed explanation! It's exactly what I was hoping to hear. I hadn't really considered situations where a user doesn't have access to the home directory, but it's an important observation for sure.

As for CLI option vs. environment variable, I am still leaning towards the former because 1) many PyPGx users are not necessarily proficient in programming and I think CLI option is more explicit and user-friendly, 2) I counted and there are only four PyPGx actions that will be affected if I introduced the CLI option (run-ngs-pipeline, run-chip-pipeline, predict-cnv, and estimate-phase-beagle), and 3) I think it's more consistent with the overall CLI design of the tool.

I will publish the official release of 0.21.0 this weekend and then will start working on adding the said CLI option in the 0.22.0-dev branch.

lexotero commented 1 year ago

Thank you for considering this feature!

Let me know if you want any help reviewing/testing.

sbslee commented 1 year ago

@lexotero,

After careful consideration, I came to the conclusion that your approach is better because it's way simpler to implement and also because if a user is looking to customize the path to pypgx-bundle it probably means he or she is already quite famialir with the concept of environment variables in the first place (i.e. they are sort of 'advanced' users)!

Therefore, as promised, I had released the official version of 0.21.0 yeseterday and today I added the feature you requested to the 0.22.0-dev branch using the environment variable PYPGX_BUNDLE. I'd greatly appreciate it if you could test this new feature and let me know if everything works as expected. You can access it by:

$ git clone https://github.com/sbslee/pypgx
$ cd pypgx
$ git checkout 0.22.0-dev
$ pip install .

You can also see the updated Read the Docs as well. Thank you so much!

lexotero commented 1 year ago

Hello @sbslee ,

I'll check and get back to you today.

Thanks for your help.

lexotero commented 1 year ago

Hello @sbslee,

A friendly coworker (@erichards52) with actual field knowledge helped me out with the testing:

If the CLI argument is specified, pypgx-bundle is not used.
If no CLI argument specified, files are fetch from the PYPGX_BUNDLE directory.
If PYPGX_BUNDLE is not a valid path, you get a friendly error pypgx.sdk.utils.BundleNotFoundError: /this/should/not/work.
If PYPGX_BUNDLE is not specified, it defaults to ~/pypgx-bundle.

I think that covers the expected behaviour.

sbslee commented 1 year ago

Thank you so much, @lexotero and @erichards52! I really appreciate it. Please don't hesitate to let me know if you have any other feature ideas or questions!

sbslee / pypgx

[Feature Request] Support custom location of pypgx-bundle files #100