iqbal-lab-org / pling

Plasmid analysis using rearrangement distances
MIT License
28 stars 1 forks source link

Plasnet integration PR series (2/n): refactor: adding CLI to 4 python scripts #34

Closed leoisl closed 10 months ago

leoisl commented 10 months ago

This PR contains some refactors preparing for the main plasnet integration PR.

This PR contains a fix to an annoying issue. Basically, when turning pling into a python package, it could well be that users will run it inside a python virtual environment. If that python environment has precedence over the conda environment, which would be normal if the user is using virtual envs, then our conda rules will break. This happens with me, so I will exemplify better. This is my terminal line when I develop pling:

(plasnet-py3.11) (base) leandro@leandro-ThinkPad-X1-Carbon-6th:~/git/plasnet$

So it basically shows first the python environment that I am using while developing it ((plasnet-py3.11)) and later my conda env ((base)). Unfortunately, when snakemake activates a conda env during the execution of a script in a rule, e.g. during the execution of this rule: https://github.com/iqbal-lab-org/pling/blob/87e571ecb5cc6b7743471515d65f7a1a5355695e/pling/jac_network_snakemake/Snakefile#L40-L56 the conda env does not get priority over the python env. Thus the env when the script is run is:

(plasnet-py3.11) (conda_env_for_the_rule)

The execution will thus fail because the python used to execute the script will be the python from the python env not from the conda env for the rule. This is, however, not a bug, but the intended behaviour of conda, see https://github.com/conda/conda/issues/9392#issuecomment-1291041085

I can see two fixes for this:

  1. Use shell directive instead of script directive for these rules, and force the conda env to have precedence over the python env path before running the script;
  2. Use containers;

Maybe (2) was a better solution, but for now I've opted for (1). (1) is basically transforming

    script:
        "seq_jaccard.py"

into:

    shell: """
        PATH="$CONDA_PREFIX"/bin:$PATH  # force conda to have preference
        python pling/jac_network_snakemake/seq_jaccard.py \
            {params.genome1} {params.genome2} \
            {input.genome_1_fasta} {input.genome_2_fasta} \
            {params.identity_threshold} \
            {output.jaccard}
    """

By choosing (1), we need to add CLI parsing to all such python scripts so that we can call them from a shell directive. This is what this PR does.

I've also removed the code to do the quick tests in these scripts, I think it is time to move these tests into unit tests in the tests/ dir.

As you can see, I've made some unilateral choices here. Please feel free to disagree and let's discuss better solutions if you have!

leoisl commented 10 months ago

The description of the scripts were some placeholders I forgot to fix, doing it now!

leoisl commented 10 months ago

Actually, @babayagaofficial could you give me some few-word description to replace these generic descriptions of blocks.py: https://github.com/iqbal-lab-org/pling/blob/2ad615eeebf1e7871c54666d80abfb2bc4c8d5ef/pling/anno_snakemake/blocks.py#L192

and consolidation.py: https://github.com/iqbal-lab-org/pling/blob/2ad615eeebf1e7871c54666d80abfb2bc4c8d5ef/pling/anno_snakemake/consolidation.py#L81

I can't actually describe well what these scripts are doing...

Thanks!

babayagaofficial commented 10 months ago

blocks.py: Simplify integer sequence by finding blocks of genes that are always in the same order across all genomes, and representing those blocks with just one integer. consolidation.py: Construct integer sequence representation of genome, with integer=gene.

leoisl commented 10 months ago

Thanks, descriptions changed! Feel free to re-review this PR and accept or suggest changes!