iqbal-lab-org / pling

Plasmid analysis using rearrangement distances
MIT License

Snakemake thinks there's whitespaces in filepaths? #61

Closed jhawkey closed 5 months ago

jhawkey commented 6 months ago

Hi Daria,

I saw Zam's talk about pling and it looks amazing! I am excited to give it a try on the data we have down here in Melbourne.

Unfortunately I'm getting a snakemake error, complaining about whitespaces in my filepaths? I can't see any whitespaces, so I'm not entirely sure what's going on. I've never used snakemake before (we are nextflow users over here) and so not certain how to troubleshoot.

This is my command (no whitespaces???):

PYTHONPATH=/mnt/nectar/analyses/plasmid_comparison_dev/pling python /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output align

Based on the instructions, I don't think I need to provide *.fasta or anything, right? Just the location of the folder?

Anyway, this is what pling returns as an error:

Batching...

Building DAG of jobs...
File path ' /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches ' starts with whitespace. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
File path ' /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches ' ends with whitespace. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
File path ' /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches ' starts with whitespace. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
File path ' /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches ' ends with whitespace. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count
-----------  -------
all                1
get_batches        1
total              2

Select jobs to execute...

[Fri May 31 03:51:36 2024]
rule get_batches:
    output:  /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches
    jobid: 1
    reason: Missing output files:  /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches
    resources: tmpdir=/tmp, mem_mb=4000, mem_mib=3815

Activating conda environment: .snakemake/conda/8e6ac951f61150f3f00e9be5b81acf1d_
usage: get_batches.py [-h] [--genomes_list GENOMES_LIST]
                      [--batch_size BATCH_SIZE] [--outputpath OUTPUTPATH]
                      [--sourmash] [--smash_threshold SMASH_THRESHOLD]
                      [--containmentpath CONTAINMENTPATH]
                      [--dcj_path DCJ_PATH]
get_batches.py: error: unrecognized arguments: /tmp_files/containment_batchwise/not_pairs_containment_distance.tsv
[Fri May 31 03:51:36 2024]
Error in rule get_batches:
    jobid: 1
    output:  /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /batches
    conda-env: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/.snakemake/conda/8e6ac951f61150f3f00e9be5b81acf1d_
    shell:

        PYTHONPATH=/mnt/nectar/analyses/plasmid_comparison_dev/pling python /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/get_batches.py             --genomes_list /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test             --batch_size 50             --outputpath /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output                          --smash_threshold 1             --containmentpath  /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output /tmp_files/containment_batchwise/not_pairs_containment_distance.tsv

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-31T035135.031168.snakemake.log

Command 'snakemake --snakefile /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/Snakefile --configfile /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/tmp_files/config.yaml --cores 1 --use-conda --rerun-incomplete --nolock  ' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 182, in <module>
    main()
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 179, in main
    pling(args)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 130, in pling
    raise e
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 125, in pling
    subprocess.run(f"snakemake --snakefile {get_pling_path()}/batching/Snakefile {snakemake_args}", shell=True, check=True, capture_output=True)
  File "/mnt/nectar/conda_envs/pling/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'snakemake --snakefile /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/Snakefile --configfile /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/tmp_files/config.yaml --cores 1 --use-conda --rerun-incomplete --nolock  ' returned non-zero exit status 1.

Any suggestions? I assume I'm just providing the command incorrectly/doing something dumb.

jhawkey commented 6 months ago

Oh and I'll add - I'm using python v3.12 and snakemake v7.32.4

iqbal-lab commented 6 months ago

Have been meaning to call you! Will let @babayagaofficial reply.

babayagaofficial commented 6 months ago

Hi, which version of pling are you using? This ought to have been fixed in v1.0.1...

jhawkey commented 6 months ago

I can't seem to easily get the version - I ran:

python run_pling.py --version
Traceback (most recent call last):
  File "run_pling.py", line 14, in <module>
    import yaml
ModuleNotFoundError: No module named 'yaml'

But when I navigate into the pling folder and run git status, it says I'm up to date with the main branch. So presumably I'm running v1.0.1?

Edited to add, also tried:

PYTHONPATH=/mnt/nectar/analyses/plasmid_comparison_dev/pling/ python /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py --version
Traceback (most recent call last):
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 14, in <module>
    import yaml
ModuleNotFoundError: No module named 'yaml'

babayagaofficial commented 6 months ago

Kind of weird that it's erroring out because it can't find the yaml module. That should've been installed along with snakemake (it's one of its dependencies), especially since it made it to the batching step without any problems before. But anyway, the fact that run_pling.py imports the yaml module tells me that you're on the right version.

Does the tmp directory with a config.yaml file exist in the pling output directory? If so, can you please send it to me?

For context, we create a config.yaml at the beginning to feed in filepaths and other inputs into snakemake later on. I've had a similar bug reported to me before, and then the problem was that extra whitespace was being added in the creation of the config file, but it was fixed when I started creating the config file through the yaml module. This looks a lot like that bug, but unfortunately it seems using the yaml module wasn't enough of a fix.
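To illustrate why stray whitespace inside a path leads to the `unrecognized arguments` error in the log above, here is a minimal sketch. The paths are shortened, hypothetical stand-ins, and the two-option parser only mimics a small part of the get_batches.py usage message; it is not Pling's actual code.

```python
import argparse
import shlex

# Hypothetical, shortened output directory with a stray trailing space,
# mimicking the '.../pling_ctxm15_smallTest_output /batches' paths in the log.
output_dir = "/data/pling_output "

command = (
    "get_batches.py --outputpath /data/pling_output "
    f"--containmentpath {output_dir}/tmp_files/not_pairs_containment_distance.tsv"
)

# The shell splits on whitespace, so the single path becomes two separate words.
words = shlex.split(command)

# Stand-in parser declaring only two of the options from the usage message.
parser = argparse.ArgumentParser(prog="get_batches.py")
parser.add_argument("--outputpath")
parser.add_argument("--containmentpath")

# parse_known_args collects whatever the parser does not recognise
# instead of exiting, so the leftover word can be inspected.
known, leftover = parser.parse_known_args(words[1:])
print(known.containmentpath)  # the truncated path, without /tmp_files/...
print(leftover)               # the orphaned second half of the path
```

With `parse_args` instead of `parse_known_args`, the leftover word would be reported as an unrecognized argument, which matches the message in the log.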

jhawkey commented 6 months ago

Yes, the config.yaml file exists inside a tmp_files directory, inside the output directory. This is the contents of config.yaml:

cat config.yaml
bakta_db: None
bakta_mem: 15000
bakta_threads: 1
batch_size: 50
bh_connectivity: 10
bh_neighbours_edge_density: 0.2
blocks_mem: 4000
build_DCJ_graph_mem: 8000
communities: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/containment/containment_communities
consolidation_mem: 4000
dcj_dist_mem: 4000
dcj_dist_threshold: 4
dcj_matrix_mem: 4000
deduplication_mem: 10000
deduplication_threads: 1
genomes_list: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test
get_communities_mem: 4000
identity_threshold: 80.0
ilp_mem: 10000
ilp_solver: GLPK
ilp_threads: 1
integerisation: align
length_threshold: 200
make_unimogs_mem: 10000
make_unimogs_threads: 1
metadata: None
minimap_mem: 4000
minimap_threads: 1
output_dir: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output
pairwise_seq_containment_mem: 10000
pairwise_seq_containment_threads: 1
panaroo_mem: 15000
panaroo_threads: 1
prefix: all_plasmids
seq_containment_distance: 0.5
small_subcommunity_size_threshold: 4
sourmash_mem: 10000
sourmash_threads: 1
timelimit: None
unimog_to_ilp_mem: 4000

babayagaofficial commented 6 months ago

Thank you!

So I was able to replicate the error when running on python 3.12 and snakemake 7.32.4, but when I dropped to python version 3.11 it was fine. Can you please try downgrading python to 3.11 and let me know if you still get the same error?

babayagaofficial commented 6 months ago

Ah, I just found out this is a bug in Snakemake: https://github.com/snakemake/snakemake/issues/2480

The solution is downgrading python to version 3.11, so hopefully that solves the matter and I just need to update the documentation.
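A wrapper script could guard against this combination before launching the workflow. The following is only a sketch: the function name and the warning text are made up for illustration, and the 3.12 cut-off reflects what was reported in this thread for snakemake 7.32.4; other Snakemake versions may behave differently.

```python
import sys


def python_version_ok(version_info=sys.version_info, first_bad=(3, 12)):
    """Return True if the interpreter is older than `first_bad`.

    `first_bad` is an assumption taken from this thread (Python 3.12 with
    snakemake 7.32.4), not a limit documented by Pling or Snakemake.
    """
    return tuple(version_info[:2]) < first_bad


if not python_version_ok():
    print(
        "warning: Python 3.12+ with snakemake 7.32.4 was reported to insert "
        "whitespace into file paths; consider using Python 3.11",
        file=sys.stderr,
    )
```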

iqbal-lab commented 6 months ago

Jesus christ

jhawkey commented 6 months ago

Hey Daria,

Thanks, downgrading to python 3.11 fixed that error.

Sadly I'm getting another one (sorry!!). This time it seems to not like the fact that the location where my fasta files are is a directory?

PYTHONPATH=/mnt/nectar/analyses/plasmid_comparison_dev/pling/ python /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output align
Batching...

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count
-----------  -------
all                1
get_batches        1
total              2

Select jobs to execute...

[Sat Jun  1 00:15:04 2024]
rule get_batches:
    output: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/batches
    jobid: 1
    reason: Missing output files: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/batches
    resources: tmpdir=/tmp, mem_mb=4000, mem_mib=3815

Activating conda environment: .snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_
Traceback (most recent call last):
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/get_batches.py", line 107, in <module>
    main()
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/get_batches.py", line 87, in main
    genomes, genome_index = get_labels(args.genomes_list)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/get_batches.py", line 16, in get_labels
    fastafiles, fastaext, fastapath = get_fasta_file_info(filepath)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/utils.py", line 22, in get_fasta_file_info
    FASTAFILES_LIST = [el[0] for el in pd.read_csv(genomes_list, header=None).values]
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
    self.handles = get_handle(
  File "/mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_/lib/python3.10/site-packages/pandas/io/common.py", line 856, in get_handle
    handle = open(
IsADirectoryError: [Errno 21] Is a directory: '/mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test'
[Sat Jun  1 00:15:05 2024]
Error in rule get_batches:
    jobid: 1
    output: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/batches
    conda-env: /mnt/nectar/analyses/plasmid_comparison_dev/.snakemake/conda/50d37ffbaf41b1426e2ae3d8c4fe3997_
    shell:

        PYTHONPATH=/mnt/nectar/analyses/plasmid_comparison_dev/pling python /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/get_batches.py             --genomes_list /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test             --batch_size 50             --outputpath /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output                          --smash_threshold 1             --containmentpath /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/tmp_files/containment_batchwise/not_pairs_containment_distance.tsv

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-06-01T001503.452318.snakemake.log

Command 'snakemake --snakefile /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/Snakefile --configfile /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/tmp_files/config.yaml --cores 1 --use-conda --rerun-incomplete --nolock  ' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 182, in <module>
    main()
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 179, in main
    pling(args)
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 130, in pling
    raise e
  File "/mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/run_pling.py", line 125, in pling
    subprocess.run(f"snakemake --snakefile {get_pling_path()}/batching/Snakefile {snakemake_args}", shell=True, check=True, capture_output=True)
  File "/mnt/nectar/conda_envs/pling/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'snakemake --snakefile /mnt/nectar/analyses/plasmid_comparison_dev/pling/pling/batching/Snakefile --configfile /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/tmp_files/config.yaml --cores 1 --use-conda --rerun-incomplete --nolock  ' returned non-zero exit status 1.

The contents of my input directory look like this:

INF344_INF344_plasmid_1.fasta  INF355_INF355_plasmid_1.fasta  INF361_INF361_plasmid_1.fasta

It made the output directory, and this is the content of config.yaml:

bakta_db: None
bakta_mem: 15000
bakta_threads: 1
batch_size: 50
bh_connectivity: 10
bh_neighbours_edge_density: 0.2
blocks_mem: 4000
build_DCJ_graph_mem: 8000
communities: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output/containment/containment_communities
consolidation_mem: 4000
dcj_dist_mem: 4000
dcj_dist_threshold: 4
dcj_matrix_mem: 4000
deduplication_mem: 10000
deduplication_threads: 1
genomes_list: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_plasmid_inputs/small_test
get_communities_mem: 4000
identity_threshold: 80.0
ilp_mem: 10000
ilp_solver: GLPK
ilp_threads: 1
integerisation: align
length_threshold: 200
make_unimogs_mem: 10000
make_unimogs_threads: 1
metadata: None
minimap_mem: 4000
minimap_threads: 1
output_dir: /mnt/nectar/analyses/plasmid_comparison_dev/pling_ctxm15_smallTest_output
pairwise_seq_containment_mem: 10000
pairwise_seq_containment_threads: 1
panaroo_mem: 15000
panaroo_threads: 1
prefix: all_plasmids
seq_containment_distance: 0.5
small_subcommunity_size_threshold: 4
sourmash_mem: 10000
sourmash_threads: 1
timelimit: None
unimog_to_ilp_mem: 4000

Currently running with python v3.11.6 and snakemake v7.32.4. Pling says it's v1.0.3.

babayagaofficial commented 6 months ago

If I've understood correctly, you've passed the directory path as the genomes_list input, which isn't the right input -- Pling needs a text file with a list of paths to each individual fasta file. If you run the command

ls -d -1 $PWD/*.fasta > input.txt

and feed in the path to input.txt for genomes_list, it should work!
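For anyone who prefers to build the list from Python, the same thing can be sketched as below. The function name `write_genomes_list` and the default `*.fasta` pattern are illustrative choices, not part of Pling.

```python
from pathlib import Path


def write_genomes_list(fasta_dir, list_path, pattern="*.fasta"):
    """Write the absolute path of every matching FASTA file, one per line."""
    fasta_dir = Path(fasta_dir).resolve()
    fasta_paths = sorted(p for p in fasta_dir.glob(pattern) if p.is_file())
    if not fasta_paths:
        raise FileNotFoundError(f"no files matching {pattern} in {fasta_dir}")
    # One absolute path per line, like the output of `ls -d -1 $PWD/*.fasta`.
    Path(list_path).write_text("".join(f"{p}\n" for p in fasta_paths))
    return fasta_paths
```

Calling `write_genomes_list("small_test", "input.txt")` would then produce the file to pass as genomes_list.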

jhawkey commented 5 months ago

Ah, thanks! That wasn't clear to me from the docs; you may want to consider updating the readme with an example command that demonstrates that the input is a text file.
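Until the docs spell this out, a small pre-flight check can turn the raw IsADirectoryError from the traceback above into a clearer message. This is a hypothetical helper written for illustration; the name `check_genomes_list` and the error wording are assumptions, and it is not something Pling provides.

```python
from pathlib import Path


def check_genomes_list(genomes_list):
    """Return the FASTA paths listed in `genomes_list`, or raise a clear error."""
    path = Path(genomes_list)
    if path.is_dir():
        raise ValueError(
            f"{path} is a directory; expected a text file with one FASTA path per line"
        )
    if not path.is_file():
        raise FileNotFoundError(f"genomes list not found: {path}")
    # Ignore blank lines and surrounding whitespace.
    entries = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    missing = [entry for entry in entries if not Path(entry).is_file()]
    if missing:
        raise FileNotFoundError(f"listed FASTA files do not exist: {missing}")
    return entries
```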

I just finished running a test set and it's worked wonderfully. Looking forward to trying it out on some data where I don't know what's going on!