harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

issues with packages for creating environment #161

Closed jasonwjohns closed 3 months ago

jasonwjohns commented 3 months ago

Hello,

First, thank you for developing what looks to be a beautiful workflow. It will be so nice to have all of the data processing tools in one place and one pipeline. I've been having trouble getting the snpArcher snakemake workflow to run after almost a full day of troubleshooting, so here I am. I haven't kept track of everything I've tried and won't recount it all here, but I'm hoping that a few basics about my OS and settings can help guide the process.

I'm using a Mac with an M2 chip running macOS Sonoma 14.1.1, in a zsh terminal under Rosetta.

I installed Miniforge by running the Miniforge3-MacOSX-arm64.sh script. I noticed on the Miniforge GitHub that Apple silicon chips have not been tested, so maybe that's the issue? For what it's worth, I'm able to create and activate a mamba environment with the instructed commands:

mamba create -c conda-forge -c bioconda -n snparcher "snakemake==7.32.4" "python==3.11.4"
mamba activate snparcher

I then clone the GitHub repo from harvardinformatics, as instructed. When I run the test script I get some version of the following error:

(snparcher) jason@Jasons-MacBook-Air snpArcher % snakemake -d .test/ecoli --cores 1 --use-conda
Building DAG of jobs...
Creating conda environment /Users/jason/snpArcher/workflow/rules/../envs/cov_filter.yml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /Users/jason/snpArcher/workflow/rules/../envs/cov_filter.yml:
Command:
mamba env create --quiet --file "/Users/jason/snpArcher/.test/ecoli/.snakemake/conda/065fe9be415abd326cf869269e4d29fa_.yaml" --prefix "/Users/jason/snpArcher/.test/ecoli/.snakemake/conda/065fe9be415abd326cf869269e4d29fa_"
Output:
Could not solve for environment specs
The following packages are incompatible
├─ bedtools 2.30.0  does not exist (perhaps a typo or a missing channel);
├─ binutils does not exist (perhaps a typo or a missing channel);
├─ d4tools >=0.3.4  does not exist (perhaps a typo or a missing channel);
└─ gcc does not exist (perhaps a typo or a missing channel).

Interestingly, if I run the same command several times in a row, each time the error gives a different list of incompatible packages. If I try to install any one package manually with conda, I get the error message below saying the package is not available:

(snparcher) jason@Jasons-MacBook-Air snpArcher % conda install gcc
Channels:
 - conda-forge
 - bioconda
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - gcc

Current channels:

  - https://conda.anaconda.org/conda-forge
  - https://conda.anaconda.org/bioconda/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Is this an issue with my using an M2 chip, or maybe some other issue?

Thank you in advance for any guidance you can provide! Jason

cademirch commented 3 months ago

Hi Jason,

Yes, this is indeed an issue with Apple Silicon.
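
Most bioconda and conda-forge builds of these tools target Intel (osx-64) rather than osx-arm64, which is why the solver reports them as missing. If you want to confirm that for any one of the packages from your error (bedtools is just an illustrative pick here), you can search the Intel subdir explicitly:

conda search -c bioconda --subdir osx-64 "bedtools==2.30.0"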

I would recommend remaking your snparcher env like so:

CONDA_SUBDIR=osx-64 mamba create -c conda-forge -c bioconda -n snparcher "snakemake==7.32.4" "python==3.11.4"
conda activate snparcher
conda config --env --set subdir osx-64

Thanks for reporting this - I will make sure to add a note to the docs.

cademirch commented 3 months ago

Actually, in this particular case it might be more than just the Apple Silicon issue, as I'm running into this even with the above commands. I'll dig into it and report back.

jasonwjohns commented 3 months ago

Hi Cade,

Thanks very much for the quick response. I did as you suggested but unfortunately got the same error. When I ran the test snakemake workflow again I got a similar error, but with sambamba as the only listed incompatible package.

Could not solve for environment specs
The following package could not be installed
└─ sambamba 0.8.0  does not exist (perhaps a typo or a missing channel).

Then the third time it just listed bedtools and genmap. Maybe this is due to caching or something?

Could not solve for environment specs
The following packages are incompatible
├─ bedtools 2.30.0  does not exist (perhaps a typo or a missing channel);
└─ genmap >=1.3.0  does not exist (perhaps a typo or a missing channel).

Any chance you have any other suggestions, or am I SOL for now with Apple Silicon?

Thank you! Jason

jasonwjohns commented 3 months ago

Oops just saw your above comment about looking into it further. Thank you Cade!

For what it's worth, I've tried the troubleshooting on both my personal computer and our lab computer with pretty much the same results. Both have M2 chips...

Thanks!

cademirch commented 3 months ago

Interesting, I was able to build the conda envs except cov_filter.yaml (which has the gcc entry), also on an M2. When you activate your snparcher env, can you run conda config --show? You should see these lines near the bottom:

...
subdir: osx-64
subdirs:
  - osx-64
  - noarch
...

jasonwjohns commented 3 months ago

I do get those subdir lines as well when running conda config --show.

Yesterday I got to the same place you did, with the gcc error. This morning I reinstalled Miniforge and tried again, which gave the different errors included in the messages above.

The differences in error messages may have been due to the fact that I uninstalled Miniforge and conda this morning, then reinstalled Miniforge with the Apple silicon .sh file. Yesterday I was working with my existing conda installation plus Miniforge downloaded with the curl command from their GitHub:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

cademirch commented 3 months ago

Okay. In the interim, to get around this you can disable cov_filter in your own workflow and in the tests by setting that value to False in the config.yaml. Also, if you have access to a Linux machine, snpArcher should run fine there, though I understand that may not be an option.
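
For example, a minimal sketch (double-check the exact key name and surrounding settings against the config.yaml that ships with the repo and the .test/ecoli config):

cov_filter: False

Then re-run the test as before:

snakemake -d .test/ecoli --cores 1 --use-conda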

Sorry again that this has caused you to sink so much time into troubleshooting! I've been there many times with conda and mamba, unfortunately.

jasonwjohns commented 3 months ago

No worries at all. This is the nature of the beast, and I'm sure the workflow will save me a lot of time and heartache in the long run. I'll actually be using it for a bunch of CCGP data, true to its roots!

Thanks for the workaround too. I may eventually run at least some of my data on a cluster, so that should solve these issues. I suppose if I end up wanting to incorporate coverage thresholds I can do so in post processing?

Also, even after setting the cov_filter value to False, I ended up with an error. I got quite a long way through the workflow before getting stopped at the gatk_db_import step. Below is what I got:

Activating conda environment: .snakemake/conda/ed0cf9835b6fd7e510d59bd0c0312a92_
tar: Option --overwrite is not supported
Usage:
  List:    tar -tf <archive-filename>
  Extract: tar -xf <archive-filename>
  Create:  tar -cf <archive-filename> [filenames...]
  Help:    tar --help
[Wed Mar  6 16:26:15 2024]
Error in rule gvcf2DB:
    jobid: 51
    input: results/dmAquExim1.NCBI.p_ctg.fasta/gvcfs/jon_test_1.g.vcf.gz, results/dmAquExim1.NCBI.p_ctg.fasta/gvcfs/jon_test_1.g.vcf.gz.tbi, results/dmAquExim1.NCBI.p_ctg.fasta/intervals/db_intervals/0005-scattered.interval_list, results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_mapfile.txt
    output: results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0005, results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0005.tar
    log: logs/dmAquExim1.NCBI.p_ctg.fasta/gatk_db_import/0005.txt (check log file(s) for error details)
    conda-env: /Users/johns/test/.snakemake/conda/ed0cf9835b6fd7e510d59bd0c0312a92_
    shell:

        export TILEDB_DISABLE_FILE_LOCKING=1
        gatk GenomicsDBImport             --java-options '-Xmx25600m -Xms25600m'             --genomicsdb-shared-posixfs-optimizations true             --batch-size 25             --genomicsdb-workspace-path results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0005             --merge-input-intervals             -L results/dmAquExim1.NCBI.p_ctg.fasta/intervals/db_intervals/0005-scattered.interval_list             --tmp-dir /var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T             --sample-name-map results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_mapfile.txt &> logs/dmAquExim1.NCBI.p_ctg.fasta/gatk_db_import/0005.txt

        tar --overwrite -cf results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0005.tar results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0005

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job gvcf2DB since they might be corrupted:
results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0005

This same error was thrown for DB_L0000 through DB_L0006, although not necessarily in that order. The order was 5, 2, 3, 4, 1, 0, 6 in case that matters. Figured I'd just type that instead of pasting a huge chunk of repetitive errors, but let me know if you'd like to see them all.

Then I got:

Removing output files of failed job gvcf2DB since they might be corrupted:
results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0000
tar: Option --overwrite is not supported
Usage:
  List:    tar -tf <archive-filename>
  Extract: tar -xf <archive-filename>
  Create:  tar -cf <archive-filename> [filenames...]
  Help:    tar --help
[Wed Mar  6 16:31:50 2024]
Error in rule gvcf2DB:
    jobid: 54
    input: results/dmAquExim1.NCBI.p_ctg.fasta/gvcfs/jon_test_1.g.vcf.gz, results/dmAquExim1.NCBI.p_ctg.fasta/gvcfs/jon_test_1.g.vcf.gz.tbi, results/dmAquExim1.NCBI.p_ctg.fasta/intervals/db_intervals/0006-scattered.interval_list, results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_mapfile.txt
    output: results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0006, results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0006.tar
    log: logs/dmAquExim1.NCBI.p_ctg.fasta/gatk_db_import/0006.txt (check log file(s) for error details)
    conda-env: /Users/johns/test/.snakemake/conda/ed0cf9835b6fd7e510d59bd0c0312a92_
    shell:

        export TILEDB_DISABLE_FILE_LOCKING=1
        gatk GenomicsDBImport             --java-options '-Xmx25600m -Xms25600m'             --genomicsdb-shared-posixfs-optimizations true             --batch-size 25             --genomicsdb-workspace-path results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0006             --merge-input-intervals             -L results/dmAquExim1.NCBI.p_ctg.fasta/intervals/db_intervals/0006-scattered.interval_list             --tmp-dir /var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T             --sample-name-map results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_mapfile.txt &> logs/dmAquExim1.NCBI.p_ctg.fasta/gatk_db_import/0006.txt

        tar --overwrite -cf results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0006.tar results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0006

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job gvcf2DB since they might be corrupted:
results/dmAquExim1.NCBI.p_ctg.fasta/genomics_db_import/DB_L0006
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-06T162443.659152.snakemake.log

I've attached the log files that were generated for the gatk_db_import step in case they're useful.

Could the fact that I'm using zsh instead of bash be an issue?

0000.txt 0001.txt 0002.txt 0003.txt 0004.txt 0005.txt 0006.txt

cademirch commented 3 months ago

Ah, sorry about this. I've made a branch osx-fixes to document these incompatibilities and fix them. This has to do with macOS shipping with bsdtar, whereas most Linux distros ship with GNU tar; the former does not support --overwrite, but the latter does. I've removed that option in the osx-fixes branch and it works on my M2 Mac.
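
If you want to check which tar you have, or to use GNU tar on macOS anyway (a side note; it isn't required once the branch drops --overwrite), something along these lines should work with Homebrew:

tar --version          # stock macOS prints "bsdtar ..."
brew install gnu-tar   # installs GNU tar as "gtar"
gtar --version         # prints "tar (GNU tar) ..."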

Go ahead and pull that branch and give it a try. Thanks again for all your help!

jasonwjohns commented 3 months ago

Thank you for all of your help! The workflow seems to have run well, both with the ecoli sample data provided and on the single sample that I ran with some old data of mine.

The one thing I didn't find for either run was the QC output. I'm not super surprised that it didn't work with my data, as I may not have set a configuration properly, but should the ecoli test run have produced a "..._qx.html" file?

I noticed on the sample QC output from your publication that an interactive map can be produced if a .coords file is provided. It says to see the project README for more details, but I didn't find those details in the README. Is this something that's easily available? If not, it's no big deal, of course.

Thanks again!

cademirch commented 3 months ago

Glad it worked. For the QC dashboard, there are some admittedly undocumented requirements, namely you need at least 2 samples for a given reference genome in your sample sheet. The ecoli dataset only has 2 samples spread across 2 reference genomes (one sample per genome), so it doesn't generate the QC dashboard. This is because really small datasets typically don't generate enough SNPs to perform some of the analyses required for the QC dashboard.

Thanks for pointing out the maps issue - you can provide decimal latitude and longitude values via the sample sheet with the columns lat and long, respectively.

Edit: You can reference the QC maps test sample sheet to see how to set it up.
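
Purely for illustration, a sample-sheet fragment with coordinates might look like the lines below; the lat and long columns are the relevant part, while the other column names and values are placeholders, so follow the linked test sample sheet for the real layout:

BioSample,refGenome,Run,lat,long
sample_A,my_reference,RUN_A,36.97,-122.03
sample_B,my_reference,RUN_B,38.54,-121.74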

I've added these bits to the docs in #164.

jasonwjohns commented 3 months ago

I thought that might be the case re: the multiple samples requirement.

This is all really great. Sounds like I'm off and running. Thanks so much Cade!

cademirch commented 3 months ago

Great. I'll close this for now. Don't hesitate to open a new issue with other problems and/or feedback!