harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks source link

running QC module #191

Open jasonwjohns opened 1 month ago

jasonwjohns commented 1 month ago

Hi guys,

I'm trying to re-run the QC module on some snpArcher output from CCGP data. I can't seem to figure out the below error, as the input file is indeed in place.

Assuming unrestricted shared filesystem usage. Building DAG of jobs... MissingInputException in rule vcftools_individuals in file /Users/johns/snpArcher/workflow/modules/qc/Snakefile, line 26: Missing input files for rule vcftools_individuals: output: results/A.eximia/QC/32-Aquilegia.idepth, results/A.eximia/QC/32-Aquilegia.imiss, results/A.eximia/QC/32-Aquilegia.samps.txt, results/A.eximia/QC/32-Aquilegia.FILTER.summary, results/A.eximia/QC/32-Aquilegia.het wildcards: refGenome=A.eximia, prefix=32-Aquilegia affected files: results/A.eximia/32-Aquilegia_raw.vcf.gz

I have the snpArcher repo in my home directory and the file structure of my project directory is as follows: `(snparcher) johns@lscgtech2016mbpro16 snparcher_output % ls -lhR total 0 drwxr-xr-x 5 johns staff 160B May 6 10:52 results

./results: total 0 drwxr-xr-x 13 johns staff 416B May 6 11:36 A.eximia drwxr-xr-x 4 johns staff 128B May 6 10:53 og_output

./results/A.eximia: total 334018176 -rwx------@ 1 johns staff 168K Apr 13 14:54 32-Aquilegia_README.pdf -rwx------ 1 johns staff 5.0M Apr 13 14:54 32-Aquilegia_callable_sites.bed -rwx------ 1 johns staff 137G Apr 13 16:21 32-Aquilegia_raw.vcf.gz -rwx------ 1 johns staff 518K Apr 13 14:54 32-Aquilegia_raw.vcf.gz.tbi drwxr-xr-x 3 johns staff 96B May 6 11:20 QC -rwx------ 1 johns staff 23G Apr 18 14:16 all_samples.d4 drwxr-xr-x 4 johns staff 128B May 6 10:36 config -rwx------@ 1 johns staff 0B Apr 13 16:21 done_downloading.txt drwx------ 792 johns staff 25K Apr 13 15:18 gvcfs

./results/A.eximia/QC: total 0

./results/A.eximia/config: total 72 -rw-r--r--@ 1 johns staff 431B May 6 11:31 config.yaml -rw-r--r--@ 1 johns staff 32K May 6 10:36 postprocess_samples.csv

./results/A.eximia/gvcfs: total 483970536 -rwx------ 1 johns staff 493M Apr 13 14:55 Aquexi_Ae101.g.vcf.gz -rwx------ 1 johns staff 152K Apr 13 14:54 Aquexi_Ae101.g.vcf.gz.tbi -rwx------ 1 johns staff 653M Apr 13 14:56 Aquexi_Ae102.g.vcf.gz -rwx------ 1 johns staff 156K Apr 13 14:54 Aquexi_Ae102.g.vcf.gz.tbi -rwx------ 1 johns staff 810M Apr 13 14:56 Aquexi_Ae103.g.vcf.gz -rwx------ 1 johns staff 161K Apr 13 14:54 Aquexi_Ae103.g.vcf.gz.tbi -rwx------ 1 johns staff 657M Apr 13 14:56 Aquexi_Ae104.g.vcf.gz -rwx------ 1 johns staff 157K Apr 13 14:54 Aquexi_Ae104.g.vcf.gz.tbi etc. etc. etc. `

My config.yaml file looks like: `##############################

Variables you need to change

##############################

samples: "config/postprocess_samples.csv" # name of the sample metadata CSV final_prefix: "32-Aquilegia" # prefix for final output files tmp_dir: "tmp/"

##############################

Variables you might need to change

##############################

QC options

nClusters: 3 GoogleAPIKey: min_depth: 2 scaffolds_to_exclude: ` I added the scaffolds_to_exclude line because I got an error previously.

Thank you in advance for any guidance you can provide! Jason

Thanks in advance for any help you can give! Jason

cademirch commented 1 month ago

Hi Jason, what was your command line?

Edit: Based on what your directory structure, should be something like: snakemake -s <path to snpArcher repo>/workflow/modules/qc/Snakefile -d results/A.eximia

jasonwjohns commented 1 month ago

Hey Cade,

Thank you for answering my question with the right question. The cure ended up being pretty elementary.

I think my directory structure is now set, where I can call the below from the command line:

cd /Volumes/hodges.lab/projects/ccgp/aquilegia/snparcher_output snakemake -s /Users/johns/snpArcher/workflow/modules/qc/Snakefile -d . --cores 22

The module workflow gets going just fine now, but is now tripping up at the vcftools_individuals step. As usual, I've done as much troubleshooting as I can, but I'm at a wall. Hopefully for your sake, the solution is less obvious than last time. I'm wondering if it's the syntax error, or something else. Here's the full output:

`(snparcher) johns@lscgtech2016mbpro16 snparcher_output % snakemake -s /Users/johns/snpArcher/workflow/modules/qc/Snakefile -d . --cores 22 Assuming unrestricted shared filesystem usage. Building DAG of jobs... Using shell: /bin/bash Provided cores: 22 Rules claiming more threads will be scaled down. Conda environments: ignored Job stats: job count


admixture 1 all 1 check_fai 1 generate_coords_file 1 plink 1 qc_plots 1 setup_admixture 1 subsample_snps 1 vcftools_individuals 1 total 9

Select jobs to execute... Execute 3 jobs...

[Mon May 6 15:09:08 2024] localrule vcftools_individuals: input: results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz output: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.idepth, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.imiss, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.samps.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.FILTER.summary, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.het log: logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt jobid: 4 reason: Missing output files: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.idepth, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.imiss, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.het, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.samps.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.FILTER.summary wildcards: refGenome=20240328.dmAquExim1.NCBI.hap2, prefix=32-Aquilegia resources: tmpdir=/var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T, mem_mb=4000, mem_mib=3815

[Mon May 6 15:09:08 2024] localrule generate_coords_file: output: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.coords.txt jobid: 8 reason: Missing output files: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.coords.txt wildcards: refGenome=20240328.dmAquExim1.NCBI.hap2, prefix=32-Aquilegia resources: tmpdir=/var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T

[Mon May 6 15:09:08 2024] localrule check_fai: input: results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz, results/20240328.dmAquExim1.NCBI.hap2/data/genome/20240328.dmAquExim1.NCBI.hap2.fna.fai output: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_fai_tmp.txt jobid: 5 reason: Missing output files: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia_fai_tmp.txt wildcards: refGenome=20240328.dmAquExim1.NCBI.hap2, prefix=32-Aquilegia resources: tmpdir=/var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T, mem_mb=1000, mem_mib=954

[Mon May 6 15:09:09 2024] Finished job 5. 1 of 9 steps (11%) done [Mon May 6 15:09:09 2024] Finished job 8. 2 of 9 steps (22%) done /bin/bash: -c: line 2: syntax error near unexpected token `>' [Mon May 6 15:32:42 2024] Error in rule vcftools_individuals: jobid: 4 input: results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz output: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.idepth, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.imiss, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.samps.txt, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.FILTER.summary, results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.het log: logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt (check log file(s) for error details) conda-env: /Users/johns/temp/ccgp_popgen/aq_postprocess/snparcheroutput/.snakemake/conda/4d0d07ef63adf0bd1e7d79edb316ca6c shell:

    vcftools --gzvcf results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz --FILTER-summary --out results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia &> logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt
    vcftools --gzvcf results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz --out results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia --depth &>> logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt
    vcftools --gzvcf results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz --out results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia --het &>> logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt
    vcftools --gzvcf results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz --out results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia --missing-indv &>> logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt
    tail -n +2 results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.idepth | awk '$3>2 {print $1}'> results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.samps.txt 2>> logs/20240328.dmAquExim1.NCBI.hap2/QC/vcftools_individuals/32-Aquilegia.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job vcftools_individuals since they might be corrupted: results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia.FILTER.summary Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: .snakemake/log/2024-05-06T150908.843660.snakemake.log WorkflowError: At least one job did not complete successfully. `

I didn't find anything especially useful in the logs file, but here it is in case it means something to you:

`VCFtools - 0.1.16 (C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted: --gzvcf results/20240328.dmAquExim1.NCBI.hap2/32-Aquilegia_raw.vcf.gz --FILTER-summary --out results/20240328.dmAquExim1.NCBI.hap2/QC/32-Aquilegia

Using zlib version: 1.2.12 Warning: Expected at least 2 parts in FORMAT entry: ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> Warning: Expected at least 2 parts in FORMAT entry: ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele frequency, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele frequency, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency, for each ALT allele, in the same order as listed"> Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency, for each ALT allele, in the same order as listed"> After filtering, kept 395 out of 395 Individuals Outputting Filter Summary (for bi-allelic loci only) After filtering, kept 58439177 out of a possible 58439177 Sites Run Time = 1413.00 seconds`

Thank you!! Jason

cademirch commented 1 month ago

Ah I think I've seen this before. The redirect syntax "&>>" wasn't introduced until bash 4.

So I would check your bash version. If I remember correctly, I saw this on a Mac, for some reason some macOS shipped with super old bash. If you are on Mac you should be able to update bash via brew.

Thanks for the detailed issue, and sorry for the annoying problem. I'll make sure to document this.

jasonwjohns commented 1 month ago

You're the man! The QC module ran well. Maybe I was missing something, but it didn't seem to use any of the conda environments build into the Snakemake workflow. I ended up just using my own conda environment and installing all the necessary packages manually. It was probably something I was doing wrong, but I thought I'd mention it just in case it was useful to you.

Thanks as always! Jason

cademirch commented 1 month ago

Of course. Did installing the softwares make the workflow work? This is an issue with the shell, not any of the tools.

Also make sure you have the --use-conda option. I'm not sure if using conda will solve the shell problem.

Or use the default profile:

To use the profile you would do snakemake -s snpArcher/workflow/modules/qc/Snakefile -d [your dir] --profile snpArcher/profiles/default

jasonwjohns commented 1 month ago

It's awesome how gracious you guys are about guiding me through this elementary stuff. Thanks sincerely, Cade!

Jason

On Wed, May 8, 2024 at 10:47 AM Cade Mirchandani @.***> wrote:

To use the profile you would do snakemake -s snpArcher/workflow/modules/qc/Snakefile -d [your dir] --profile snpArcher/profiles/default

— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/191#issuecomment-2101094604, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIYTVHE7RFHH5TV4VUNRENDZBJQJ3AVCNFSM6AAAAABHJRCTO2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBRGA4TINRQGQ . You are receiving this because you authored the thread.Message ID: @.***>

-- Postdoctoral Researcher One People One Reef https://onepeopleonereef.org/ - Bernardi Lab, UC Santa Cruz California Conservation Genomics Project https://www.ccgproject.org/ - Hodges Lab, UC Santa Barbara