raphael-group / hatchet

HATCHet (Holistic Allele-specific Tumor Copy-number Heterogeneity) is an algorithm that infers allele and clone-specific CNAs and WGDs jointly across multiple tumor samples from the same patient, and that leverages the relationships between clones in these samples.
BSD 3-Clause "New" or "Revised" License

Error running combined-counts with phased SNPs #165

Open wir963 opened 2 years ago

wir963 commented 2 years ago

Hey,

When I try to run the demo using Snakemake rules, I get two errors.

The first error is for count_alleles: it complains that the phased SNP files don't follow the same naming convention as the unphased ones (they aren't split and named by chromosome, which is apparently required when using --chromosomes). I fixed this by manually renaming the phased SNPs file.
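For anyone hitting the same thing, the manual fix can be sketched roughly like this. This is a hypothetical helper, not part of HATCHet; it splits a concatenated phased VCF into one file per chromosome so the per-chromosome naming that --chromosomes expects is satisfied (headers are copied into every output file).

```python
import gzip
import os


def split_vcf_by_chromosome(vcf_path, out_dir):
    """Split a (optionally gzipped) multi-chromosome VCF into one
    plain-text VCF per chromosome, named <chrom>.vcf under out_dir.
    Returns the sorted list of chromosomes seen."""
    os.makedirs(out_dir, exist_ok=True)
    opener = gzip.open if vcf_path.endswith(".gz") else open
    header, handles = [], {}
    try:
        with opener(vcf_path, "rt") as fh:
            for line in fh:
                if line.startswith("#"):
                    # Collect header lines; they go into every split file.
                    header.append(line)
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    out = os.path.join(out_dir, f"{chrom}.vcf")
                    handles[chrom] = open(out, "w")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for h in handles.values():
            h.close()
    return sorted(handles)
```

A real run would then bgzip and index the outputs before handing them to count_alleles.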

The second error is for combine_counts.

rule hatchet_combine_counts:
    input: output/count_reads/phased-snps/total.tsv, output/baf/phased-snps/tumor.1bed
    output: output/bb/phased-snps/bulk.bb
    jobid: 0
    wildcards: phasing=phased
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp

[2022-Sep-13 17:23:04]# Parsing and checking input arguments
Identified 1 chromosomes.

        baffile: output/baf/phased-snps/tumor.1bed
        outfile: output/bb/phased-snps/bulk.bb
        sample_names: ['normal', 'tumor1', 'tumor2', 'tumor3']
        min_snp_reads: 3000
        min_total_reads: 5000
        use_chr: True
        processes: 1
        chromosomes: ['chr22']
        array: output/count_reads
        totalcounts: output/count_reads/phased-snps/total.tsv
        phase: None
        blocksize: 25000
        max_snps_per_block: 10
        test_alpha: 0.1
        multisample: True
        ref_version: hg19
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/site-packages/hatchet/utils/combine_counts.py", line 1068, in run_chromosome_wrapper
    run_chromosome(*param)
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/site-packages/hatchet/utils/combine_counts.py", line 1064, in run_chromosome
    raise e
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/site-packages/hatchet/utils/combine_counts.py", line 1009, in run_chromosome
    bins_q = adaptive_bins_arm(
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/site-packages/hatchet/utils/combine_counts.py", line 270, in adaptive_bins_arm
    assert len(snp_positions) == len(snp_thresholds) - 1, (
AssertionError: (251142, 251384)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/robinsonwi/miniconda3/envs/hatchet/bin/hatchet", line 8, in <module>
    sys.exit(main())
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/site-packages/hatchet/__main__.py", line 61, in main
    globals()[command](args)
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/site-packages/hatchet/utils/combine_counts.py", line 98, in main
    p.map(run_chromosome_wrapper, params)
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/data/robinsonwi/miniconda3/envs/hatchet/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError: (251142, 251384)
[Tue Sep 13 17:23:06 2022]
Error in rule hatchet_combine_counts:
    jobid: 0
    output: output/bb/phased-snps/bulk.bb
    shell:
        hatchet combine-counts --refversion hg19 --totalcounts output/count_reads/phased-snps/total.tsv --array output/count_reads --baffile output/baf/phased-snps/tumor.1bed --msr 3000 --mtr 5000 --outfile output/bb/phased-snps/bulk.bb 
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
vineetbansal commented 2 years ago

Hi @wir963 - it seems to me that your entire Snakemake workflow runs the HATCHet demo and can start from scratch, i.e. it downloads all the inputs it needs from the web and, in theory, should run end-to-end without any external inputs. Is that correct?

If that is so, let me try to run your entire Snakemake workflow - I'm sure I'll hit the same errors you list here, and I can fix them faster that way. Once we've fixed the remaining issues, would you consider sending a PR so we can incorporate your Snakefile and other useful supplementary files (with proper attribution to you, of course) directly into HATCHet, for other users who might want the same setup? I imagine this would help a lot of other users too.

wir963 commented 2 years ago

Hey @vineetbansal,

Yep, that's the idea. It's a little awkward right now because I'm using the installed version of HATCHet instead of the conda version (conda and Snakemake play very nicely together, FWIW), but I think you'll be able to figure that part out. I haven't done an end-to-end test recently, but I just pushed the latest code and that should work for you - let me know if you run into any issues.

Snakemake and HPCs play nicely together too so you can probably use my submit-to-biowulf.sh script to do the same thing on your HPC with a few tweaks.

Yeah, happy to do a PR for that once it's working. You could even include it in your test suites if you wanted.

Best, Welles

wir963 commented 1 year ago

Hey @vineetbansal,

I'm just checking in on this issue. Were you able to run my code? Let me know if you have any questions - I'm happy to help.

Best, Welles

brian-arnold commented 1 year ago

Hi Welles, I'm looking into this and will report back ASAP! Thanks for catching that corner case re: the phasing pipeline stripping the chr notation.

brian-arnold commented 1 year ago

Hi Welles,

To address your first comment: phased SNPs do actually follow the same conventions as unphased SNPs. I think what you're seeing is this. Because the reference panel we use does not use chr notation (i.e. chromosome 1 is "1", not "chr1") but the input BAMs for the demo do, we convert the VCF files to the reference panel's naming convention before phasing, removing the "chr" prefix if it exists. All VCFs are then phased (*_phased.vcf.gz in the phase directory; the output of SHAPEIT), and the chr prefix is added back in (*_toConcat.vcf.gz) before the chromosome-specific VCF files are concatenated into one phased VCF for downstream use.

So, the *_phased.vcf.gz files aren't named with "chr", but the *_toConcat.vcf.gz files are, if that is how chromosomes were named in the input BAMs. Which files are you using for the count_alleles step?
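The strip-then-restore round trip described above can be illustrated with a small sketch (illustrative only; the actual pipeline does this as part of its phasing scripts, not with this function):

```python
def rename_vcf_chrom(line, to_chr):
    """Rewrite the CHROM column of a single VCF line.

    to_chr=False strips a leading 'chr' (to match a reference panel
    that names chromosomes '1'..'22'); to_chr=True adds it back (to
    match BAMs that use 'chr1'..'chr22'). Header lines (starting
    with '#') pass through unchanged.
    """
    if line.startswith("#"):
        return line
    chrom, rest = line.split("\t", 1)
    if to_chr and not chrom.startswith("chr"):
        chrom = "chr" + chrom
    elif not to_chr and chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom + "\t" + rest
```

Applying it with to_chr=False before phasing and to_chr=True afterwards returns each record to its original naming, which is why the final concatenated VCF matches the BAMs again.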

Trying to reproduce your second error ASAP.

Brian

wir963 commented 1 year ago

Hey Brian,

It's actually the naming of the files that is causing the issue. My understanding is that if you pass --chromosomes, then HATCHet expects the VCF files to be split by chromosome. However, as you said, "the chromosome-specific VCF files are concatenated into one phased VCF for downstream use" (for phase-snps, though not for genotype-snps), which makes the output incompatible with the --chromosomes argument. Also, I don't think genotype-snps and phase-snps should produce differently structured outputs: if you concatenate for phase-snps, why not concatenate for genotype-snps, and vice versa?

I am using the output of phase-snps (manually renamed to fix the above error) for the count_alleles step.

Best, Welles

brian-arnold commented 1 year ago

Hi Welles, by specifying chromosomes="chr22" under the [run] section of the hatchet.ini file, I'm able to get the chr22 demo working end-to-end with a custom install of the master branch (using CBC as the solver instead of Gurobi). This works both with and without phasing, for me at least.
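For reference, that setting looks roughly like this in hatchet.ini (exact quoting may vary between HATCHet versions):

```ini
[run]
# Restrict the whole pipeline to chromosome 22, as in the demo
chromosomes = chr22
```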

But you are correct that combine_counts reads in the entire phase file and is not chromosome-specific. This is the step where HATCHet stops being chromosome-specific, since it eventually has to cluster bins across the entire genome, so we pass it a VCF file with information across all chromosomes processed up to that point in the pipeline.

I do agree that phase_snps and genotype_snps should probably have similar inputs and outputs. For instance, genotype_snps accepts a --chromosomes argument, but phase_snps simply phases however many VCFs were produced by genotype_snps, so the only way to parallelize it is to request a node with many CPUs.

Brian
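A rough sketch of how phase_snps could instead fan out over the per-chromosome VCFs itself (phase_one is a hypothetical stand-in for one SHAPEIT run; a real version would shell out with subprocess.run):

```python
from multiprocessing.pool import ThreadPool


def phase_one(vcf_path):
    # Hypothetical stand-in for phasing a single per-chromosome VCF;
    # in HATCHet this would invoke the external phasing tool (SHAPEIT).
    return vcf_path.replace(".vcf.gz", "_phased.vcf.gz")


def phase_all(vcf_paths, processes=4):
    # One worker per chromosome VCF. Threads are enough here because
    # the real work would happen in external phasing subprocesses,
    # so we don't need a single node with many CPUs per task.
    with ThreadPool(processes) as pool:
        return pool.map(phase_one, vcf_paths)
```

pool.map preserves input order, so the phased outputs come back in the same per-chromosome order they were submitted.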