Open wir963 opened 2 years ago
Hi @wir963 - it seems to me like your entire snakemake workflow is to run the HATCHet demo, and can start from scratch, i.e. it downloads all the inputs it needs from the web, and theoretically, it should be able to run end-to-end without needing any external inputs, is that correct?
If that is so, let me try to run your entire snakemake workflow - I'm sure I'll hit the same errors as you list here, and I can fix them faster that way. Also, if that is the case, once we've fixed all the remaining issues, I was wondering if you would consider sending a PR to HATCHet in the future so we can incorporate your Snakefile and other useful supplementary files (with proper attribution to you of course) directly into HATCHet. I'm imagining this will help a lot of other users who might want to use the same setup.
Hey @vineetbansal,
Yep, that's the idea. It's a little weird now because I'm using the installed version of HATCHet instead of the conda version (conda and snakemake play very nicely together FWIW) but I think you should be able to figure that part out. I haven't done an end-to-end test recently but I just pushed the latest code and that should work for you - let me know if you run into any issues.
Snakemake and HPCs play nicely together too so you can probably use my submit-to-biowulf.sh script to do the same thing on your HPC with a few tweaks.
Yeah, happy to do a PR for that once it's working. You could even include it in your test suites if you wanted.
Best, Welles
Hey @vineetbansal,
I'm just checking in on this issue. Were you able to run my code? Let me know if you have any questions - I'm happy to help.
Best, Welles
Hi Welles, I'm looking into this and will report back ASAP! Thanks for catching that corner case re. the phasing pipeline stripping the chr notation.
Hi Welles,
To address your first comment, phased SNPs do actually follow the same conventions as unphased SNPs. I think what you're seeing is this: because the reference panel we use does not use chr notation (i.e. chromosome 1 is "1" and not "chr1") but the input BAMs for the demo do, we convert VCF files to have the same naming convention as the reference panel before phasing, removing the "chr" prefix if it exists. All VCFs are then phased (*_phased.vcf.gz in the phase directory; output of SHAPEIT), and then the chr prefix is added back in (*_toConcat.vcf.gz) before all the chromosome-specific VCF files are concatenated into one phased VCF for downstream use. So, the *_phased.vcf.gz files aren't named with "chr" but the *_toConcat.vcf.gz files are, if this is how chromosomes were named in the input BAMs. Which files are you using for the count_alleles step?
Trying to reproduce your second error ASAP.
Brian
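For readers following the chr-prefix round trip described above, here is a minimal Python sketch of the idea. It is not HATCHet's actual implementation: the function names strip_chr and add_chr are hypothetical, and the sketch only touches the CHROM column of data lines (a real conversion would also need to rename contigs in the header).

```python
def strip_chr(line: str) -> str:
    """Remove a leading 'chr' from the CHROM column of a VCF data line.

    Hypothetical helper for illustration; header lines (starting with '#')
    are returned unchanged in this sketch.
    """
    if line.startswith("#"):
        return line
    chrom, sep, rest = line.partition("\t")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom + sep + rest


def add_chr(line: str) -> str:
    """Add a leading 'chr' to the CHROM column if it is not already there."""
    if line.startswith("#"):
        return line
    chrom, sep, rest = line.partition("\t")
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return chrom + sep + rest


if __name__ == "__main__":
    record = "chr22\t16050075\t.\tA\tG"
    stripped = strip_chr(record)   # naming used by the reference panel
    restored = add_chr(stripped)   # naming used by the input BAMs
    print(stripped)
    print(restored)
```

Both helpers are idempotent, so applying them to a file that already uses the target convention leaves it unchanged.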
Hey Brian,
It's actually the naming of the files that is causing the issue. My understanding is that if you pass --chromosomes then HATCHet expects the VCF files to be split by chromosome. However, like you said, "the chromosome-specific VCF files are concatenated into one phased VCF for downstream use" (for phase-snps, not for genotype-snps), which makes them incompatible with using the --chromosomes argument. Further, I don't think that genotype-snps and phase-snps should have different outputs. If you concatenate for phase-snps, why not concatenate for genotype-snps and vice-versa?
I am using the output of phase-snps (manually renamed to fix the above error) for the count_alleles step.
Best, Welles
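The mismatch discussed above (one concatenated phased VCF versus per-chromosome files) can be illustrated with a small Python sketch that groups the lines of a concatenated VCF by chromosome. This is only an illustration under stated assumptions: split_by_chromosome is a hypothetical helper, it works on in-memory lines rather than HATCHet's actual files, and it does not reflect the file names HATCHet expects.

```python
from collections import defaultdict


def split_by_chromosome(vcf_lines):
    """Group VCF lines by the value in their CHROM column.

    Hypothetical helper: every group gets a copy of the full header so each
    one could be written out as a standalone per-chromosome VCF.
    """
    header = []
    records = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        else:
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {chrom: header + lines for chrom, lines in records.items()}


if __name__ == "__main__":
    concatenated = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr21\t5\t.\tA\tG",
        "chr22\t7\t.\tC\tT",
        "chr21\t9\t.\tG\tA",
    ]
    for chrom, lines in split_by_chromosome(concatenated).items():
        print(chrom, len(lines))
```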
Hi Welles,
Specifying chromosomes="chr22" under the [run] section of the hatchet.ini file, I'm able to get the chr22 demo working end-to-end with a custom install of the master branch (using CBC as solver instead of Gurobi). This works with and without phasing, for me at least.
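As a rough illustration of the setting mentioned above, the following Python sketch shows what the relevant fragment of hatchet.ini might look like and reads it back with the standard library's configparser. Only the [run] section and the chromosomes="chr22" value come from this thread; the read_chromosomes helper, the quote stripping, and the whitespace splitting are assumptions for illustration and are not how HATCHet itself necessarily parses the file.

```python
import configparser

# Fragment of a hatchet.ini file; only this one key is taken from the thread.
INI_TEXT = """
[run]
chromosomes = "chr22"
"""


def read_chromosomes(ini_text: str) -> list:
    """Return the chromosomes listed under [run] as a list of names.

    Hypothetical helper: strips surrounding double quotes and splits on
    whitespace, which is an assumption about the value's format.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get("run", "chromosomes")
    return raw.strip('"').split()


if __name__ == "__main__":
    print(read_chromosomes(INI_TEXT))
```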
But you are correct in that combine_counts reads in the entire phase file and is not chromosome specific. This step is where hatchet is no longer specific to chromosomes as it eventually has to cluster bins across the entire genome. So we pass a VCF file that has info across all chromosomes processed up to that point in the pipeline.
I do agree that phase_snps and genotype_snps should probably have similar input and output. For instance, genotype_snps accepts the --chromosomes argument, but phase_snps phases however many VCFs get produced by genotype_snps, so the only way to parallelize it is to request a node with many CPUs.
Brian
Hey,
When I try to run the demo using Snakemake rules, I get two errors.
The first error is for count_alleles and it basically complains that the phased SNPs don't follow the same standard as the unphased SNPs (they aren't named by chromosomes, which is apparently required when using --chromosomes). I fixed this by manually renaming the phased snps file.
The second error is for combine_counts.