Schraiber opened this issue 5 days ago
Hi, Josh. The artifacts for the PNAS paper were generated using Nextflow pipelines in another repo: https://github.com/harrispopgen/mushi-pipelines
The pipeline outputs pickled mushi history objects, and those are read by the plotting notebook to generate plots. The pickled objects are not git tracked, so if you need them, you could try running the nextflow pipeline, which will produce an output/ directory with all these intermediate results.
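Once you have that output/ directory, the pickles should load with plain pickle. Here's a rough sketch (the filename below is just a placeholder, and you'll need mushi installed in the environment so the unpickler can resolve the history classes):

import pickle
from pathlib import Path

# placeholder path: point this at one of the pickles the pipeline writes under output/
pkl_path = Path("output/example_history.pkl")

# mushi must be importable here so pickle can reconstruct the history object
with open(pkl_path, "rb") as f:
    history = pickle.load(f)

# inspect what you got; the plotting notebook consumes objects like this
print(type(history))
print(history)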
Great, I'll give this a shot!
I've put a bit of effort into fixing some deprecation errors, but I've hit a point where I can't figure out what's going on.
I get to this point:
executor > slurm (68)
[74/4a1675] process > mask (22) [100%] 22 of 22 ✔
[cd/9f155c] process > ancestor (22) [100%] 22 of 22 ✔
[09/7ce655] process > masked_size (9) [ 0%] 0 of 13
[- ] process > masked_size_total -
[d4/0271c3] process > mutation_types (8) [ 22%] 4 of 18, failed: 4
[- ] process > ksfs -
[- ] process > ksfs_total -
[- ] process > eta_sweep -
[- ] process > mu_sweep -
[- ] process > bootstrap -
[- ] process > europulse -
[55/ef6f42] process > eta_Tennessen (2) [100%] 2 of 2 ✔
[6e/1134ad] process > eta_Relate (5) [100%] 5 of 5 ✔
[- ] process > europulse_Tennessen -
[- ] process > europulse_Relate -
[- ] process > mush_ref -
[- ] process > mush -
[- ] process > mush_ref_folded -
[- ] process > mush_folded -
Staging foreign file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
at which point I get the error:
Error executing process > 'mutation_types (4)'
Caused by:
Process `mutation_types (4)` terminated with an error exit status (255)
Command executed:
# NOTE: sample NA18498 is missing in hg38 release of low coverage data, see https://www.biorxiv.org/content/10.1101/600254v1
# tail -n +2 integrated_call_samples.tsv | cut -f1 | grep -v NA18498 > all_samples.txt
tail -n +2 integrated_call_samples.tsv | cut -f1 > all_samples.txt
bcftools view -S all_samples.txt -c 1:minor -R mask.bed -m2 -M2 -v snps -f PASS -Ou snps.vcf.gz | bcftools view -g ^miss -Ou | mutyper variants ancestor.fa - --strict --k 3 | bcftools convert -Oz -o mutation_types.vcf.gz
Command exit status:
255
Command output:
(empty)
Command error:
[E::hts_open_format] Failed to open file "snps.vcf.gz" : No such file or directory
Failed to read from snps.vcf.gz: No such file or directory
Failed to read from standard input: unknown file type
Traceback (most recent call last):
File "/home1/schraibe/.conda/envs/1KG/bin/mutyper", line 8, in <module>
sys.exit(main())
File "/home1/schraibe/.conda/envs/1KG/lib/python3.8/site-packages/mutyper/cli.py", line 474, in main
args.func(args)
File "/home1/schraibe/.conda/envs/1KG/lib/python3.8/site-packages/mutyper/cli.py", line 150, in variants
vcf = cyvcf2.VCF(args.vcf)
File "cyvcf2/cyvcf2.pyx", line 278, in cyvcf2.cyvcf2.VCF.__init__
File "cyvcf2/cyvcf2.pyx", line 211, in cyvcf2.cyvcf2.HTSFile._open_htsfile
OSError: - is not valid bcf or vcf (format: 15 mode: r)
Failed to open -: unknown file type
Work dir:
/scratch1/schraibe/mushi-pipelines/1KG/work/d4/926b615cb9407e0a3cd579d31ee431
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
It looks like snps.vcf.gz is missing (which it is!). I haven't actually used Nextflow before, and I don't see a command where it's being created. I'm also not sure what's supposed to be in it, so I'm not sure how to recreate it.
I think this has to do with where the 30x 1KGP VCFs live. What do you have on this line? https://github.com/harrispopgen/mushi-pipelines/blob/222be50ac80c05db390916c46302486ad2ee8602/1KG/1KG.nf#L4
That is our local path to the VCFs. They are large, so we copied them locally to avoid repeatedly streaming them in pipeline reruns. The path variable shows up a few lines down, where we define a pattern for the VCF file paths: https://github.com/harrispopgen/mushi-pipelines/blob/222be50ac80c05db390916c46302486ad2ee8602/1KG/1KG.nf#L29-L33
So you'll need to download the 30x 1KGP VCFs matching that pattern from here, or adjust the pipeline to stream them in if you prefer. Note that we wrote this around an early 30x release (we cite the preprint in ref. 28 of the PNAS paper), so the latest 30x VCF release may differ from what we used in our pipeline.
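If it helps, here's a rough sanity check you could run after downloading, before kicking off the pipeline again. This is only a sketch: the directory and filename glob below are placeholders rather than the actual release names, so adjust them to match whatever pattern you end up putting on those lines of 1KG.nf.

from glob import glob
from pathlib import Path

# placeholder: wherever you downloaded the 30x 1KGP per-chromosome VCFs locally
vcf_dir = Path("/path/to/1KGP_30x_vcfs")

# placeholder glob for one chromosome's VCF; edit to match the filenames of the release you download
pattern = "*_chr{chrom}.*.vcf.gz"

# check the 22 autosomes (the "22 of 22" tasks in your log)
missing = [str(c) for c in range(1, 23)
           if not glob(str(vcf_dir / pattern.format(chrom=c)))]

print("chromosomes with no matching VCF:", missing or "none")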
Ah, I see, I should download the VCFs myself. That makes sense. I actually did change that line, but I figured the downloading was part of the pipeline or something. I'll try to get this fixed up. Thanks for your quick responses!
I'm hoping it would be possible to have the demographies inferred in the PNAS paper on GitHub so that I can use them in some downstream analyses? I don't think I see them in this repo.