harrispopgen / mushi-pipelines

Data analysis pipelines for the mushi paper
MIT License
0 stars 0 forks source link

Demographies from published paper #2

Open Schraiber opened 5 days ago

Schraiber commented 5 days ago

I'm hoping that it'd be possible to have the demographies inferred in the PNAS paper on the github so that I can use them in some downstream analyses? I don't think I see them on here.

wsdewitt commented 5 days ago

Hi, Josh. The artifacts for the PNAS paper were generated using Nextflow pipelines in another repo: https://github.com/harrispopgen/mushi-pipelines

image image

The pipeline outputs pickled mushi history objects, and those are read by the plotting notebook to generate plots. The pickled objects are not git tracked, so if you need them, you could try running the nextflow pipeline, which will produce an output/ directory with all these intermediate results.

Schraiber commented 5 days ago

Great I'll give this a shot!

Schraiber commented 5 days ago

I've put a bit of effort into fixing some deprecation errors and I think I've hit a point where I can't seem to figure out what's going on.

I get to this point:

[74/4a1675] process > mask (22)           [100%] 22 of 22 ✔                                                                                                      
[cd/9f155c] process > ancestor (22)       [100%] 22 of 22 ✔                                                                                                      
[a2/8c1a06] process > masked_size (8)     [  0%] 0 of 22                                                                                                         
[-        ] process > masked_size_total   -                                                                                                                      
[d4/0271c3] process > mutation_types (8)  [  0%] 0 of 22                                                                                                         
[-        ] process > ksfs                -                                                                                                                      
[-        ] process > ksfs_total          -                                                                                                                      
[-        ] process > eta_sweep           -                                                                                                                      
[-        ] process > mu_sweep            -                                                                                                                      
[-        ] process > bootstrap           -                                                                                                                      
[-        ] process > europulse           -                                                                                                                      
[55/ef6f42] process > eta_Tennessen (2)   [100%] 2 of 2 ✔                                                                                                        
[6e/1134ad] process > eta_Relate (5)      [100%] 5 of 5 ✔                                                                                                        
[-        ] process > europulse_Tennessen -                                                                                                                      
[-        ] process > europulse_Relate    -                                                                                                                      
[-        ] process > mush_ref            -                                                                                                                      
[-        ] process > mush                -                                                                                                                      
executor >  slurm (68)                                                                                                                                           
[74/4a1675] process > mask (22)           [100%] 22 of 22 ✔                                                                                                      
[cd/9f155c] process > ancestor (22)       [100%] 22 of 22 ✔
[09/7ce655] process > masked_size (9)     [  0%] 0 of 13
[-        ] process > masked_size_total   -
[d4/0271c3] process > mutation_types (8)  [ 22%] 4 of 18, failed: 4
[-        ] process > ksfs                -
[-        ] process > ksfs_total          -
[-        ] process > eta_sweep           -
[-        ] process > mu_sweep            -
[-        ] process > bootstrap           -
[-        ] process > europulse           -
[55/ef6f42] process > eta_Tennessen (2)   [100%] 2 of 2 ✔
[6e/1134ad] process > eta_Relate (5)      [100%] 5 of 5 ✔                                                                                                        
[-        ] process > europulse_Tennessen -
[-        ] process > europulse_Relate    -
[-        ] process > mush_ref            -
[-        ] process > mush                -
[-        ] process > mush_ref_folded     -
[-        ] process > mush_folded         -
Staging foreign file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

at which point I get the error:

Error executing process > 'mutation_types (4)'

Caused by:
  Process `mutation_types (4)` terminated with an error exit status (255)

Command executed:

  # NOTE: sample NA18498 is missing in hg38 release of low coverage data, see https://www.biorxiv.org/content/10.1101/600254v1
  # tail -n +2 integrated_call_samples.tsv | cut -f1 | grep -v NA18498 > all_samples.txt
  tail -n +2 integrated_call_samples.tsv | cut -f1 > all_samples.txt
  bcftools view -S all_samples.txt -c 1:minor -R mask.bed -m2 -M2 -v snps -f PASS -Ou snps.vcf.gz | bcftools view -g ^miss -Ou | mutyper variants ancestor.fa - --strict --k 3 | bcftools convert -Oz -o mutation_types.vcf.gz

Command exit status:
  255

Command output:
  (empty)

Command error:
  [E::hts_open_format] Failed to open file "snps.vcf.gz" : No such file or directory
  Failed to read from snps.vcf.gz: No such file or directory
  Failed to read from standard input: unknown file type
  Traceback (most recent call last):
    File "/home1/schraibe/.conda/envs/1KG/bin/mutyper", line 8, in <module>
      sys.exit(main())
    File "/home1/schraibe/.conda/envs/1KG/lib/python3.8/site-packages/mutyper/cli.py", line 474, in main
      args.func(args)
    File "/home1/schraibe/.conda/envs/1KG/lib/python3.8/site-packages/mutyper/cli.py", line 150, in variants
      vcf = cyvcf2.VCF(args.vcf)
    File "cyvcf2/cyvcf2.pyx", line 278, in cyvcf2.cyvcf2.VCF.__init__
    File "cyvcf2/cyvcf2.pyx", line 211, in cyvcf2.cyvcf2.HTSFile._open_htsfile
  OSError: - is not valid bcf or vcf (format: 15 mode: r)
  Failed to open -: unknown file type

Work dir:
  /scratch1/schraibe/mushi-pipelines/1KG/work/d4/926b615cb9407e0a3cd579d31ee431

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

It looks like snps.vcf.gz is missing (which it is!). I've not actually used nextflow before but I don't see a command where it's being created? and I'm not sure what's supposed to be in there, so I'm not sure how to recreate it.

wsdewitt commented 4 days ago

I think this has to do with where the 30x 1KGP VCFs live. What do you have on this line?https://github.com/harrispopgen/mushi-pipelines/blob/222be50ac80c05db390916c46302486ad2ee8602/1KG/1KG.nf#L4 This is our local path to the VCFs. They are large, so we copied them locally to avoid repeatedly streaming them in pipeline reruns. This path variable shows up a few lines down where we define a pattern for the VCF file paths: https://github.com/harrispopgen/mushi-pipelines/blob/222be50ac80c05db390916c46302486ad2ee8602/1KG/1KG.nf#L29-L33 So you'll need to download the 30x 1KGP VCFs that look like that pattern from here, or adjust the pipeline to stream them in if you prefer. Note that we wrote this around an early 30x release (we cite the preprint in ref 28 of the PNAS paper), so the latest 30x VCF release may be different from what we used in our pipeline.

Schraiber commented 1 day ago

Ah I see, I should download the VCFs myself. that makes sense. I actually did change that line but I figured the downloading was part of the pipeline or something. I'll try to get this fixed up. Thanks for your quick responses!