malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License

Validate results for phasing pipeline #73

Closed hardingnj closed 3 years ago

hardingnj commented 3 years ago

from @gbggrant

I've run the 167 Burkina Faso samples through the phasing pipeline. I haven't yet integrated your cohort_vcf_to_zarr script, but the VCFs for 2R and 3R are at gs://malariagen/Phasing/outputs/Ag1000Phase2_BurkinaFaso/Ag1000Phase2_BurkinaFaso_2R_phased.vcf.gz and gs://malariagen/Phasing/outputs/Ag1000Phase2_BurkinaFaso/Ag1000Phase2_BurkinaFaso_3R_phased.vcf.gz. Full outputs for the pipeline can be found at gs://malariagen/Phasing/outputs/Ag1000Phase2_BurkinaFaso/. Let me know if you can't read these (I can download them to lustre).

@hardingnj to run an H12 scan and check that we see the signals of natural selection we expect.
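
For readers unfamiliar with the statistic, here is a minimal sketch of how H12 can be computed from phased haplotypes. It is not the code used for this validation. H12 is taken here in the usual Garud et al. sense: with haplotype frequencies sorted in descending order, H12 = (p1 + p2)^2 + sum of p_i^2 for i > 2. The function names `h12` and `windowed_h12`, and the choice of non-overlapping windows counted in variants, are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def h12(haplotypes: Sequence[Hashable]) -> float:
    """H12 for one window: (p1 + p2)^2 + sum(p_i^2 for i > 2).

    `haplotypes` holds one hashable value per sampled haplotype,
    for example a tuple of alleles across the window.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    # Frequencies of distinct haplotypes, most common first.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    # The two most common haplotypes are pooled before squaring.
    return sum(freqs[:2]) ** 2 + sum(p ** 2 for p in freqs[2:])


def windowed_h12(haplotype_rows: Sequence[Sequence], window_size: int) -> List[float]:
    """H12 in non-overlapping windows of `window_size` variants.

    `haplotype_rows` has one row per haplotype; each row lists the
    alleles at every variant (all rows the same length). This layout
    is a hypothetical convenience, not the pipeline's data model.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    n_variants = len(haplotype_rows[0])
    values = []
    for start in range(0, n_variants - window_size + 1, window_size):
        window = [tuple(row[start:start + window_size]) for row in haplotype_rows]
        values.append(h12(window))
    return values
```

Values near 1 mean one or two haplotypes dominate the window, which is the pattern expected around a recent selective sweep. Values near 0 mean many haplotypes at low frequency.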

hardingnj commented 3 years ago

Hi @gbggrant, just wondering who owns the gs://malariagen bucket? I can see files, but I can't download them.

Copying gs://malariagen/genetic_maps/ag1000g/phase2/AR1/X.gmap...               
AccessDeniedException: 403 HttpError accessing <https://storage.googleapis.com/download/storage/v1/b/malariagen/o/genetic_maps%2Fag1000g%2Fphase2%2FAR1%2F2R.gmap?generation=1607974239579471&alt=media>: response: <{'x-guploader-uploadid': 'ABg5-Uy2UzVpHMg0YylQMsF7pOCnc8EsXngQ6KQ2lfKA06AVXnb0CBXASi0b0kznLyBwe2wB6U9DuHSLvCuWWDb1zko', 'content-type': 'text/html; charset=UTF-8', 'date': 'Mon, 11 Jan 2021 14:17:04 GMT', 'vary': 'Origin, X-Origin', 'expires': 'Mon, 11 Jan 2021 14:17:04 GMT', 'cache-control': 'private, max-age=0', 'content-length': '106', 'server': 'UploadServer', 'status': '403'}>, content <nicholas.harding@bdi.ox.ac.uk does not have storage.objects.get access to the Google Cloud Storage object.>

gbggrant commented 3 years ago

@hardingnj access is managed by a Google group, 'malariagen', which you are in (under nicholas.harding@bdi.ox.ac.uk). Can you confirm that's the user you are trying to access the files as?

hardingnj commented 3 years ago

I'm pretty sure (unless I am misunderstanding something). That's the email in the error message above, at least.

If I switch to my personal account, I can't even ls:

(base) njh@debian:~$ gcloud auth list
        Credentialed Accounts
ACTIVE  ACCOUNT
        nicholas.harding@bdi.ox.ac.uk
*       nickharding@gmail.com

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

(base) njh@debian:~$ gsutil ls gs://malariagen

AccessDeniedException: 403 nickharding@gmail.com does not have storage.objects.list access to the Google Cloud Storage bucket.

gbggrant commented 3 years ago

okay - it must be something about the way we have it configured.
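
The two 403 messages above name different missing permissions: `storage.objects.get` for the work account, which could list but not download, and `storage.objects.list` for the personal account. When comparing several such failures, it can help to pull the account and permission out of each message. The sketch below does that with a regular expression. The pattern is inferred only from the two messages quoted in this thread, not from any documented gsutil output format, and `parse_denied` and `DeniedAccess` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class DeniedAccess(NamedTuple):
    account: str      # the identity the request was made as
    permission: str   # the permission that identity was missing


# Assumed message shape, based only on the errors quoted in this thread:
#   "<account> does not have <storage.permission> access to ..."
_DENIED = re.compile(
    r"(?P<account>[\w.+-]+@[\w.-]+) does not have "
    r"(?P<permission>storage\.[\w.]+) access"
)


def parse_denied(message: str) -> Optional[DeniedAccess]:
    """Return the account and missing permission, or None if no match."""
    match = _DENIED.search(message)
    if match is None:
        return None
    return DeniedAccess(match.group("account"), match.group("permission"))


if __name__ == "__main__":
    errors = [
        "content <nicholas.harding@bdi.ox.ac.uk does not have "
        "storage.objects.get access to the Google Cloud Storage object.>",
        "AccessDeniedException: 403 nickharding@gmail.com does not have "
        "storage.objects.list access to the Google Cloud Storage bucket.",
    ]
    for err in errors:
        print(parse_denied(err))
```

An account that can list objects but not read them points to a role that includes listing without object reads, rather than a missing group membership.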

gbggrant commented 3 years ago

@hardingnj can you give it another try - we've just updated permissions here. No worries if you don't get to this until tomorrow.

hardingnj commented 3 years ago

Still a 403 unfortunately...

gbggrant commented 3 years ago

Dang. Using the nicholas.harding@bdi.ox.ac.uk account, I assume? (The gmail account hasn't been added to the malariagen group.)

hardingnj commented 3 years ago

Unfortunately so!

gbggrant commented 3 years ago

I've copied the two VCFs to /lustre/scratch118/malaria/team112/personal/gg18/phasing-validation so you can get started while we figure out the access issues with gs://malariagen/.

gbggrant commented 3 years ago

Hi again @hardingnj - we've updated read permissions on that bucket (gs://malariagen) again. Can you try to download data again and see if it works?

hardingnj commented 3 years ago

That's done it. Thanks!

gbggrant commented 3 years ago

Great! Sorry for the problems.

hardingnj commented 3 years ago

Results look good. A more detailed PR in vector-ops is referenced below.

Chromosome plots of the H12 summary statistic: [image]

The phase 2 plots of the GSTE locus (top) compared to the new pipeline (bottom): [image]

alimanfoo commented 3 years ago

Hi all, just to confirm that the results from @hardingnj are excellent. Previous analyses replicate extremely well with the haplotypes from the new pipeline. I think this gives us all the confidence we need to go ahead with the new pipeline. :champagne:

hardingnj commented 3 years ago

Great, excited to see this going so well. Looking forward to analysing the whole cohort!

alimanfoo commented 3 years ago

Reopening this issue to record results from a second round of validation against the pipeline with genome region scatter and ligation.

From @jessicaway on 1 April:

The phasing outputs for the 167-sample validation set are available at gs://malariagen/Phasing/validation/Ag1000Phase2_BurkinaFaso/ (these were run with the genome region scatter and ligation).

I've rerun the validation analysis against these new results and everything looks great. Here are the results of the H12 selection scans, comparing the new pipeline ("dev-release-2") against the Ag1000G phase 2 haplotypes:

image

image

Happy to sign off on the new pipeline implementation, sorry it took so long!

alimanfoo commented 3 years ago

xref https://github.com/malariagen/vector-ops/pull/1402