VCF annotation using VEP - [merged]

ManavalanG commented 3 years ago

Merges vepannotation -> master

Annotates variants in VCF using Variant Effect Predictor (VEP).

The following may need review at some level. This list is not exhaustive; other parts may need review as well.

[x] Snakemake pipeline
[x] Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable.
[x] Documentation. I kept it simple; let me know if it needs more info.
[x] Datasets used and their versions

ManavalanG commented 3 years ago

added 1 commit

cc59824e - updates clinvar path

Compare with previous version

ManavalanG commented 3 years ago

added 1 commit

b18b4edf - adds annotated test vcf

Compare with previous version

ManavalanG commented 3 years ago

We currently obtain PolyPhen and SIFT scores via dbNSFP. However, VEP can annotate them natively, without extra work, via the options --sift and --polyphen. The VEP developers recommend this as well, since dbNSFP includes only the non-synonymous variants.

Should we switch?
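
For illustration, a minimal sketch of what the native options could look like on the command line. This is not the pipeline's actual rule: `build_vep_cmd` and the file names are made up, and the cache flags assume a local VEP cache. Both --sift and --polyphen accept p, s, or b (prediction, score, or both). The helper only prints the command, so it can be read without VEP installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a VEP command line that uses VEP's built-in
# SIFT/PolyPhen annotation instead of taking those scores from dbNSFP.
build_vep_cmd() {
    local input_vcf="$1" output_vcf="$2"
    # --vcf writes VCF output; --cache --offline assume a local VEP cache;
    # "b" asks for both the prediction term and the score.
    local cmd=(
        vep
        --input_file "$input_vcf"
        --output_file "$output_vcf"
        --vcf
        --cache --offline
        --sift b
        --polyphen b
    )
    echo "${cmd[*]}"
}

build_vep_cmd sample.vcf.gz sample.vep.vcf
```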

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 27, 2021, 15:35

Yes, that's totally fine and I think we should use those options. The originating source for that info doesn't have to be dbNSFP; it just happens to be the most convenient way to obtain that info in a bulk download.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 09:14

Commented on variant_annotation/configs/env/vep.yaml line 7

Is this particular version of BCFTools needed? I've been using 1.10.2, which has a big bug fix (https://github.com/samtools/bcftools/releases/tag/1.10.2), and the 1.10 release has a bunch of bug fixes in it too. 1.11 is available now as well, but that mostly appears to be a feature enhancement release.
Is this particular version of tabix needed? BCFTools comes with tabix because it uses tabix indexes for some of its processing commands. I'd recommend using the version of tabix packaged with it to avoid issues, unless a specific version is absolutely needed.

ManavalanG commented 3 years ago

changed this line in version 4 of the diff

ManavalanG commented 3 years ago

added 7 commits

665db894 - fixes resources config bug
c06137e4 - uses VEP's sift and polyphen; changes threads to 8
68723b88 - adds warning file
5b5bec96 - breaks long strings to multi-lines
b737497d - bumps cluster partition; update test output
d8368088 - bgzips output vcf
5a982790 - updates conda env

Compare with previous version

ManavalanG commented 3 years ago

bcftools v1.11 has a conflict with VEP; after noticing this, I just went back to my older setting, which was v1.9. I tested v1.10.2 just now and it works without conflict.
Good catch. Checking it now, VEP includes tabix too, which I hadn't realized because their documentation recommends installing tabix separately. The Bioconda recipe probably just chose to include it, I guess.

ManavalanG commented 3 years ago

Changed them now.

ManavalanG commented 3 years ago

Also made these changes recently:

Bumped annotation to use 8 instead of 4 threads (#1)
Made the output file bgzipped, as unzipped files are massive (~12GB) and may be I/O expensive
Upgraded cluster partition to use
Style updates and minor improvements.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 12:29

Commented on variant_annotation/configs/env/vep.yaml line 7

Nice, I figured it was something like that but thought it was worth checking.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 12:29

resolved all threads

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:05

Commented on variant_annotation/src/Snakefile line 136

            | bcftools view -Oz \

switch to using bcftools to generate bgzip output

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:05

Commented on variant_annotation/src/Snakefile line 137

remove as bcftools can do compression directly
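
To spell out the suggestion, here is a rough sketch of compressing with bcftools alone. The function name and file names are placeholders, and `DRY_RUN` is a made-up switch added only so the command can be inspected without bcftools installed; it is not part of the Snakefile.

```shell
#!/usr/bin/env bash
# Sketch: let bcftools write bgzip-compressed VCF itself instead of piping
# into a separate compression step.
compress_with_bcftools() {
    local in_vcf="$1" out_vcf_gz="$2"
    # -Oz selects compressed VCF output; -o names the output file.
    local cmd=(bcftools view -Oz -o "$out_vcf_gz" "$in_vcf")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"   # print the command only
    else
        "${cmd[@]}"        # actually run bcftools
    fi
}

DRY_RUN=1 compress_with_bcftools calls.vcf calls.vcf.gz
```

With `DRY_RUN=1` this prints `bcftools view -Oz -o calls.vcf.gz calls.vcf`.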

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:06

Commented on variant_annotation/src/Snakefile line 115

        bcftools view {input.calls} | \

removing accidental parentheses

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:06

Commented on variant_annotation/src/Snakefile line 138

            > {output.calls}

removing accidental parentheses

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:12

marked the checklist item Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable. as completed

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:12

marked the checklist item Datasets used and their versions as completed

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:17

@ManavalanG I made #3 to remind us to consolidate the repo structure once major components are all merged.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:21

I think it'd be best to move datasets.yaml out of this repo and make it a user config file with a hardcoded path like ~/.ditto_datasets.yaml, and then lay out instructions on its format in the README. That way we won't be sharing any of our internal lab file structure when making the repo public.
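
A minimal sketch of how a run script might look up such a file, assuming the ~/.ditto_datasets.yaml path mentioned above. Both function names are hypothetical and nothing like this exists in the repo yet.

```shell
#!/usr/bin/env bash
# Hypothetical lookup of a per-user datasets config kept outside the repo.
datasets_config_path() {
    echo "$HOME/.ditto_datasets.yaml"
}

# Prints the config path if it is readable; otherwise explains and fails.
require_datasets_config() {
    local cfg
    cfg="$(datasets_config_path)"
    if [ ! -r "$cfg" ]; then
        echo "Datasets config not found: $cfg (see README for its format)" >&2
        return 1
    fi
    echo "$cfg"
}

datasets_config_path
```

Failing early with a pointer to the README keeps lab-specific paths out of the repo while still telling a new user what is missing.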

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:27

I'm torn on how much info about the custom datasets needs to be distributed. Specifically the custom formatting done for GERP and dbNSFP usage. I do not think we need to go crazy on that, but maybe just put a short description on how others could produce them. Thoughts @ManavalanG ?

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:28

I guess it would be good to add version numbers of external datasets to the README with this change, so we know which version was used when, along with the expected file format.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 28, 2021, 17:30

@ManavalanG I'm done with the initial review! I'll re-review when the update to the Snakefile is made to make the input VCF configurable and then ping again when I change the run script to handle commandline specification of input VCF and local vs slurm job execution.
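
As a rough idea of the run-script change, a sketch using bash's getopts. The flag letters (-v, -m), the variable names, and `parse_run_args` are assumptions for illustration, not the final interface.

```shell
#!/usr/bin/env bash
# Hypothetical CLI parsing: -v <input VCF> (required), -m <local|slurm>.
parse_run_args() {
    local OPTIND=1 opt
    INPUT_VCF=""
    RUN_MODE="local"
    # Leading ":" puts getopts in silent mode so errors are handled below.
    while getopts ":v:m:" opt; do
        case "$opt" in
            v) INPUT_VCF="$OPTARG" ;;
            m) RUN_MODE="$OPTARG" ;;
            *) echo "Usage: run_pipeline.sh -v <input.vcf.gz> [-m local|slurm]" >&2
               return 1 ;;
        esac
    done
    if [ -z "$INPUT_VCF" ]; then
        echo "Error: -v <input VCF> is required" >&2
        return 1
    fi
    case "$RUN_MODE" in
        local|slurm) ;;
        *) echo "Error: -m must be 'local' or 'slurm'" >&2
           return 1 ;;
    esac
}

parse_run_args -v sample.vcf.gz -m slurm && echo "vcf=$INPUT_VCF mode=$RUN_MODE"
```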

ManavalanG commented 3 years ago

changed this line in version 5 of the diff

ManavalanG commented 3 years ago

changed this line in version 5 of the diff

ManavalanG commented 3 years ago

changed this line in version 5 of the diff

ManavalanG commented 3 years ago

changed this line in version 5 of the diff

ManavalanG commented 3 years ago

added 2 commits

458bd8dc - refactors vep to directly write bgzipped output file
4d17bff0 - accepts input vcf via config

Compare with previous version

ManavalanG commented 3 years ago

Switched to VEP writing output file directly.

ManavalanG commented 3 years ago

Switched to VEP writing output file directly.
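
For context, a sketch of what direct bgzipped output from VEP can look like. This is not copied from the Snakefile: `vep_bgzip_cmd` and the file names are hypothetical, and the default of 8 forks just mirrors the thread count mentioned earlier. The function prints the command rather than running VEP.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: VEP writes bgzipped VCF itself, so no downstream
# bcftools or bgzip step is needed.
vep_bgzip_cmd() {
    local in_vcf="$1" out_vcf_gz="$2" threads="${3:-8}"
    # --compress_output bgzip compresses the output file;
    # --fork sets the number of parallel VEP processes.
    local cmd=(
        vep
        --input_file "$in_vcf"
        --output_file "$out_vcf_gz"
        --vcf
        --compress_output bgzip
        --fork "$threads"
    )
    echo "${cmd[*]}"
}

vep_bgzip_cmd calls.vcf.gz calls.vep.vcf.gz
```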

ManavalanG commented 3 years ago

Removed.

ManavalanG commented 3 years ago

This part is refactored now.

ManavalanG commented 3 years ago

when the update to the Snakefile is made to make the input VCF configurable

This part is done now.

ManavalanG commented 3 years ago

Yeah that sounds good to me.

The dbNSFP formatting we adopted is quite similar to what the VEP (plugin) folks suggested, so I think we can just point to that.
For GERP, we can just mention the command used for processing.

ManavalanG commented 3 years ago

added 3 commits

e18867d2 - adds test output
e77ea1f0 - inputs datasets config via cli
5531cfd2 - downgrades cluster partition for annotation

Compare with previous version

ManavalanG commented 3 years ago

I agree with the first part and I like the idea. Now it needs to be supplied to snakemake via CLI.
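
One way the CLI hand-off could look, as a sketch only. The `vcf` config key and `snakemake_cmd` are assumptions for illustration; --configfile and --config are standard snakemake options. The command is printed, not executed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pass an out-of-repo datasets config to Snakemake on
# the command line, plus the input VCF as a config value.
snakemake_cmd() {
    local datasets_yaml="$1" input_vcf="$2"
    local cmd=(
        snakemake
        --snakefile variant_annotation/src/Snakefile
        --cores 8
        --use-conda
        --configfile "$datasets_yaml"
        --config "vcf=$input_vcf"
    )
    echo "${cmd[*]}"
}

snakemake_cmd "$HOME/.ditto_datasets.yaml" sample.vcf.gz
```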

For the second part involving README, I'm conflicted. My worry is that it is easy to forget to update README as and when changes are made. But I do see the value of storing it somewhere. Any other way we can track this?

ManavalanG commented 3 years ago

added 1 commit

a5d130ce - removes unused code

Compare with previous version

ManavalanG commented 3 years ago

How about using symlinked files instead? We can store the relevant info this way without having to remember to update the README.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 08:13

I'm an idiot and don't know why I put "we" in the above comment. I was thinking of documenting the version with the formatting (i.e. just use default file versus custom format) on a basic level for making the repo public and making the work reproducible for others. I should've just combined this with the comment about custom datasets documentation.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 09:31

added 1 commit

7f5c3d3c - making changes to add CLI specification of input info

Compare with previous version

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 09:34

added 1 commit

a098b533 - fixing stupid mistake in CLI arg check

Compare with previous version

ManavalanG commented 3 years ago

This is nice. We may also want to add another arg that can be used to pass custom args to the snakemake command. For example, -n, --unlock, etc.
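
A possible shape for this, sketched under the assumption that everything after a literal `--` is forwarded to snakemake untouched. The function and array names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: split the script's own args from args meant for
# snakemake, using "--" as the separator.
split_extra_args() {
    PIPELINE_ARGS=()
    SNAKEMAKE_EXTRA=()
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--" ]; then
            shift
            SNAKEMAKE_EXTRA=("$@")   # the rest goes to snakemake verbatim
            return 0
        fi
        PIPELINE_ARGS+=("$1")
        shift
    done
}

split_extra_args -v sample.vcf.gz -- -n --unlock
echo "pipeline args: ${PIPELINE_ARGS[*]}"
echo "forwarded to snakemake: ${SNAKEMAKE_EXTRA[*]}"
```

The run script could then append `"${SNAKEMAKE_EXTRA[@]}"` to its snakemake invocation; keeping the extras in an array preserves arguments that contain spaces.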

ManavalanG commented 3 years ago

I would recommend adding set -euo pipefail (or similar) to catch some unexpected errors upfront.
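
A small demonstration of what each part catches. Each case runs in its own bash process so the strict settings don't leak into the calling shell; `strict_demo` and `DEMO_VAR` are names made up for this example.

```shell
#!/usr/bin/env bash
# Shows the three behaviours enabled by "set -euo pipefail".
strict_demo() {
    # -e: the script stops at the first failing command, so "reached" is
    # never printed and the child exits non-zero.
    bash -c 'set -e; false; echo reached' \
        || echo "-e: aborted after failure"
    # -u: expanding an unset variable is an error, not an empty string.
    bash -c 'set -u; unset DEMO_VAR; echo "$DEMO_VAR"' 2>/dev/null \
        || echo "-u: unset variable rejected"
    # pipefail: a pipeline fails if any stage fails, not only the last one.
    bash -c 'set -o pipefail; false | cat' \
        || echo "pipefail: upstream failure detected"
}

strict_demo
```

The pipefail part matters most for rules that pipe one tool into another, where a failure in the first stage would otherwise be masked by a successful last stage.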

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 11:15

Commented on variant_annotation/src/run_pipeline.sh line 21

Let's make that a stretch goal. I get the benefits of it, but don't fully know the complications of implementing it in bash right now. For now we can do this by hand with Snakemake if really necessary, and that's good enough to get by. Does that work?

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 11:16

added 1 commit

2c93726a - minor updates based on recommendations, added some dataset info to readme

Compare with previous version

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 11:16

Commented on variant_annotation/src/run_pipeline.sh line 7

good point, added it.

ManavalanG commented 3 years ago

Sounds good.

ManavalanG commented 3 years ago

As per discussion with @wilkb777, the dataset config shown in the README is meant to serve only as an example of the format required for this config, not as documentation of which datasets (or their versions) were used by the pipeline.

@wilkb777 - Feel free to add more info if needed. Closing this now :)

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 12:05

marked the checklist item Snakemake pipeline as completed

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 29, 2021, 12:07

I added a section to the README now, give it a look and let me know if you think it's good enough for this

uab-cgds-worthey / DITTO

VCF annotation using VEP - [merged] #11