Closed ManavalanG closed 1 year ago
We currently obtain PolyPhen and SIFT scores via dbNSFP. However, VEP can annotate them natively, without extra work, via the options --sift and --polyphen. The VEP developers recommend this as well, since dbNSFP includes only non-synonymous variants.
Should we switch?
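For context, a minimal sketch of what the native annotation could look like. The file names are placeholders, and `--cache --offline` assumes a locally installed VEP cache, which may not match this pipeline's setup. The value `b` asks VEP for both the prediction term and the score. The command is only printed here, not run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder file names, for illustration only.
input_vcf="input.vcf.gz"
output_vcf="annotated.vcf"

# Build the command as an array so it can be inspected before running.
# Assumption: VEP runs against a local cache (--cache --offline).
vep_cmd=(
    vep
    --input_file "$input_vcf"
    --output_file "$output_vcf"
    --vcf
    --cache --offline
    --sift b
    --polyphen b
)

# Print the command; replace this line with "${vep_cmd[@]}" to actually run VEP.
printf '%s\n' "${vep_cmd[*]}"
```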
In GitLab by @wilkb777 on Jan 27, 2021, 15:35
Yes, that's totally fine and I think we should use those options. The originating source for that info doesn't have to be dbNSFP; it just happens to be the most convenient way to obtain that info in a bulk download.
In GitLab by @wilkb777 on Jan 28, 2021, 09:14
Commented on variant_annotation/configs/env/vep.yaml line 7
1.10.2, which has a big bug fix (https://github.com/samtools/bcftools/releases/tag/1.10.2); the 1.10 release has a bunch of bug fixes in it too. 1.11 is available now as well, but that mostly appears to be a feature enhancement release.
changed this line in version 4 of the diff
added 7 commits
v1.11 has a conflict with VEP, and after noticing this I just went back to my older setting, which was v1.9. I tested v1.10.2 just now and it works without conflict. Changed them now.
Also made these changes recently:
In GitLab by @wilkb777 on Jan 28, 2021, 12:29
Commented on variant_annotation/configs/env/vep.yaml line 7
Nice, I figured it was something like that but thought it was worth checking.
In GitLab by @wilkb777 on Jan 28, 2021, 12:29
resolved all threads
In GitLab by @wilkb777 on Jan 28, 2021, 17:05
Commented on variant_annotation/src/Snakefile line 136
| bcftools view -Oz \
switch to using bcftools to generate bgzip output
In GitLab by @wilkb777 on Jan 28, 2021, 17:05
Commented on variant_annotation/src/Snakefile line 137
remove as bcftools can do compression directly
In GitLab by @wilkb777 on Jan 28, 2021, 17:06
Commented on variant_annotation/src/Snakefile line 115
bcftools view {input.calls} | \
removing accidental parentheses
In GitLab by @wilkb777 on Jan 28, 2021, 17:06
Commented on variant_annotation/src/Snakefile line 138
> {output.calls}
removing accidental parentheses
In GitLab by @wilkb777 on Jan 28, 2021, 17:12
marked the checklist item Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable. as completed
In GitLab by @wilkb777 on Jan 28, 2021, 17:12
marked the checklist item Datasets used and their versions as completed
In GitLab by @wilkb777 on Jan 28, 2021, 17:17
@ManavalanG I made #3 to remind us to consolidate the repo structure once major components are all merged.
In GitLab by @wilkb777 on Jan 28, 2021, 17:21
I think it'd be best to move datasets.yaml out of this repo and make it a user config file with a hardcoded path like ~/.ditto_datasets.yaml, and then lay out instructions on its format in the README. That way we won't be sharing any of our internal lab file structure when making the repo public.
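A rough sketch of how a run script might pick up such a user config. The helper name `resolve_datasets_config` is hypothetical, and handing the file to Snakemake with `--configfile` is an assumption about how it could be wired in, not something decided in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hardcoded per-user location suggested above.
default_config=~/.ditto_datasets.yaml

# Hypothetical helper: prints the config path if the file exists,
# otherwise reports the problem on stderr and returns non-zero.
resolve_datasets_config() {
    local path="${1:-$default_config}"
    if [[ ! -f "$path" ]]; then
        echo "Datasets config not found: $path (see README for the expected format)" >&2
        return 1
    fi
    printf '%s\n' "$path"
}

# Possible usage inside a run script (not executed here):
#   config_path="$(resolve_datasets_config)"
#   snakemake --configfile "$config_path" --cores 1
```

Failing early with a pointer to the README keeps a missing config from surfacing later as a confusing Snakemake error.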
In GitLab by @wilkb777 on Jan 28, 2021, 17:27
I'm torn on how much info about the custom datasets needs to be distributed. Specifically the custom formatting done for GERP and dbNSFP usage. I do not think we need to go crazy on that, but maybe just put a short description on how others could produce them. Thoughts @ManavalanG ?
In GitLab by @wilkb777 on Jan 28, 2021, 17:28
I guess adding version numbers of external datasets would be good to put in the README with this change so we know what version was used when and the expected file format.
In GitLab by @wilkb777 on Jan 28, 2021, 17:30
@ManavalanG I'm done with the initial review! I'll re-review when the update to the Snakefile is made to make the input VCF configurable and then ping again when I change the run script to handle commandline specification of input VCF and local vs slurm job execution.
changed this line in version 5 of the diff
added 2 commits
Switched to VEP writing output file directly.
Switched to VEP writing output file directly.
Removed.
This part is refactored now.
when the update to the Snakefile is made to make the input VCF configurable
This part is done now.
Yeah that sounds good to me.
added 3 commits
I agree with the first part and I like the idea. Now it needs to be supplied to snakemake via CLI.
For the second part involving README, I'm conflicted. My worry is that it is easy to forget to update README as and when changes are made. But I do see the value of storing it somewhere. Any other way we can track this?
How about using symlinked files instead? We can store relevant info this way but don't have to remember to update readme.
In GitLab by @wilkb777 on Jan 29, 2021, 08:13
I'm an idiot and don't know why I put "we" in the above comment. I was thinking of documenting the version along with the formatting (i.e. default file versus custom format) at a basic level, so that the repo can be made public and the work is reproducible for others. I should've just combined this with the comment about custom datasets documentation.
In GitLab by @wilkb777 on Jan 29, 2021, 09:31
added 1 commit
In GitLab by @wilkb777 on Jan 29, 2021, 09:34
added 1 commit
This is nice. We may also want to add another arg that can be used to pass custom args to the snakemake command, for example -n, --unlock, etc.
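One way this could be done in bash is to forward any trailing arguments untouched via "$@". The sketch below is only an illustration: the function name `build_snakemake_cmd`, the config key `input_vcf`, and the `--cores 1` setting are assumptions, not what run_pipeline.sh actually does. It prints the command one argument per line instead of launching snakemake.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: first argument is the input VCF, everything after it
# is forwarded to snakemake as-is (e.g. -n or --unlock).
build_snakemake_cmd() {
    local input_vcf="$1"
    shift
    # "$@" preserves each remaining argument exactly, including spaces.
    local cmd=(snakemake --cores 1 --config "input_vcf=${input_vcf}" "$@")
    printf '%s\n' "${cmd[@]}"
}

# Dry-run style example: show what would be executed.
build_snakemake_cmd sample.vcf.gz -n
```

Keeping the extra arguments at the end means the script's own options do not need to know anything about Snakemake's flags.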
I would recommend adding set -euo pipefail (or similar) to catch some unexpected errors upfront.
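To illustrate what each of these options catches, here is a small standalone demo. Each case runs in a child bash process so that the deliberate failures do not stop the demo itself; the variable names are made up for the example.

```shell
#!/usr/bin/env bash

# -e: the script stops at the first failing command, so the echo never runs.
errexit_output=$(bash -c 'set -e; false; echo "still running"' || true)

# Without pipefail, a pipeline reports only the status of its last command.
without_pipefail=$(bash -c 'false | cat; echo $?')

# With pipefail, the failing `false` makes the whole pipeline fail.
with_pipefail=$(bash -c 'set -o pipefail; false | cat; echo $?')

# -u: referencing an unset variable is an error instead of an empty string.
if bash -c 'set -u; echo "$undefined_input_vcf"' 2>/dev/null; then
    unset_var_status="ok"
else
    unset_var_status="error"
fi

echo "output after failing command with -e: '${errexit_output}'"
echo "pipeline status without pipefail: ${without_pipefail}"
echo "pipeline status with pipefail: ${with_pipefail}"
echo "unset variable with -u: ${unset_var_status}"
```

The pipefail case is the one most relevant to a pipeline that chains bcftools and VEP with pipes, since a failure in an early stage would otherwise be masked by the last command succeeding.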
In GitLab by @wilkb777 on Jan 29, 2021, 11:15
Commented on variant_annotation/src/run_pipeline.sh line 21
Let's make that a stretch goal. I get the benefits of it, but don't fully know the complications of implementing it in bash right now. For now we can do this by hand with Snakemake if really necessary, and that's good enough to get by. Does that work?
In GitLab by @wilkb777 on Jan 29, 2021, 11:16
added 1 commit
In GitLab by @wilkb777 on Jan 29, 2021, 11:16
Commented on variant_annotation/src/run_pipeline.sh line 7
good point, added it.
Sounds good.
As per discussion with @wilkb777, the dataset config in the README is meant to serve only as an example of the format required for this config, not as documentation of which datasets (or which versions) were used by the pipeline.
@wilkb777 - Feel free to add more info if needed. Closing this now :)
In GitLab by @wilkb777 on Jan 29, 2021, 12:05
marked the checklist item Snakemake pipeline as completed
In GitLab by @wilkb777 on Jan 29, 2021, 12:07
I added a section to the README now, give it a look and let me know if you think it's good enough for this
Merges vepannotation -> master
Annotates variants in VCF using Variant Effect Predictor (VEP).
The following may need review at some level. Note that this list is not exhaustive, so other parts may need review as well.