Open tjcamp opened 2 years ago
/ --- PROJECT - VARIANT PREDICTION IN CYP2D6 - UPDATE 2--- /
Investigate CYP2D6 Pharmacogenetic Variation:
- Involved in the metabolism of nearly 25% of clinically prescribed medications
- Highly polymorphic gene
- The Pharmacogene Variation Consortium (PharmVar) has cataloged over 140 CYP2D6 haplotypes
Aim - Use the Variant Effect Predictor (VEP) tool to find the effects of SNPs in the CYP2D6 gene.
Dataset: https://www.nature.com/articles/s41397-020-00205-5 (Using Illumina reads to study the effects of SNPs on the CYP2D6 gene.)
Envisioned workflow:
1] Extract Illumina reads
2] Indexing and converting files to BAM format using BWA
3] Mapping samples with reference genome
4] Variant calling (SNPs) - BCFtools mpileup
5] Interpret, analyse and annotate variants with Ensembl VEP.
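The workflow above could be sketched as a small command builder. This is only an illustration, not the script from the repository: the file names (`GRCh38.fa`, `reads_1.fastq.gz`, `reads_2.fastq.gz`, `sample`) are hypothetical, the GRCh38 assembly is an assumption, and the VEP step assumes a local cache is already installed. Nothing is executed; the commands are only printed so they can be reviewed first.

```python
import shlex


def illumina_pipeline(ref, fq1, fq2, sample):
    """Return the shell commands for the workflow, in order (nothing is run)."""
    q = shlex.quote
    bam = f"{sample}.sorted.bam"   # hypothetical output name
    vcf = f"{sample}.vcf.gz"       # hypothetical output name
    vep_out = f"{sample}.vep.txt"  # hypothetical output name
    return [
        # Index the reference so BWA can align against it.
        f"bwa index {q(ref)}",
        # Align paired-end reads and write a coordinate-sorted BAM.
        f"bwa mem {q(ref)} {q(fq1)} {q(fq2)} | samtools sort -o {q(bam)} -",
        # Index the BAM so downstream tools can access it by region.
        f"samtools index {q(bam)}",
        # Pile up reads and call variants (-m multiallelic caller, -v variants only).
        f"bcftools mpileup -Ou -f {q(ref)} {q(bam)} | bcftools call -mv -Oz -o {q(vcf)}",
        # Annotate with Ensembl VEP (ASSUMPTION: GRCh38 cache installed locally).
        f"vep -i {q(vcf)} -o {q(vep_out)} --cache --assembly GRCh38",
    ]


for cmd in illumina_pipeline("GRCh38.fa", "reads_1.fastq.gz", "reads_2.fastq.gz", "sample"):
    print(cmd)
```

Printing the commands rather than running them makes it easy to check each step by hand before committing to a long run.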
References to be used:
- Annotating genomic variants using Ensembl VEP - https://onlinelibrary.wiley.com/doi/full/10.1002/humu.24298
- Ensembl Variant Effect Predictor - https://pubmed.ncbi.nlm.nih.gov/27268795/
- VEP - https://useast.ensembl.org/info/docs/tools/vep/script/index.html
/ --- PROJECT - VARIANT PREDICTION IN CYP2D6 - UPDATE 3--- /
Investigate CYP2D6 Pharmacogenetic Variation:
Updates for this week:
Created script:
Pushed this script to GitHub.
The script ran for over 2 hours and had to be cancelled due to time constraints. I need help confirming that the raw data used is correct (an email has been sent about this).
Studied how to use the Variant Effect Predictor and analyze data with it.
Ran the script to perform variant calling on the raw data (Illumina reads) with the following accession: 1000 Genomes Project phase 3: 30X coverage whole genome sequencing [30X whole genome sequencing coverage of the 2504 Phase 3 1000 Genomes samples]. Experiment accession no: ERX3270178 (https://www.ncbi.nlm.nih.gov/sra/ERX3270178[accn]): Illumina NovaSeq 6000 paired-end sequencing
Corrections made to the script since last week: ERR3243163 will not be used, as the data set seems unusually large. Updated ploidy to 2 in the script.
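Since only CYP2D6 is of interest, one way to shorten the variant-calling step of the long run reported above might be to restrict it to the gene region with `bcftools mpileup -r`. The sketch below is an assumption, not the project script: the window `chr22:42126000-42131000` is an approximate GRCh38 location for CYP2D6 that should be verified in Ensembl, the contig name must match the naming used in the reference FASTA, and the file names are hypothetical. The `-r` option also requires an indexed BAM.

```python
import shlex
import subprocess

# ASSUMPTION: approximate GRCh38 window around CYP2D6; verify before use.
CYP2D6_REGION = "chr22:42126000-42131000"


def region_call(ref, bam, out_vcf, region=CYP2D6_REGION, dry_run=True):
    """Call variants only inside `region`.

    With dry_run=True the pipeline is returned as a string and nothing runs.
    """
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", ref, "-r", region, bam]
    call = ["bcftools", "call", "-mv", "-Oz", "-o", out_vcf]
    if dry_run:
        return f"{shlex.join(mpileup)} | {shlex.join(call)}"
    # Stream mpileup output straight into `bcftools call`.
    producer = subprocess.Popen(mpileup, stdout=subprocess.PIPE)
    subprocess.run(call, stdin=producer.stdout, check=True)
    producer.stdout.close()
    if producer.wait() != 0:
        raise RuntimeError("bcftools mpileup failed")
    return out_vcf


# Hypothetical file names, shown in dry-run mode only.
print(region_call("GRCh38.fa", "sample.sorted.bam", "sample.cyp2d6.vcf.gz"))
```

Mapping still has to process all reads, so this only reduces the time spent in the pileup and calling steps.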
Result / Problems:
Further steps:
Script issue no - d924360eba512f9e4576d50a099ecbd3fb66100b
Regarding `bcftools call`: the `--ploidy` option does not take a ploidy number. It selects a predefined ploidy definition that lists the regions to be treated as haploid (such as the sex chromosomes). See below for details:
(base) jc33471@teach-sub1 final_project$ bcftools call --ploidy ?
PRE-DEFINED PLOIDY FILES
GRCh37 .. Human Genome reference assembly GRCh37 / hg19
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
chrX 1 60000 M 1
chrX 2699521 154931043 M 1
chrY 1 59373566 M 1
chrY 1 59373566 F 0
chrM 1 16569 M 1
chrM 1 16569 F 1
GRCh38 .. Human Genome reference assembly GRCh38 / hg38
X 1 9999 M 1
X 2781480 155701381 M 1
Y 1 57227415 M 1
Y 1 57227415 F 0
MT 1 16569 M 1
MT 1 16569 F 1
chrX 1 9999 M 1
chrX 2781480 155701381 M 1
chrY 1 57227415 M 1
chrY 1 57227415 F 0
chrM 1 16569 M 1
chrM 1 16569 F 1
X .. Treat male samples as haploid and female as diploid regardless of the chromosome name
Y .. Treat male samples as haploid and female as no-copy, regardless of the chromosome name
1 .. Treat all samples as haploid
Run as --ploidy
- It seems that you are working with the human genome assembly GRCh38; is that correct? If so, I think you should use `--ploidy GRCh38` instead of `--ploidy 2`, or drop this parameter to treat the complete genome as diploid.
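To make the ploidy definition above concrete, here is a minimal sketch of how such a table can be read: each record is CHROM, FROM, TO, SEX, PLOIDY with 1-based inclusive coordinates, and any region not listed is treated as diploid. The helper names `parse_ploidy` and `ploidy_at` are made up for illustration and are not part of bcftools; the queried positions are arbitrary examples.

```python
# GRCh38 records copied from the `bcftools call --ploidy ?` output above.
GRCH38 = """\
X 1 9999 M 1
X 2781480 155701381 M 1
Y 1 57227415 M 1
Y 1 57227415 F 0
MT 1 16569 M 1
MT 1 16569 F 1
chrX 1 9999 M 1
chrX 2781480 155701381 M 1
chrY 1 57227415 M 1
chrY 1 57227415 F 0
chrM 1 16569 M 1
chrM 1 16569 F 1
"""


def parse_ploidy(text):
    """Parse 'CHROM FROM TO SEX PLOIDY' records into tuples."""
    rows = []
    for line in text.splitlines():
        chrom, start, end, sex, ploidy = line.split()
        rows.append((chrom, int(start), int(end), sex, int(ploidy)))
    return rows


def ploidy_at(rows, chrom, pos, sex, default=2):
    """Return the ploidy at a 1-based position; unlisted regions are diploid."""
    for c, start, end, s, ploidy in rows:
        if c == chrom and s == sex and start <= pos <= end:
            return ploidy
    return default


rows = parse_ploidy(GRCH38)
# CYP2D6 lies on chromosome 22, which has no record, so the default applies.
print("chr22, male:", ploidy_at(rows, "chr22", 42128000, "M"))
# A listed haploid stretch of chrX in male samples.
print("chrX:5000, male:", ploidy_at(rows, "chrX", 5000, "M"))
```

Because chromosome 22 never appears in the table, the choice between `--ploidy GRCh38` and omitting the option should not change how CYP2D6 itself is genotyped; it only matters for the sex chromosomes and the mitochondrial genome.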
/ --- PROJECT - VARIANT PREDICTION IN CYP2D6--- /
Investigate CYP2D6 Pharmacogenetic Variation:
Paper & dataset:
- Nanopore sequencing of the pharmacogene CYP2D6 allows simultaneous haplotyping and detection of duplications - https://pubmed.ncbi.nlm.nih.gov/31559921/
- Reference sequence - NG_008376.3 (8593 bp)
- Sorted BAM files - https://github.com/yusmiatiliau/CYP2D_reference_BAM
Envisioned workflow:
1] Convert available sample BAM files to fastq with samtools fastq
2] Map the samples to the reference sequence using minimap2, NGMLR & BWA
3] Variant calling - BCFtools mpileup (although I have read that this does not work well with long-read Nanopore samples), Clairvoyante & Nanopolish
4] Compare results with the reference paper (they used Clairvoyante and Nanopolish)
5] Interpret, analyse and annotate variants with Ensembl VEP.
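Step 4 of the workflow above (comparing call sets) could be sketched as a plain set comparison of VCF records. This is a simplified illustration under stated assumptions: both call sets use the same reference coordinates, the function names are hypothetical, and the records in the example are made up rather than real CYP2D6 variants.

```python
def variant_keys(vcf_text):
    """Return {(chrom, pos, ref, alt)} from VCF text, one key per ALT allele."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            if alt != ".":
                keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_calls(mine, published):
    """Split two call sets into shared and caller-specific variants."""
    a, b = variant_keys(mine), variant_keys(published)
    return {"shared": a & b, "only_mine": a - b, "only_published": b - a}


# Made-up records for illustration only.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n"
mine = HEADER + "NG_008376.3\t100\t.\tA\tG\nNG_008376.3\t200\t.\tC\tT,G\n"
published = HEADER + "NG_008376.3\t100\t.\tA\tG\nNG_008376.3\t300\t.\tG\tA\n"

for name, variants in compare_calls(mine, published).items():
    print(name, sorted(variants))
```

Indels can be represented in more than one way, so normalizing both files first (for example with `bcftools norm`) is advisable before comparing them like this.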
References to be used:
- Annotating genomic variants using Ensembl VEP - https://onlinelibrary.wiley.com/doi/full/10.1002/humu.24298
- Ensembl Variant Effect Predictor - https://pubmed.ncbi.nlm.nih.gov/27268795/
- VEP - https://useast.ensembl.org/info/docs/tools/vep/script/index.html
- Nanopolish - https://nanopolish.readthedocs.io/en/latest/manual.html
- Clairvoyante - https://github.com/aquaskyline/Clairvoyante
NOTE: Since the paper I selected uses Oxford Nanopore sequencing data, here is another paper whose raw data consists of Illumina reads: https://www.nature.com/articles/s41397-020-00205-5