tjcamp / BINF8940

BINF8940 Class Repository

Final_Project #6

Open tjcamp opened 1 year ago

tjcamp commented 1 year ago

/ --- PROJECT - VARIANT PREDICTION IN CYP2D6--- /

Investigate CYP2D6 Pharmacogenetic Variation:

Paper & dataset: Nanopore sequencing of the pharmacogene CYP2D6 allows simultaneous haplotyping and detection of duplications - https://pubmed.ncbi.nlm.nih.gov/31559921/
Reference sequence - NG_008376.3 (8593 bp)
Sorted BAM files - https://github.com/yusmiatiliau/CYP2D_reference_BAM

Envisioned workflow:
1] Convert the available sample BAM files to FASTQ with samtools fastq
2] Map the reads to the reference sequence using minimap2, NGMLR & BWA
3] Variant calling - BCFtools mpileup (although I have read that this does not work well with long-read nanopore data), Clairvoyante & Nanopolish
4] Compare results with the reference paper (they used Clairvoyante and Nanopolish)
5] Interpret, analyse and annotate variants with Ensembl VEP
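Steps 1 and 2 could be sketched roughly as below, for the minimap2 branch only. This is a minimal sketch, not the script used in the project: the helper `nanopore_commands` and all file names (`sample.bam`, `NG_008376.3.fasta`, the `sample` prefix) are hypothetical, and the helper only builds the command strings so they can be reviewed before being run in a shell.

```python
import shlex


def nanopore_commands(bam, ref, prefix):
    """Build shell commands for BAM -> FASTQ conversion and minimap2 mapping.

    All paths are placeholders; nothing is executed here.
    """
    q = shlex.quote
    fastq = f"{prefix}.fastq"
    sorted_bam = f"{prefix}.minimap2.sorted.bam"
    return [
        # Step 1: extract reads from the published BAM file.
        f"samtools fastq {q(bam)} > {q(fastq)}",
        # Step 2: map nanopore reads (map-ont preset) and sort the alignments.
        f"minimap2 -ax map-ont {q(ref)} {q(fastq)} | samtools sort -o {q(sorted_bam)} -",
        # Index the sorted BAM so variant callers can access it by region.
        f"samtools index {q(sorted_bam)}",
    ]


for cmd in nanopore_commands("sample.bam", "NG_008376.3.fasta", "sample"):
    print(cmd)
```

The NGMLR and BWA branches would follow the same pattern with their own mapping command in place of the minimap2 line.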

References to be used:
Annotating genomic variants using Ensembl VEP - https://onlinelibrary.wiley.com/doi/full/10.1002/humu.24298
Ensembl Variant Effect Predictor - https://pubmed.ncbi.nlm.nih.gov/27268795/
VEP - https://useast.ensembl.org/info/docs/tools/vep/script/index.html
Nanopolish - https://nanopolish.readthedocs.io/en/latest/manual.html
Clairvoyante - https://github.com/aquaskyline/Clairvoyante

NOTE: Since the paper I selected uses Oxford Nanopore sequencing data, here is another paper whose raw data consists of Illumina reads: https://www.nature.com/articles/s41397-020-00205-5

cbergman commented 1 year ago
tjcamp commented 1 year ago

/ --- PROJECT - VARIANT PREDICTION IN CYP2D6 - UPDATE 2--- /

Investigate CYP2D6 Pharmacogenetic Variation:

- Involved in the metabolism of nearly 25% of clinically prescribed medications
- Highly polymorphic gene
- The Pharmacogene Variation Consortium (PharmVar) has cataloged over 140 CYP2D6 haplotypes
- Aim - Use the Variant Effect Predictor tool to find the effects of SNPs in the CYP2D6 gene.

Dataset: https://www.nature.com/articles/s41397-020-00205-5 (Using Illumina reads to study the effects of SNPs on the CYP2D6 gene.)

Envisioned workflow:
1] Extract Illumina reads
2] Index the reference genome using BWA
3] Map the samples to the reference genome and convert the alignments to BAM format
4] Variant calling (SNPs) - BCFtools mpileup
5] Interpret, analyse and annotate variants with Ensembl VEP
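The mapping, calling and annotation steps of the Illumina workflow might look roughly like the following. This is a sketch under assumptions rather than the project script: the helper `illumina_commands` and every file name are hypothetical, the reads are assumed to be already extracted as paired FASTQ files, and the VEP line assumes a local GRCh38 cache is installed (`--cache`). The helper only assembles command strings and does not run them.

```python
def illumina_commands(ref, fq1, fq2, prefix):
    """Build shell commands for BWA mapping, bcftools calling and VEP annotation.

    All paths are placeholders; nothing is executed here.
    """
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    return [
        # Index the reference once so that bwa mem can use it.
        f"bwa index {ref}",
        # Map paired-end reads and write a coordinate-sorted BAM.
        f"bwa mem {ref} {fq1} {fq2} | samtools sort -o {bam} -",
        f"samtools index {bam}",
        # Pile up reads and call variants (-m multiallelic caller, -v variants only).
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Ov -o {vcf}",
        # Annotate the calls; assumes an installed VEP cache for GRCh38.
        f"vep -i {vcf} -o {prefix}.vep.txt --cache --assembly GRCh38",
    ]


for cmd in illumina_commands("ref.fa", "reads_1.fastq", "reads_2.fastq", "sample"):
    print(cmd)
```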

References to be used:
Annotating genomic variants using Ensembl VEP - https://onlinelibrary.wiley.com/doi/full/10.1002/humu.24298
Ensembl Variant Effect Predictor - https://pubmed.ncbi.nlm.nih.gov/27268795/
VEP - https://useast.ensembl.org/info/docs/tools/vep/script/index.html

tjcamp commented 1 year ago

/ --- PROJECT - VARIANT PREDICTION IN CYP2D6 - UPDATE 3--- /

Investigate CYP2D6 Pharmacogenetic Variation:

Updates for this week:

tjcamp commented 1 year ago

Ran the script to perform variant calling on the raw data (Illumina reads) with the following accession numbers:
1000 Genomes Project phase 3: 30X coverage whole genome sequencing [30X whole genome sequencing coverage of the 2504 Phase 3 1000 Genome samples.]
Experiment accession no: ERX3270178 (https://www.ncbi.nlm.nih.gov/sra/ERX3270178[accn]): Illumina NovaSeq 6000 paired end sequencing

  1. ERR4048409 - 8.3 G (no of bases)
  2. ERR4048410 - 9 G (no of bases)
  3. ERR4048411 - 8.4 G (no of bases)

Corrections made to script from last week:
- ERR3243163 - Will not be using this data as it seems unusually large.
- Updated ploidy to 2 in script

Result / Problems:

Further steps:

Script issue no - d924360eba512f9e4576d50a099ecbd3fb66100b

JingxuanChen7 commented 1 year ago

PRE-DEFINED PLOIDY FILES

GRCh37 .. Human Genome reference assembly GRCh37 / hg19

    X 1 60000 M 1
    X 2699521 154931043 M 1
    Y 1 59373566 M 1
    Y 1 59373566 F 0
    MT 1 16569 M 1
    MT 1 16569 F 1
    chrX 1 60000 M 1
    chrX 2699521 154931043 M 1
    chrY 1 59373566 M 1
    chrY 1 59373566 F 0
    chrM 1 16569 M 1
    chrM 1 16569 F 1

GRCh38 .. Human Genome reference assembly GRCh38 / hg38

    X 1 9999 M 1
    X 2781480 155701381 M 1
    Y 1 57227415 M 1
    Y 1 57227415 F 0
    MT 1 16569 M 1
    MT 1 16569 F 1
    chrX 1 9999 M 1
    chrX 2781480 155701381 M 1
    chrY 1 57227415 M 1
    chrY 1 57227415 F 0
    chrM 1 16569 M 1
    chrM 1 16569 F 1

X .. Treat male samples as haploid and female as diploid regardless of the chromosome name

Y .. Treat male samples as haploid and female as no-copy, regardless of the chromosome name

1 .. Treat all samples as haploid

Run as --ploidy <alias> (e.g. --ploidy GRCh37). To see the detailed ploidy definition, append a question mark (e.g. --ploidy GRCh37?).


- It seems that you are working with the human genome assembly GRCh38; is that correct? If so, I think you should use `--ploidy GRCh38` instead of `--ploidy 2`, or omit this parameter to treat the complete genome as diploid.
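To make the preset concrete, here is a small Python sketch of how the GRCh38 table above can be read. It is only an illustration, not bcftools code: the helpers `parse_ploidy` and `lookup` are hypothetical, the columns are read as chromosome, start, end, sex and ploidy (1-based, inclusive), and regions not listed in the table are assumed to fall back to diploid. The chr22 position is an arbitrary placeholder standing in for an autosomal site such as CYP2D6, which lies on chromosome 22.

```python
# "chr"-prefixed rows of the GRCh38 preset quoted above.
GRCH38_PLOIDY = """\
chrX 1 9999 M 1
chrX 2781480 155701381 M 1
chrY 1 57227415 M 1
chrY 1 57227415 F 0
chrM 1 16569 M 1
chrM 1 16569 F 1
"""


def parse_ploidy(text):
    """Parse 'chrom start end sex ploidy' rows into tuples."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, sex, ploidy = line.split()
        rules.append((chrom, int(start), int(end), sex, int(ploidy)))
    return rules


def lookup(rules, chrom, pos, sex, default=2):
    """Return the ploidy for a site; unlisted sites get the default (diploid)."""
    for rule_chrom, start, end, rule_sex, ploidy in rules:
        if rule_chrom == chrom and rule_sex == sex and start <= pos <= end:
            return ploidy
    return default


rules = parse_ploidy(GRCH38_PLOIDY)
print(lookup(rules, "chr22", 42_000_000, "M"))  # autosome, not listed: 2
print(lookup(rules, "chrX", 5_000, "M"))        # listed male X region: 1
print(lookup(rules, "chrX", 5_000, "F"))        # female X, not listed: 2
print(lookup(rules, "chrY", 1_000_000, "F"))    # female Y: 0
```

Under these assumptions an autosomal gene is called as diploid whether the preset is given or the option is left out; the preset only changes the calls on the sex chromosomes and the mitochondrial genome.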