Final_Project - Githubissues

tjcamp commented 2 years ago

/ --- PROJECT - VARIANT PREDICTION IN CYP2D6--- /

Investigate CYP2D6 Pharmacogenetic Variation:

- Involved in the metabolism of nearly 25% of clinically prescribed medications
- Highly polymorphic gene
- The Pharmacogene Variation Consortium (PharmVar) has cataloged over 140 CYP2D6 haplotypes

Aim:
- to understand the distribution of CYP2D6 star alleles
- to characterise novel star alleles

Paper & dataset:
- Paper: Nanopore sequencing of the pharmacogene CYP2D6 allows simultaneous haplotyping and detection of duplications - https://pubmed.ncbi.nlm.nih.gov/31559921/
- Reference sequence: NG_008376.3 (8593 bp)
- Sorted BAM files: https://github.com/yusmiatiliau/CYP2D_reference_BAM

Envisioned workflow:
1] Convert the available sample BAM files to FASTQ with samtools fastq
2] Map the samples to the reference sequence using minimap2, NGMLR & BWA
3] Variant calling - BCFtools mpileup (although I have read that this does not work well with long-read nanopore data), Clairvoyante & Nanopolish
4] Compare the results with the reference paper (they used Clairvoyante and Nanopolish)
5] Interpret, analyse and annotate variants with Ensembl VEP
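Steps 1 and 2 of this plan could be sketched roughly as below. This is only an illustration of the minimap2 route, not the script used in the paper. The function name, file names and output naming are hypothetical. The commands are assembled as strings so they can be reviewed before anything is run.

```python
import shlex


def nanopore_mapping_commands(bam, reference, prefix):
    """Build shell commands for BAM -> FASTQ -> minimap2 -> sorted, indexed BAM.

    All paths are placeholders supplied by the caller; nothing is executed here.
    """
    q = shlex.quote
    fastq = f"{prefix}.fastq"
    sorted_bam = f"{prefix}.minimap2.sorted.bam"
    return [
        # Step 1: extract the reads from the published BAM file
        f"samtools fastq {q(bam)} > {q(fastq)}",
        # Step 2: map with minimap2's Oxford Nanopore preset and sort the output
        f"minimap2 -ax map-ont {q(reference)} {q(fastq)} | "
        f"samtools sort -o {q(sorted_bam)} -",
        # Index the sorted BAM so that variant callers and IGV can read it
        f"samtools index {q(sorted_bam)}",
    ]


for command in nanopore_mapping_commands("sample1.bam", "NG_008376.3.fa", "sample1"):
    print(command)
```

The NGMLR and BWA runs would need their own command lines; they are left out here to keep the sketch short.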

References to be used:
- Annotating genomic variants using Ensembl VEP - https://onlinelibrary.wiley.com/doi/full/10.1002/humu.24298
- Ensembl Variant Effect Predictor - https://pubmed.ncbi.nlm.nih.gov/27268795/
- VEP - https://useast.ensembl.org/info/docs/tools/vep/script/index.html
- Nanopolish - https://nanopolish.readthedocs.io/en/latest/manual.html
- Clairvoyante - https://github.com/aquaskyline/Clairvoyante

NOTE: Since the paper I selected uses Oxford Nanopore sequencing data, here is another paper whose raw data are Illumina reads: https://www.nature.com/articles/s41397-020-00205-5

cbergman commented 2 years ago

Emailed @tjcamp on 11/10 to clarify project goals

tjcamp commented 2 years ago

/ --- PROJECT - VARIANT PREDICTION IN CYP2D6 - UPDATE 2--- /

Investigate CYP2D6 Pharmacogenetic Variation:

- Involved in the metabolism of nearly 25% of clinically prescribed medications
- Highly polymorphic gene
- The Pharmacogene Variation Consortium (PharmVar) has cataloged over 140 CYP2D6 haplotypes

Aim - Use the Variant Effect Predictor tool to find the effects of SNPs in the CYP2D6 gene.

Dataset: https://www.nature.com/articles/s41397-020-00205-5 (Using Illumina reads to study the effects of SNPs on the CYP2D6 gene.)

Envisioned workflow:
1] Extract Illumina reads
2] Indexing and converting files to BAM format using BWA
3] Mapping samples to the reference genome
4] Variant calling (SNPs) - BCFtools mpileup
5] Interpret, analyse and annotate variants with Ensembl VEP
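As a rough sketch of the mapping and SNP-calling steps, the commands could be assembled like this. It assumes paired-end FASTQ files and uses samtools for sorting and BAM conversion, which the plan above does not mention explicitly. All file names and the function name are hypothetical.

```python
def illumina_snp_commands(reference, fastq_1, fastq_2, sample):
    """Build shell commands for BWA mapping followed by bcftools SNP calling.

    Nothing is executed; the strings are returned for inspection.
    """
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.snps.vcf.gz"
    return [
        # Index the reference once so that bwa mem can use it
        f"bwa index {reference}",
        # Map the read pairs and write a coordinate-sorted BAM
        f"bwa mem {reference} {fastq_1} {fastq_2} | samtools sort -o {bam} -",
        f"samtools index {bam}",
        # Pile up, then call variants only (-v) with the multiallelic caller (-m),
        # skipping indels so that only SNPs are reported
        f"bcftools mpileup -f {reference} {bam} | "
        f"bcftools call -mv -V indels -Oz -o {vcf}",
    ]


for command in illumina_snp_commands(
    "GRCh38.fa", "sample_1.fastq.gz", "sample_2.fastq.gz", "sample"
):
    print(command)
```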

References to be used:
- Annotating genomic variants using Ensembl VEP - https://onlinelibrary.wiley.com/doi/full/10.1002/humu.24298
- Ensembl Variant Effect Predictor - https://pubmed.ncbi.nlm.nih.gov/27268795/
- VEP - https://useast.ensembl.org/info/docs/tools/vep/script/index.html

tjcamp commented 2 years ago

/ --- PROJECT - VARIANT PREDICTION IN CYP2D6 - UPDATE 3--- /

Investigate CYP2D6 Pharmacogenetic Variation:

Updates for this week:

Created a script:
1. to extract Illumina reads from the raw data.
2. to map the converted sample BAM files against the reference genome (GRCh38).
3. to call SNPs with BCFtools.
Pushed this script to GitHub.
The script ran for over 2 hours and I had to cancel it due to time constraints. I need help confirming that the raw data used is correct (email sent about this).
Studied how to use the Variant Effect Predictor and how to analyze data with it.
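One possible way to shorten the run time, offered only as a sketch: since the project concerns a single gene, variant calling could be limited to a window around it with the `-r` option of `bcftools mpileup`, which needs an indexed BAM. The window used below is a placeholder on chr22, the chromosome that carries CYP2D6; the real coordinates are an assumption here and should be looked up in Ensembl or NCBI. File and function names are hypothetical.

```python
import re

# Matches regions written as CHROM:START-END, e.g. chr22:100-200
REGION_RE = re.compile(r"^([\w.]+):(\d+)-(\d+)$")


def parse_region(region):
    """Split 'chrom:start-end' into its parts and check that it is sensible."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"expected CHROM:START-END, got {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return chrom, start, end


def region_call_command(reference, bam, region, out_vcf):
    """Build a bcftools pipeline that only looks at one region."""
    parse_region(region)  # fail early on a malformed region string
    return (
        f"bcftools mpileup -f {reference} -r {region} {bam} | "
        f"bcftools call -mv -Oz -o {out_vcf}"
    )


# Placeholder window, NOT verified gene coordinates
print(region_call_command(
    "GRCh38.fa", "sample.sorted.bam", "chr22:42000000-42200000", "cyp2d6.vcf.gz"
))
```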

tjcamp commented 2 years ago

Ran the script to perform variant calling on the raw data (Illumina reads) with the following accessions:

1000 Genomes Project phase 3: 30X coverage whole genome sequencing [30X whole genome sequencing coverage of the 2504 Phase 3 1000 Genome samples.]
Experiment accession no: ERX3270178 (https://www.ncbi.nlm.nih.gov/sra/ERX3270178[accn]): Illumina NovaSeq 6000 paired-end sequencing

ERR4048409 - 8.3 G (no of bases)
ERR4048410 - 9 G (no of bases)
ERR4048411 - 8.4 G (no of bases)

Corrections made to the script from last week:
- ERR3243163 - will not be using this data as it seems unusually large.
- Updated ploidy to 2 in the script.

Result / Problems:

BAM files are empty in the output.
Working on script to rectify the issue.
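While debugging, a quick file-level check can tell a zero-byte or truncated BAM apart from one that was written completely. The sketch below is a simplified stand-in for what `samtools quickcheck` reports, not a replacement for it. It relies on BAM files being BGZF-compressed: they start with the gzip magic bytes and end with a fixed 28-byte end-of-file block. The function name and status strings are made up for this example.

```python
import os

# The fixed, empty BGZF block that terminates every complete BAM file
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def bam_status(path):
    """Return a short description of how complete a BAM file looks on disk."""
    size = os.path.getsize(path)
    if size == 0:
        return "zero bytes"
    with open(path, "rb") as handle:
        if handle.read(2) != b"\x1f\x8b":
            return "not BGZF"
        if size < len(BGZF_EOF):
            return "truncated"
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        tail = handle.read()
    if tail != BGZF_EOF:
        return "truncated"
    if size == len(BGZF_EOF):
        return "EOF block only"
    return "looks complete"
```

A file that "looks complete" can still hold a header and no alignments, so counting records with `samtools view -c` on the same file is a useful second check.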

Further steps:

Use IGV to view variants.
Use VEP to analyse variants.

Script issue no - d924360eba512f9e4576d50a099ecbd3fb66100b

JingxuanChen7 commented 1 year ago

I have already replied about the BAM file issue in Slack.
@tjcamp I think there is another issue in your bcftools call. The --ploidy option does not take the ploidy as a number; it takes the name of a predefined ploidy definition, which specifies the regions that should be treated as haploid (like the sex chromosomes). See below for details:
```
(base) jc33471@teach-sub1 final_project$ bcftools call --ploidy ?

PRE-DEFINED PLOIDY FILES

Columns are: CHROM,FROM,TO,SEX,PLOIDY
Coordinates are 1-based inclusive.
A '*' means any value not otherwise defined.

GRCh37 .. Human Genome reference assembly GRCh37 / hg19

   X 1 60000 M 1
   X 2699521 154931043 M 1
   Y 1 59373566 M 1
   Y 1 59373566 F 0
   MT 1 16569 M 1
   MT 1 16569 F 1
   chrX 1 60000 M 1
   chrX 2699521 154931043 M 1
   chrY 1 59373566 M 1
   chrY 1 59373566 F 0
   chrM 1 16569 M 1
   chrM 1 16569 F 1
   * * * M 2
   * * * F 2

GRCh38 .. Human Genome reference assembly GRCh38 / hg38

   X 1 9999 M 1
   X 2781480 155701381 M 1
   Y 1 57227415 M 1
   Y 1 57227415 F 0
   MT 1 16569 M 1
   MT 1 16569 F 1
   chrX 1 9999 M 1
   chrX 2781480 155701381 M 1
   chrY 1 57227415 M 1
   chrY 1 57227415 F 0
   chrM 1 16569 M 1
   chrM 1 16569 F 1
   * * * M 2
   * * * F 2

X .. Treat male samples as haploid and female as diploid regardless of the chromosome name

   * * * M 1
   * * * F 2

Y .. Treat male samples as haploid and female as no-copy, regardless of the chromosome name

   * * * M 1
   * * * F 0

1 .. Treat all samples as haploid

   * * * * 1

Run as --ploidy <alias> (e.g. --ploidy GRCh37).
To see the detailed ploidy definition, append a question mark (e.g. --ploidy GRCh37?).
```


- It seems that you are working with the human genome assembly GRCh38, is that correct? If so, I think you should use `--ploidy GRCh38` instead of `--ploidy 2`, or drop this parameter to treat the whole genome as diploid.
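To make the table easier to read, here is a small lookup over the GRCh38 definition printed above. It only mimics how the rows are meant to be interpreted and is not how bcftools implements it; the function name is made up. The example position on chr22 is arbitrary and was picked because CYP2D6 sits on that chromosome.

```python
# Rows of the GRCh38 definition: (chrom, start, end, sex, ploidy), 1-based inclusive
GRCH38_PLOIDY = [
    ("X", 1, 9999, "M", 1),
    ("X", 2781480, 155701381, "M", 1),
    ("Y", 1, 57227415, "M", 1),
    ("Y", 1, 57227415, "F", 0),
    ("MT", 1, 16569, "M", 1),
    ("MT", 1, 16569, "F", 1),
    ("chrX", 1, 9999, "M", 1),
    ("chrX", 2781480, 155701381, "M", 1),
    ("chrY", 1, 57227415, "M", 1),
    ("chrY", 1, 57227415, "F", 0),
    ("chrM", 1, 16569, "M", 1),
    ("chrM", 1, 16569, "F", 1),
]

# The wildcard rows: everything not listed above is diploid for both sexes
DEFAULT_PLOIDY = 2


def ploidy_at(chrom, pos, sex):
    """Look up the ploidy for a position and sample sex ('M' or 'F')."""
    for row_chrom, start, end, row_sex, ploidy in GRCH38_PLOIDY:
        if row_chrom == chrom and row_sex == sex and start <= pos <= end:
            return ploidy
    return DEFAULT_PLOIDY


print(ploidy_at("chr22", 42130000, "M"))  # autosome, falls through to the default
print(ploidy_at("chrX", 5000, "M"))       # listed as haploid in males
```

Because chr22 never appears in the table, calls there are diploid for every sample with `--ploidy GRCh38`, which matches the behaviour when the option is left out.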

tjcamp / BINF8940

Final_Project #6