sr320 / ceabigr

Workshop on genomic data integration with a emphasis on epigenetic data (FHL 2022)
4 stars 2 forks source link

Identify SNPs in BS data #10

Closed sr320 closed 4 months ago

sr320 commented 2 years ago

I am going to try to tackle this but anyone else is welcome to tackle (assign yourself and let me know what you need).

sr320 commented 2 years ago

if you want sorted, deduplicated bams from Bismark

wget -r \
--no-check-certificate \
--quiet \
--no-directories --no-parent \
-P . \  #where files go - indicating current directory
-A *sorted.bam \
https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/071521-cvBS/
kubu4 commented 2 years ago

I'm going to play around a bit with this:

EpiDiverse Nextflow pipeline

sr320 commented 2 years ago

this

cd /home/sr320/github/2018_L18-adult-methylation/bg_data/
FILES=$(ls *deduplicated.sorted.bam)
echo ${FILES}
cd /home/shared/8TB_HDD_01/sr320/github/nb-2022/C_virginica/code/
for file in ${FILES}
do
    NAME=$(echo ${file} | awk -F "." '{print $1}')
    echo ${NAME}

    perl /home/shared/BS-Snper-master/BS-Snper.pl \
    --fa ../data/Cvirginica_v300.fa \
    --input /home/sr320/github/2018_L18-adult-methylation/bg_data/${NAME}.deduplicated.sorted.bam \
    --output ../analyses/bsnp01/${NAME}.SNP-candidates.txt \
    --methcg ../analyses/bsnp01/${NAME}.CpG-meth-info.tab \
    --methchg ../analyses/bsnp01/${NAME}.CHG-meth-info.tab \
    --methchh ../analyses/bsnp01/${NAME}.CHH-meth-info.tab \
    --mincover 5 \
    > ../analyses/bsnp01/${NAME}.SNP-results.vcf
done

Generated

https://gannet.fish.washington.edu/seashell/bu-github/nb-2022/C_virginica/analyses/bsnp01/

tldr

sr320@raven:/home/shared/8TB_HDD_01/sr320/github/nb-2022/C_virginica/analyses/bsnp01$ head 9M*
==> 9M_R1_val_1_bismark_bt2_pe.CHG-meth-info.tab <==
#CHROM  POS     CONTEXT Watson-METH     Watson-COVERAGE Watson-QUAL     Crick-METH      Crick-COVERAGE  Crick-QUAL
NC_035780.1     40      CHG     0       267     36      0       63      35
NC_035780.1     54      CHG     0       326     36      .       .       .
NC_035780.1     100     CHG     0       460     36      1       157     36
NC_035780.1     116     CHG     2       502     36      .       .       .
NC_035780.1     119     CHG     .       .       .       0       169     36
NC_035780.1     123     CHG     0       467     36      0       167     35
NC_035780.1     159     CHG     1       324     35      .       .       .
NC_035780.1     162     CHG     .       .       .       0       152     36
NC_035780.1     224     CHG     0       392     35      .       .       .

==> 9M_R1_val_1_bismark_bt2_pe.CHH-meth-info.tab <==
#CHROM  POS     CONTEXT Watson-METH     Watson-COVERAGE Watson-QUAL     Crick-METH      Crick-COVERAGE  Crick-QUAL
NC_035780.1     4       CHH     0       11      34      .       .       .
NC_035780.1     6       CHH     0       19      35      .       .       .
NC_035780.1     16      CHH     .       .       .       0       30      36
NC_035780.1     19      CHH     .       .       .       0       33      36
NC_035780.1     22      CHH     .       .       .       0       37      36
NC_035780.1     24      CHH     1       196     35      .       .       .
NC_035780.1     25      CHH     0       207     36      .       .       .
NC_035780.1     34      CHH     .       .       .       1       55      36
NC_035780.1     38      CHH     0       283     35      .       .       .

==> 9M_R1_val_1_bismark_bt2_pe.CpG-meth-info.tab <==
#CHROM  POS     CONTEXT Watson-METH     Watson-COVERAGE Watson-QUAL     Crick-METH      Crick-COVERAGE  Crick-QUAL
NC_035780.1     29      CG      0       227     36      0       51      36
NC_035780.1     55      CG      1       330     36      0       95      35
NC_035780.1     76      CG      4       360     36      2       145     36
NC_035780.1     94      CG      2       412     36      0       166     36
NC_035780.1     104     CG      0       477     36      1       168     36
NC_035780.1     117     CG      2       500     36      0       171     36
NC_035780.1     135     CG      0       417     36      0       172     36
NC_035780.1     160     CG      1       322     36      0       151     36
NC_035780.1     210     CG      2       396     36      0       75      36

==> 9M_R1_val_1_bismark_bt2_pe.SNP-candidates.txt <==
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  GENOTYPE        FREQUENCY       Number_of_watson[A,T,C,G]       Number_of_crick[A,T,C,G]        Mean_Quality_of_Watson[A,T,C,G] Mean_Quality_of_Crick[A,T,C,G]
NC_035780.1     123     C       0,0,466,0       0,98,75,0       0,0,36,0        0,36,37,0      28       26
NC_035780.1     930     G       1,0,0,3 0,0,0,10        37,0,0,37       0,0,0,36        33     28
NC_035780.1     1472    A       3,2,0,10        560,0,14,1      33,37,0,35      37,0,37,37     26       33
NC_035780.1     1643    C       0,0,0,0 1,0,0,0 0,0,0,0 37,0,0,0        0       24
NC_035780.1     1831    C       0,0,4,0 0,1,2,0 0,0,37,0        0,37,37,0       31      28
NC_035780.1     2357    A       0,1,0,0 0,0,0,0 0,37,0,0        0,0,0,0 42      0
NC_035780.1     2563    G       0,26,0,0        0,29,2,0        0,36,0,0        0,35,31,0      41       38
NC_035780.1     2605    A       0,0,0,26        25,0,0,0        0,0,0,36        34,0,0,0       39       39
NC_035780.1     2700    A       0,1,0,0 1,0,1,0 0,37,0,0        37,0,37,0       42      32

==> 9M_R1_val_1_bismark_bt2_pe.SNP-results.vcf <==
##fileformat=VCFv4.3
##fileDate= 20220211
##bssnperVersion=1.1
##bssnperCommand=--fa ../data/Cvirginica_v300.fa   --input /home/sr320/github/2018_L18-adult-methylation/bg_data/9M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam --output ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.SNP-candidates.txt --methcg ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.CpG-meth-info.tab --methchg ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.CHG-meth-info.tab --methchh ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.CHH-meth-info.tab --minhetfreq  0.1 --minhomfreq  0.85 --minquali 15 --mincover 5 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20
...
NC_007175.2     16711   .       T       C       1000    PASS    DP=191;ADF=0,107;ADR=0,84;AD=0,191;     GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR   1/1:191:0,107:0,84:0,191:0,106,1,1,0,0,84,0:0,36,37,37,0,0,36,0:0.000,1.000
NC_007175.2     16744   .       G       A       1000    PASS    DP=104;ADF=1,103;ADR=0,0;AD=1,103;      GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR   0/1:104:1,103:0,0:1,103:103,0,0,1,0,0,0,104:36,0,0,37,0,0,0,36:0.010,0.990
NC_007175.2     16883   .       G       T       1000    PASS    DP=238;ADF=0,135;ADR=0,103;AD=0,238;    GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR   1/1:238:0,135:0,103:0,238:0,135,0,0,0,103,0,0:0,36,0,0,0,36,0,0:0.000,1.000
NC_007175.2     17211   .       A       G       5       PASS    DP=7;ADF=5,2;ADR=0,0;AD=5,2;   GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR    0/1:7:5,2:0,0:5,2:5,0,0,2,11,0,0,0:37,0,0,31,36,0,0,0:0.714,0.286
NC_007175.2     17228   .       T       A       12      PASS    DP=11;ADF=2,2;ADR=7,0;AD=9,2;  GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR    0/1:11:2,2:7,0:9,2:2,2,0,0,0,7,0,0:37,37,0,0,0,37,0,0:0.818,0.182
sr320 commented 2 years ago

And with 10x coverage threshold https://gannet.fish.washington.edu/seashell/bu-github/nb-2022/C_virginica/analyses/bsnp02/

ksil91 commented 2 years ago

So you got 1 vcf file for each individual. Bcftools merge should merge them into 1 vcf: https://samtools.github.io/bcftools/bcftools.html#merge
bcftools merge sample1.vcf sample2.vcf sample3.vcf --merge all -O v -o merged.vcf

I would then filter with vcftools a bit so it is smaller/less noisy. Max missing 50%, minor allele count >=2, biallelic SNPs:

vcftools --vcf merged.vcf --recode --recode-INFO-all \
--min-alleles 2 --max-alleles 2 \
--max-missing 0.5 \
--mac 2 \
--out merged.filtered
sr320 commented 2 years ago

@ksil91 getting closer...

however as @kubu4 pointed out to me in slack

Just wanted to give you the heads up in case something doesn't go right. For example, vcftools can't handle the newest VCF format.

thus when I filter with vcftools as suggested...

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --vcf Cv_10x_merged.vcf
        --recode-INFO-all
        --mac 2
        --max-alleles 2
        --min-alleles 2
        --max-missing 0.5
        --out Cv_10x_merged.filtered
        --recode

Error: VCF version must be v4.0, v4.1 or v4.2:
You are using version VCFv4.3

do I have to go back upstream and deprecate to lower version of another program? see https://github.com/sr320/nb-2022/blob/main/C_virginica/code/08-job.sh

ksil91 commented 2 years ago

Try this (just changing the file header so it says 4.2) https://www.biostars.org/p/325012/

If that doesn't work, I should be able to filter relatively quickly in R. Can we install packages ourselves on the rStudio server?

sr320 commented 2 years ago

Yes you should be able to install packages.

sr320 commented 2 years ago

Merged VCF - 16G https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/021722-vcfmerge/Cv_10x_merged.vcf

Can we changed header there or need to go back up to step one - I will try changing this one

yaaminiv commented 2 years ago

@sr320 Where can I find BS-Snper output for each sample? I just want C->T SNP data

sr320 commented 2 years ago

SNPs in BS data have been identified.

filtered file.. https://github.com/sr320/ceabigr/blob/main/data/Cv_10x_fxmerge_05.recode.vcf

ksil91 commented 2 years ago

I think that Epidiverse-SNP pipeline is the way to go. https://github.com/EpiDiverse/snp