Closed sr320 closed 4 months ago
if you want sorted, deduplicated bams from Bismark
wget -r \
--no-check-certificate \
--quiet \
--no-directories --no-parent \
-P . \ #where files go - indicating current directory
-A *sorted.bam \
https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/071521-cvBS/
I'm going to play around a bit with this:
this
cd /home/sr320/github/2018_L18-adult-methylation/bg_data/
FILES=$(ls *deduplicated.sorted.bam)
echo ${FILES}
cd /home/shared/8TB_HDD_01/sr320/github/nb-2022/C_virginica/code/
for file in ${FILES}
do
NAME=$(echo ${file} | awk -F "." '{print $1}')
echo ${NAME}
perl /home/shared/BS-Snper-master/BS-Snper.pl \
--fa ../data/Cvirginica_v300.fa \
--input /home/sr320/github/2018_L18-adult-methylation/bg_data/${NAME}.deduplicated.sorted.bam \
--output ../analyses/bsnp01/${NAME}.SNP-candidates.txt \
--methcg ../analyses/bsnp01/${NAME}.CpG-meth-info.tab \
--methchg ../analyses/bsnp01/${NAME}.CHG-meth-info.tab \
--methchh ../analyses/bsnp01/${NAME}.CHH-meth-info.tab \
--mincover 5 \
> ../analyses/bsnp01/${NAME}.SNP-results.vcf
done
Generated
https://gannet.fish.washington.edu/seashell/bu-github/nb-2022/C_virginica/analyses/bsnp01/
tldr
sr320@raven:/home/shared/8TB_HDD_01/sr320/github/nb-2022/C_virginica/analyses/bsnp01$ head 9M*
==> 9M_R1_val_1_bismark_bt2_pe.CHG-meth-info.tab <==
#CHROM POS CONTEXT Watson-METH Watson-COVERAGE Watson-QUAL Crick-METH Crick-COVERAGE Crick-QUAL
NC_035780.1 40 CHG 0 267 36 0 63 35
NC_035780.1 54 CHG 0 326 36 . . .
NC_035780.1 100 CHG 0 460 36 1 157 36
NC_035780.1 116 CHG 2 502 36 . . .
NC_035780.1 119 CHG . . . 0 169 36
NC_035780.1 123 CHG 0 467 36 0 167 35
NC_035780.1 159 CHG 1 324 35 . . .
NC_035780.1 162 CHG . . . 0 152 36
NC_035780.1 224 CHG 0 392 35 . . .
==> 9M_R1_val_1_bismark_bt2_pe.CHH-meth-info.tab <==
#CHROM POS CONTEXT Watson-METH Watson-COVERAGE Watson-QUAL Crick-METH Crick-COVERAGE Crick-QUAL
NC_035780.1 4 CHH 0 11 34 . . .
NC_035780.1 6 CHH 0 19 35 . . .
NC_035780.1 16 CHH . . . 0 30 36
NC_035780.1 19 CHH . . . 0 33 36
NC_035780.1 22 CHH . . . 0 37 36
NC_035780.1 24 CHH 1 196 35 . . .
NC_035780.1 25 CHH 0 207 36 . . .
NC_035780.1 34 CHH . . . 1 55 36
NC_035780.1 38 CHH 0 283 35 . . .
==> 9M_R1_val_1_bismark_bt2_pe.CpG-meth-info.tab <==
#CHROM POS CONTEXT Watson-METH Watson-COVERAGE Watson-QUAL Crick-METH Crick-COVERAGE Crick-QUAL
NC_035780.1 29 CG 0 227 36 0 51 36
NC_035780.1 55 CG 1 330 36 0 95 35
NC_035780.1 76 CG 4 360 36 2 145 36
NC_035780.1 94 CG 2 412 36 0 166 36
NC_035780.1 104 CG 0 477 36 1 168 36
NC_035780.1 117 CG 2 500 36 0 171 36
NC_035780.1 135 CG 0 417 36 0 172 36
NC_035780.1 160 CG 1 322 36 0 151 36
NC_035780.1 210 CG 2 396 36 0 75 36
==> 9M_R1_val_1_bismark_bt2_pe.SNP-candidates.txt <==
#CHROM POS ID REF ALT QUAL FILTER GENOTYPE FREQUENCY Number_of_watson[A,T,C,G] Number_of_crick[A,T,C,G] Mean_Quality_of_Watson[A,T,C,G] Mean_Quality_of_Crick[A,T,C,G]
NC_035780.1 123 C 0,0,466,0 0,98,75,0 0,0,36,0 0,36,37,0 28 26
NC_035780.1 930 G 1,0,0,3 0,0,0,10 37,0,0,37 0,0,0,36 33 28
NC_035780.1 1472 A 3,2,0,10 560,0,14,1 33,37,0,35 37,0,37,37 26 33
NC_035780.1 1643 C 0,0,0,0 1,0,0,0 0,0,0,0 37,0,0,0 0 24
NC_035780.1 1831 C 0,0,4,0 0,1,2,0 0,0,37,0 0,37,37,0 31 28
NC_035780.1 2357 A 0,1,0,0 0,0,0,0 0,37,0,0 0,0,0,0 42 0
NC_035780.1 2563 G 0,26,0,0 0,29,2,0 0,36,0,0 0,35,31,0 41 38
NC_035780.1 2605 A 0,0,0,26 25,0,0,0 0,0,0,36 34,0,0,0 39 39
NC_035780.1 2700 A 0,1,0,0 1,0,1,0 0,37,0,0 37,0,37,0 42 32
==> 9M_R1_val_1_bismark_bt2_pe.SNP-results.vcf <==
##fileformat=VCFv4.3
##fileDate= 20220211
##bssnperVersion=1.1
##bssnperCommand=--fa ../data/Cvirginica_v300.fa --input /home/sr320/github/2018_L18-adult-methylation/bg_data/9M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam --output ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.SNP-candidates.txt --methcg ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.CpG-meth-info.tab --methchg ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.CHG-meth-info.tab --methchh ../analyses/bsnp01/9M_R1_val_1_bismark_bt2_pe.CHH-meth-info.tab --minhetfreq 0.1 --minhomfreq 0.85 --minquali 15 --mincover 5 --maxcover 1000 --minread2 2 --errorate 0.02 --mapvalue 20
...
NC_007175.2 16711 . T C 1000 PASS DP=191;ADF=0,107;ADR=0,84;AD=0,191; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 1/1:191:0,107:0,84:0,191:0,106,1,1,0,0,84,0:0,36,37,37,0,0,36,0:0.000,1.000
NC_007175.2 16744 . G A 1000 PASS DP=104;ADF=1,103;ADR=0,0;AD=1,103; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 0/1:104:1,103:0,0:1,103:103,0,0,1,0,0,0,104:36,0,0,37,0,0,0,36:0.010,0.990
NC_007175.2 16883 . G T 1000 PASS DP=238;ADF=0,135;ADR=0,103;AD=0,238; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 1/1:238:0,135:0,103:0,238:0,135,0,0,0,103,0,0:0,36,0,0,0,36,0,0:0.000,1.000
NC_007175.2 17211 . A G 5 PASS DP=7;ADF=5,2;ADR=0,0;AD=5,2; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 0/1:7:5,2:0,0:5,2:5,0,0,2,11,0,0,0:37,0,0,31,36,0,0,0:0.714,0.286
NC_007175.2 17228 . T A 12 PASS DP=11;ADF=2,2;ADR=7,0;AD=9,2; GT:DP:ADF:ADR:AD:BSD:BSQ:ALFR 0/1:11:2,2:7,0:9,2:2,2,0,0,0,7,0,0:37,37,0,0,0,37,0,0:0.818,0.182
And with 10x coverage threshold https://gannet.fish.washington.edu/seashell/bu-github/nb-2022/C_virginica/analyses/bsnp02/
So you got 1 vcf file for each individual. Bcftools merge should merge them into 1 vcf:
https://samtools.github.io/bcftools/bcftools.html#merge
bcftools merge sample1.vcf sample2.vcf sample3.vcf --merge all -O v -o merged.vcf
I would then filter with vcftools a bit so it is smaller/less noisy. Max missing 50%, minor allele count >=2, biallelic SNPs:
vcftools --vcf merged.vcf --recode --recode-INFO-all \
--min-alleles 2 --max-alleles 2 \
--max-missing 0.5 \
--mac 2 \
--out merged.filtered
@ksil91 getting closer...
however as @kubu4 pointed out to me in slack
Just wanted to give you the heads up in case something doesn't go right. For example, vcftools can't handle the newest VCF format.
thus when I filter with vcftools as suggested...
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf Cv_10x_merged.vcf
--recode-INFO-all
--mac 2
--max-alleles 2
--min-alleles 2
--max-missing 0.5
--out Cv_10x_merged.filtered
--recode
Error: VCF version must be v4.0, v4.1 or v4.2:
You are using version VCFv4.3
do I have to go back upstream and deprecate to lower version of another program? see https://github.com/sr320/nb-2022/blob/main/C_virginica/code/08-job.sh
Try this (just changing the file header so it says 4.2) https://www.biostars.org/p/325012/
If that doesn't work, I should be able to filter relatively quickly in R. Can we install packages ourselves on the rStudio server?
Yes you should be able to install packages.
Merged VCF - 16G https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/021722-vcfmerge/Cv_10x_merged.vcf
Can we changed header there or need to go back up to step one - I will try changing this one
@sr320 Where can I find BS-Snper output for each sample? I just want C->T SNP data
SNPs in BS data have been identified.
filtered file.. https://github.com/sr320/ceabigr/blob/main/data/Cv_10x_fxmerge_05.recode.vcf
I think that Epidiverse-SNP pipeline is the way to go. https://github.com/EpiDiverse/snp
I am going to try to tackle this but anyone else is welcome to tackle (assign yourself and let me know what you need).