nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads intrinsic dependencies
Other
125 stars 33 forks source link

The *.beta files and *.bigwig files are inconsistent. #7

Closed liu930724 closed 2 years ago

liu930724 commented 2 years ago

Hi, I downloaded the data from the GSE186458 and want to find markers per cell type. But the beta files and the bigwig files are inconsistent.

wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5652nnn/GSM5652176/suppl/GSM5652176_Adipocytes-Z000000T7.beta.gz .
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5652nnn/GSM5652176/suppl/GSM5652176_Adipocytes-Z000000T7.bigwig .

md5sum GSM*
a90af84bfa70e783fa540347db1bec41  GSM5652176_Adipocytes-Z000000T7.beta.gz
7ff5b4efdafd8002ebbb639472e6d1be  GSM5652176_Adipocytes-Z000000T7.bigwig

wgbstools beta2bed -o Z000000T7.bed GSM5652176_Adipocytes-Z000000T7.beta.gz
bigWigToBedGraph GSM5652176_Adipocytes-Z000000T7.bigwig Z000000T7.bedGraph

awk '{printf ("%s\t%s\t%s\t%.3f\n",$1,$2,$3,$4/$5)}' Z000000T7.bed > Z000000T7.format.bed
awk '{printf ("%s\t%s\t%s\t%.3f\n",$1,$2,$3,$4)}' Z000000T7.bedGraph > Z000000T7.format.bedGraph

In this case, the intersection of the two files is 26,179,734 records. There are 472,378 records only in beta file, and 1,747,426 records only in bigwig file.

Theoretically, the methylation ratio should be 0 to 1. However, there are more than 5% records with methylation rate of -1 in the bigwig file. The other inconsistencies between two files are chromosomal end sites.

$ awk '$4<0' Z000000T7.format.bedGraph | head
chr1    10983   10985   -1.000
chr1    10989   10991   -1.000
chr1    10997   10999   -1.000

$ grep chr5 Z000000T7.format.bed | grep 141316059
chr5    141316059   141316061   1.000

$ grep chr5 Z000000T7.format.bedGraph | grep 141316059
chr5    141316059   141316063   1.000

Which data is correct? Should I use beta files to find markers?

Thanks.

liu930724 commented 2 years ago

Problem solved~ bigWigToWig should be used instead of bigWigTobedGraph.

bigWigToWig GSM5652176_Adipocytes-Z000000T7.bigwig Z000000T7.wig
grep -v "#" Z000000T7.wig | awk '$4>=0{printf ("%s\t%s\t%s\t%.3f\n",$1,$2,$3,$4)}' > Z000000T7.format.wig

$ wc -l *.format*
  26652112 Z000000T7.format.bed
  27927160 Z000000T7.format.bedGraph
  26652112 Z000000T7.format.wig

$ for i in `ls *.format.*`;do echo $i;awk '{sum[$3-$2]+=1} END {for(k in sum) print k ":" sum[k]}' $i;done
Z000000T7.format.bed
2:26652112
Z000000T7.format.bedGraph
26:1
4:248729
18:7
...
16:13
Z000000T7.format.wig
2:26652112