nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads intrinsic dependencies
Other
134 stars 37 forks source link

wgbstools convert is not working as expected #59

Closed ekushele closed 8 months ago

ekushele commented 10 months ago

Hello,

I'm trying to use wgbstools convert to translates genomic loci to CpG-index, and it's not working as expected.

The command I used: wgbstools convert -L check_head.bed --genome hg38 --debug

Original file:

chr1    10468   10469   22  29
chr1    10470   10471   21  28
chr1    10483   10484   28  29
chr1    10488   10489   21  28
chr1    10492   10493   25  28
chr1    10496   10497   26  27
chr1    10524   10525   26  28
chr1    10541   10542   26  29
chr1    10562   10563   28  31
chr1    10570   10571   29  31

Expected output:

chr1    10468   10469   1   2   22  29
chr1    10470   10471   2   3   21  28
chr1    10483   10484   3   4   28  29
chr1    10488   10489   4   5   21  28
chr1    10492   10493   5   6   25  28
chr1    10496   10497   6   7   26  27
chr1    10524   10525   7   8   26  28
chr1    10541   10542   8   9   26  29
chr1    10562   10563   10  11  28  31
chr1    10570   10571   11  12  29  31

The output I get:

chr1    10468   10469   NA  NA  22  29
chr1    10470   10471   NA  NA  21  28
chr1    10483   10484   NA  NA  28  29
chr1    10488   10489   NA  NA  21  28
chr1    10492   10493   NA  NA  25  28
chr1    10496   10497   NA  NA  26  27
chr1    10524   10525   NA  NA  26  28
chr1    10541   10542   NA  NA  26  29
chr1    10562   10563   NA  NA  28  31
chr1    10570   10571   NA  NA  29  31

I ran with --debug and the command that was printed before the output was: tabix -R /tmp/tmpfk4kdgtm /path/wgbs_tools/references/hg38/CpG.bed.gz | awk -v OFS='\t' '{print $1,$2,$2+1,$3}' | sort -k1,1 -k2,2n -u | bedtools intersect -sorted -b - -a /tmp/tmpfk4kdgtm -loj | bedtools groupby -g 1,2,3 -c 7,7 -o first,last | awk -v OFS='\t' '{print $1,$2,$3,$4,$5+1;}' | sed 's/\.\t1/NA\tNA/g'

@nloyfer I'll appreciate if you can take a look.

Thank you!

nloyfer commented 10 months ago

I think it's because check_head.bed is 0-based. Try

awk -v OFS='\t' '{print $1,$2+1,$3+1}' check_head.bed > check_head1.bed
wgbstools convert -L check_head1.bed --genome hg38
ekushele commented 10 months ago

Thank you!