heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0
32 stars 21 forks source link

gemBS extract bed file has incorrect delimiters #69

Closed paul-sud closed 4 years ago

paul-sud commented 4 years ago

In recent updates to the code it looks like the gemBS style bed files have an extra space inserted as a delimiter after the third column after the feature end coordinate, in addition to the usual tab. Oddly, this doesn't seem to occur with the ENCODE-style bed files. tabix also seemed to run successfully on the file too. The data here was produced by running a subset of the test set from here: http://statgen.cnag.cat/GEMBS/UserGuide/_build/html/example.html

ENCODE-style bed files:

> head -n 10 sample_sample5_cpg.bed | tr '  '  '#'
track#name="sample5"#description="sample5"#visibility=2#itemRgb="On"
chrI    144 145 "sample5"   0   +   144 145 0,255,0 0   0   CG  CG  80
chrI    145 146 "sample5"   27  -   145 146 0,255,0 27  0   CG  CG  2
chrI    175 176 "sample5"   0   +   175 176 0,255,0 0   0   CG  CG  107
chrI    176 177 "sample5"   39  -   176 177 0,255,0 39  0   CG  CG  2
chrI    216 217 "sample5"   0   +   216 217 0,255,0 0   0   CG  CG  107
chrI    217 218 "sample5"   36  -   217 218 0,255,0 36  0   CG  CG  2
chrI    325 326 "sample5"   0   +   325 326 0,255,0 0   0   CG  CG  80
chrI    326 327 "sample5"   28  -   326 327 0,255,0 28  0   CG  CG  2
chrI    339 340 "sample5"   1   +   339 340 0,255,0 1   0   CG  CG  73

gemBS-style bed file, note the # after the feature stop coordinate:

> head -n 10 sample_sample5_cpg.txt | tr  '  '  '#'
Contig  Pos0    Pos1    Ref sample_sample5:Call sample_sample5:Flags    sample_sample5:Meth sample_sample5:non_conv sample_sample5:conv sample_sample5:support_call sample_sample5:total
chrI    834 835#    C   C   GQ=72;MQ=60 0.000   0   4   39  40
chrI    835 836#    G   G   GQ=13;MQ=60 0.019   1   35  40  40
chrI    868 869#    C   C   GQ=81;MQ=60 0.000   0   4   31  31
chrI    869 870#    G   G   GQ=12;MQ=60 0.000   0   28  32  32
chrI    890 891#    C   C   GQ=87;MQ=60 0.000   0   4   33  33
chrI    891 892#    G   G   GQ=13;MQ=60 0.026   1   28  33  33
chrI    936 937#    C   C   GQ=123;MQ=60    0.344   1   2   44  44
chrI    937 938#    G   G   GQ=10;MQ=60 0.197   8   33  44  44
chrI    955 956#    C   C   GQ=118;MQ=60    0.699   2   1   42  42

I have some downstream processing that relies on the bed files being well-formed, and would appreciate a solution.

Thanks,

Paul

paul-sud commented 4 years ago

I should add, this is using the most current gemBS revision.

heathsc commented 4 years ago

Hi Paul,

Thanks for the report - clearly a bug. Is this from the master or devel branch?

Cheers, Simon

On Mon, Jan 27, 2020 at 5:33 PM Paul Sud notifications@github.com wrote:

Reopened #69 https://github.com/heathsc/gemBS/issues/69.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/69?email_source=notifications&email_token=AAY4652I3CIEI4ID7KVMCN3Q74EGHA5CNFSM4KMEGMNKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOWHCYTVA#event-2982513108, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY4655HKDN27PQ2FK2BGYDQ74EGHANCNFSM4KMEGMNA .

paul-sud commented 4 years ago

Thanks for the quick response Simon. This is on master

heathsc commented 4 years ago

OK, in the file gemBS/tools/gemBS_plugins/output.c, line 133 is currently:

fprintf(fp,"%s\t%" PRId64 "\t%" PRId64 " \t%c", args->hdr->id [BCF_DT_CTG][rec->rid].key, rec->pos + pos, rec->pos + pos + 1, cx_len >= 3

fprintf(fp,"%s\t%" PRId64 "\t%" PRId64 "\t%c", args->hdr->id [BCF_DT_CTG][rec->rid].key, rec->pos + pos, rec->pos + pos + 1, cx_len >= 3

If this solves your issue then I will push the fix to master.

Thanks again for the report. Simon

On Mon, Jan 27, 2020 at 5:42 PM Paul Sud notifications@github.com wrote:

Thanks for the quick response Simon. This is on master

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/69?email_source=notifications&email_token=AAY46557C34P3VJVOOPOYQ3Q74FGNA5CNFSM4KMEGMNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKAFY4A#issuecomment-578837616, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY46522G23JAPQW5OTOQW3Q74FGNANCNFSM4KMEGMNA .

heathsc commented 4 years ago

I have just pushed the fix to master (and devel as the same fault was there as well)

On Mon, Jan 27, 2020 at 5:55 PM Simon Heath simon.heath@gmail.com wrote:

OK, in the file gemBS/tools/gemBS_plugins/output.c, line 133 is currently:

fprintf(fp,"%s\t%" PRId64 "\t%" PRId64 " \t%c", args->hdr->id [BCF_DT_CTG][rec->rid].key, rec->pos + pos, rec->pos + pos + 1, cx_len >= 3 + pos ? cx[2 + pos] : '.');

and should be changed to:

fprintf(fp,"%s\t%" PRId64 "\t%" PRId64 "\t%c", args->hdr->id [BCF_DT_CTG][rec->rid].key, rec->pos + pos, rec->pos + pos + 1, cx_len >= 3 + pos ? cx[2 + pos] : '.');

If this solves your issue then I will push the fix to master.

Thanks again for the report. Simon

On Mon, Jan 27, 2020 at 5:42 PM Paul Sud notifications@github.com wrote:

Thanks for the quick response Simon. This is on master

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/69?email_source=notifications&email_token=AAY46557C34P3VJVOOPOYQ3Q74FGNA5CNFSM4KMEGMNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKAFY4A#issuecomment-578837616, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY46522G23JAPQW5OTOQW3Q74FGNANCNFSM4KMEGMNA .

paul-sud commented 4 years ago

Thanks Simon, the fix on master worked.