gstricker / GenoGAM

http://bioconductor.org/packages/devel/bioc/html/GenoGAM.html

GenoGAMDataSet gives an error #1

Closed nevelsk90 closed 6 years ago

nevelsk90 commented 6 years ago

Dear developers,

I'm trying to construct a dataset from 4 BAM files using the GenoGAMDataSet function. The function initially runs normally, and the progress bars complete for all 4 BAM files without warnings, but then I receive this error:

Error in DataFrame(rawData) : cannot coerce class "list" to a DataFrame

Do you have any idea what might cause this error? My experiment design:

expDesign <- data.frame(ID = c("IP_rep1", "IP_rep2", "input_rep1", "input_rep2"),
                        file = list.files(path = bamdir, pattern = "MAPQ30unique.bam$", full.names = FALSE),
                        paired = rep(FALSE, 4),
                        SampleType = factor(c(1, 1, 0, 0)),
                        stringsAsFactors = FALSE)

Thank you in advance.

Regards

gstricker commented 6 years ago

Hi,

thanks for reaching out. I assume this error appears because the list elements are of different lengths. Here the list elements are integer vectors of read counts, one vector per BAM file. To check whether this is indeed the case, could you please run it through the debugger for me like this:

nevelsk90 commented 6 years ago

Hi, thanks for the prompt reply!

I've run the debugger trying to read in a couple of BAM files and there is apparently no difference in the length of the rawData elements:

Browse[2]> rawData
[[1]]
numeric-Rle of length 2730871774 with 96122808 runs
  Lengths: 3000152 1 59 1 178 2 148 1 70 ... 98 1 106 1 66 1 205 1 24
  Values : 0 1 0 1 0 1 0 1 0 ... 0 1 0 1 0 1 0 1 0

[[2]]
numeric-Rle of length 2730871774 with 102272468 runs
  Lengths: 3000123 1 61 1 31 1 62 1 38 ... 95 1 14 1 13 1 167 1 17
  Values : 0 1 0 1 0 1 0 1 0 ... 0 1 0 1 0 1 0 1 0

Moreover, if I read in only one BAM file the function throws the same error as before.

I should have specified: I'm using the latest stable version (1.6.0) of the GenoGAM package, but the same issue occurs with the development version (1.7.0).

gstricker commented 6 years ago

Hi,

I think I know the problem: I just created some random Rle vectors of the same length as yours and put them in a list. Converting them to a DataFrame (note this is an S4Vectors class, not the native R data.frame) fails; however, anything shorter than length 2^31 (= 2147483648) works. It seems DataFrame does not support lengths beyond the 32-bit integer range.
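As a rough check of the arithmetic, here is a minimal base-R sketch comparing the Rle length from the debugger output above with the largest value an R integer can hold. It does not call GenoGAM or S4Vectors at all; the helper `exceeds_int_range` and the yeast-sized comparison value (about 12 million positions) are assumptions made up for illustration.

```r
# Length of each rawData element, copied from the debugger output in this thread.
rle_length <- 2730871774

# Largest value an R integer can hold: 2^31 - 1 = 2147483647.
int_max <- .Machine$integer.max

# Hypothetical helper (not part of GenoGAM or S4Vectors): TRUE if a vector of
# length n cannot be indexed with ordinary 32-bit R integers.
exceeds_int_range <- function(n) n > int_max

exceeds_int_range(rle_length)  # TRUE: the mouse-sized vectors from this issue
exceeds_int_range(12e6)        # FALSE: roughly yeast-sized, for comparison

# Number of positions beyond the 32-bit limit.
rle_length - int_max
```

This only shows that the vectors in this issue are past the limit; whether DataFrame accepts such lengths depends on the installed S4Vectors version.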

I assume you are working with something in the range of mouse data. Unfortunately, this GenoGAM version does not work well (or at all) with organisms larger than yeast, or at most fly, in terms of both computation time and memory. However, we are currently developing a version that is orders of magnitude faster and more memory-efficient, as it stores data largely on disk. It does not yet offer the downstream analysis functionality of this version (peak calling, differential binding, etc.), but it should already be stable enough to use if you simply want to obtain the fit of the model.

If you are interested in using it, it's on my GitHub under the working title "fastGenoGAM". It has pretty much the same workflow, with a few additional parameters. If you email me at georg.stricker@in.tum.de, I can send you an example workflow for reference (as it has no real vignette yet).

We aim to release it as the next GenoGAM version in one of the Bioconductor release cycles this year.

Best, Georg

nevelsk90 commented 6 years ago

Exactly, I'm working with the mouse genome, so we have figured out the problem. Thanks for the updates.