kaneplusplus / bigmemory


Last empty column gives "Dimension mismatch between header row and first data row" #82

Closed nnalpas closed 6 years ago

nnalpas commented 6 years ago

Hello,

I encountered the following error while using read.big.matrix:

> evid_bm <- read.big.matrix(filename = "L:/Data/SQL_Maxquant/2018_02/P193-U1-20180204/evidence.txt", sep = "\t", header = TRUE, has.row.names = TRUE, type = "character", backingfile = "evid.bck", descriptorfile = "evid.desc", shared = options()$bigmemory.default.shared)
Error in read.big.matrix(filename = "L:/Data/SQL_Maxquant/2018_02/P193-U1-20180204/evidence.txt", : Dimension mismatch between header row and first data row.

I then checked my file, which I know has no row names but does have empty values in its last columns. The first line of the file (after the header) therefore looks like this:

> firstLine <- scan(file = "L:/Data/SQL_Maxquant/2018_02/P193-U1-20180204/evidence.txt", what = "character", skip = 1, nlines = 1, sep = "\n")
> firstLine
[1] "AAAAAAEGIEAAEK\t14\tUnmodified\tAAAAAAEGIEAAEK\t\t\t0\t0\t0\tMC-0-1_GL0042895;44_gene_id_58438\tMC-0-1_GL0042895\tMC-0-1_GL0042895\t\t\tMULTI-MSMS\t20170103_VA_MetaDS_R114\tR114\t636.8281\t2\t636.825149\t1271.63574\t38214.52\t3.7017\t0.0023573\t0.099839\t6.358E-05\t3.8015\t0.0024209\t636.824962634561\t23.618\t0.30947\t23.618\t23.477\t23.786\t0\t\t\t\t\t32\t18\t2\t0\t0\t0\t0.00014828\t1\t13839\t92.939\t60.644\t1\t13447000\t\t\t0\t19061\t0\t0\t0\t0\t\t"

My understanding is that read.big.matrix() then uses strsplit() to parse this line. However, strsplit() drops the empty string that follows the final separator, so for the line above the last (empty) column value is ignored. The data row therefore yields fewer values than the header, which raises the error.

A possible alternative to base::strsplit() would be stringr::str_split(), which does not remove empty values at the end. Is there any chance this could be implemented, or any other alternative that supports a partially empty last column?
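For illustration, here is a minimal base R sketch of the behaviour described above, together with one possible workaround that needs no extra package. The helper name split_keep_trailing is hypothetical and not part of bigmemory, and the sketch does not claim to show how read.big.matrix() parses lines internally.

```r
# Hypothetical helper (not part of bigmemory): split a delimited line
# while keeping empty fields at the end of the line.
split_keep_trailing <- function(line, sep = "\t") {
  # Appending one extra separator makes the last real field a non-final
  # token, so strsplit() no longer discards it when it is empty.
  strsplit(paste0(line, sep), sep, fixed = TRUE)[[1]]
}

header_line <- "col1\tcol2\tcol3\tcol4"
data_line   <- "1\t2\t\t"

header_fields <- strsplit(header_line, "\t", fixed = TRUE)[[1]]
naive_fields  <- strsplit(data_line, "\t", fixed = TRUE)[[1]]
fixed_fields  <- split_keep_trailing(data_line)

length(header_fields)  # number of header fields
length(naive_fields)   # one short: the final empty string is dropped
length(fixed_fields)   # same count as the header
```

The extra separator is only a sketch of the idea; stringr::str_split(), as suggested above, would be another way to keep the trailing empty field.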

Best regards, Nicolas

privefl commented 6 years ago

Package bigmemory is meant for storing numeric matrices. It seems you have some strings in your data, which big.matrix objects do not currently handle. For the same reason, I'm not sure that using type = "character" is possible.

How big is your data? You could maybe read it in chunks with data.table::fread, storing the numeric columns in a big.matrix and the other information in a character matrix.
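A rough sketch of that chunked approach might look like the following. It assumes the data.table and bigmemory packages are installed. The demo file, the column names, the chunk size and the row-counting step are all assumptions made for illustration, not details of the reporter's data.

```r
library(data.table)
library(bigmemory)

# Small tab-separated demo file (hypothetical data) with empty trailing values.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tx\ty",
             "a\t1\t0.5",
             "b\t2\t",
             "c\t3\t1.5",
             "d\t4\t2.5",
             "e\t5\t"), path)

# Column names from the header only (nrows = 0 reads no data rows).
col_names <- names(fread(path, sep = "\t", nrows = 0))
num_cols  <- c("x", "y")   # assumed to be known in advance
chr_cols  <- "id"

# Counting rows with readLines() is fine for a demo; a really large file
# would need a cheaper way to count lines.
n_rows <- length(readLines(path)) - 1L

bm  <- big.matrix(nrow = n_rows, ncol = length(num_cols), type = "double",
                  dimnames = list(NULL, num_cols))
chr <- matrix(NA_character_, nrow = n_rows, ncol = length(chr_cols),
              dimnames = list(NULL, chr_cols))

chunk_size <- 2L
for (start in seq(1L, n_rows, by = chunk_size)) {
  n <- min(chunk_size, n_rows - start + 1L)
  # Skip the header line plus the data rows already processed.
  chunk <- fread(path, sep = "\t", header = FALSE, skip = start, nrows = n,
                 col.names = col_names)
  rows <- start:(start + n - 1L)

  num_part <- as.matrix(chunk[, num_cols, with = FALSE])
  storage.mode(num_part) <- "double"   # all-empty chunks may be read as logical
  bm[rows, ] <- num_part
  chr[rows, ] <- as.matrix(chunk[, chr_cols, with = FALSE])
}

bm[, ]   # numeric part, with NA where the file had empty values
chr      # character part
```

For data that does not fit in memory, the big.matrix would presumably be created with the backingfile and descriptorfile arguments, as in the original call.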

nnalpas commented 6 years ago

Hello, yes, I noticed afterwards that only numeric values are allowed, my bad. However, the error would still arise for a numeric matrix stored in a file whose first row after the header has empty values in the last column, such as:

col1\tcol2\tcol3\tcol4
1\t2\t\t
1\t2\t3\t4
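To make the example concrete, the sketch below writes that sample to a temporary file and compares two ways of counting fields in base R. It only illustrates the mismatch; whether read.big.matrix() counts fields in exactly this way is an assumption based on the description earlier in this thread.

```r
# Write the three-line sample to a temporary file ("\t" is a real tab here).
path <- tempfile(fileext = ".txt")
writeLines(c("col1\tcol2\tcol3\tcol4",
             "1\t2\t\t",
             "1\t2\t3\t4"), path)

# Field counts per line when each line is split with strsplit():
# the first data row comes out one field short of the header.
lines    <- readLines(path)
n_fields <- lengths(strsplit(lines, "\t", fixed = TRUE))
n_fields

# The file itself is well-formed: read.delim() sees four columns and
# turns the empty cells into NA.
df <- read.delim(path, header = TRUE, sep = "\t")
dim(df)
df
```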

Anyway, I think it could just be that my data is messy, and the fix I suggested might not be useful for other people. You're right, I will just stick with fread for the time being.

Thanks again. Best regards.