hansenlab / minfi

Devel repository for minfi

readGEORawFile broken #111

Closed hinri closed 6 years ago

hinri commented 7 years ago

Using minfi version 1.22.1 in R 3.4.0, I'm encountering an error when loading a GEO (example) data set. There seems to be a problem in constructing the pData DataFrame. Could you please have a look at it? Thanks in advance, Hindrik

gmset=readGEORawFile("GSE29290_Matrix_Signal.txt",Uname="Signal_A",Mname="Signal_B",sep="\t")
Read 485577 rows and 45 (of 67) columns from 0.156 GB file in 00:00:03
Error in .local(assays, ...) : unused argument (pData = <S4 object of class "DataFrame">)

schiacchia commented 7 years ago

Hinri,

Did you manage to resolve this issue on your own? I am currently struggling with the same error.

Thanks, S

ellendejong commented 6 years ago

Hi everyone,

I tried to figure out where the code raises the error. In its return statement, readGEORawFile calls GenomicMethylSet. This is where the first error is raised:

return(GenomicMethylSet(gr = gr[ind2, ], 
    Meth = mat[ind1, mindex], 
    Unmeth = mat[ind1, uindex], 
    pData = pData, 
    preprocessMethod = preprocessing,
    annotation = c(array = array, annotation = annotation)))

Error in .local(assays, ...) : 
  unused argument (pData = <S4 object of class "DataFrame">)

The error reports an unused argument, so I deleted the pData argument and ran the code again:

return(GenomicMethylSet(gr = gr[ind2, ], 
    Meth = mat[ind1, mindex], 
    Unmeth = mat[ind1, uindex], 
    preprocessMethod = preprocessing,
    annotation = c(array = array, annotation = annotation)))

Error in FUN(X[[i]], ...) : 
  assay colnames() must be NULL or equal colData rownames()

At some point, GenomicMethylSet calls the function SummarizedExperiment.

The SummarizedExperiment class is a matrix-like container where rows represent features of interest (e.g. genes, transcripts, exons, etc...) and columns represent samples (with sample data summarized as a DataFrame). A SummarizedExperiment object contains one or more assays, each represented by a matrix-like object of numeric or other mode.

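This function checks that the colnames of each assay are either NULL or identical to the colData rownames, and likewise that the assay rownames are either NULL or identical to the rowData rownames / rowRanges names: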

        colnames <- colnames(x)
        test <- is.null(colnames) || identical(colnames, ans_colnames)
        if (!test)
            stop("assay colnames() must be NULL or equal colData rownames()")

        rownames <- rownames(x)
        test <- test &&
            is.null(rownames) || identical(rownames, ans_rownames)
        if (!test) {
            txt <- "assay rownames() must be NULL or equal rowData rownames() /
                    rowRanges names()"
            stop(paste(strwrap(txt, exdent=2), collapse="\n"))
        }

In the case of dataset GSE29290_Matrix_Signal.txt, the colnames combine the sample name with either Signal_A or Signal_B. Indeed, the colnames of the methylated and unmethylated columns are not equal:

head(colnames(mat))
[1] "Sample_1.Signal_A" "Sample_1.Signal_B" "Sample_2.Signal_A"
[4] "Sample_2.Signal_B" "Sample_3.Signal_A" "Sample_3.Signal_B"
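
To illustrate why these names trip the check, here is a minimal base-R sketch on a toy matrix with the same column layout. The helper `colnames_ok` is a hypothetical stand-in for the colnames test quoted above, not the SummarizedExperiment code itself, and taking the expected names from the first assay is an assumption made for this illustration.

```r
# Toy matrix mimicking the column layout of the GEO signal file.
mat <- matrix(seq_len(12), nrow = 2,
              dimnames = list(c("cg01", "cg02"),
                              c("Sample_1.Signal_A", "Sample_1.Signal_B",
                                "Sample_2.Signal_A", "Sample_2.Signal_B",
                                "Sample_3.Signal_A", "Sample_3.Signal_B")))

# Select unmethylated / methylated columns by name pattern,
# as Uname = "Signal_A" and Mname = "Signal_B" do in the call above.
uindex <- grep("Signal_A", colnames(mat), fixed = TRUE)
mindex <- grep("Signal_B", colnames(mat), fixed = TRUE)
meth   <- mat[, mindex, drop = FALSE]
unmeth <- mat[, uindex, drop = FALSE]

# Hypothetical stand-in for the quoted test:
# assay colnames must be NULL or identical to the expected names.
colnames_ok <- function(assay, expected) {
  cn <- colnames(assay)
  is.null(cn) || identical(cn, expected)
}

# Assumption for this sketch: expected names come from the first assay.
expected <- colnames(meth)

colnames_ok(meth, expected)    # TRUE
colnames_ok(unmeth, expected)  # FALSE: "...Signal_A" vs "...Signal_B"
```

The two assays describe the same samples, but their column names differ in the suffix, so a strict `identical()` comparison fails.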

Since the methylated and unmethylated values are selected from the object mat by column index when GenomicMethylSet is called, the colnames can be changed. Right before calling GenomicMethylSet, I changed the colnames so that only the sample name is kept (removing the part of the string after the dot).

After those changes, I no longer receive any error messages. I still have to test whether this change is safe and doesn't lead to other problems. However, as long as SummarizedExperiment is used, there is no option other than changing the colnames and/or rownames.
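
The renaming described above can be sketched in base R as follows. This is only an illustration on a toy matrix, not the patch that went into minfi; it assumes every column name has the form `<sample>.<signal>`, so that dropping everything from the last dot onward leaves the sample name.

```r
# Toy matrix mimicking the column layout of the GEO signal file.
mat <- matrix(seq_len(12), nrow = 2,
              dimnames = list(c("cg01", "cg02"),
                              c("Sample_1.Signal_A", "Sample_1.Signal_B",
                                "Sample_2.Signal_A", "Sample_2.Signal_B",
                                "Sample_3.Signal_A", "Sample_3.Signal_B")))

uindex <- grep("Signal_A", colnames(mat), fixed = TRUE)
mindex <- grep("Signal_B", colnames(mat), fixed = TRUE)

meth   <- mat[, mindex, drop = FALSE]
unmeth <- mat[, uindex, drop = FALSE]

# Drop the trailing ".<signal>" part (everything from the last dot onward).
# Assumption: the suffix itself contains no dot.
strip_suffix <- function(x) sub("\\.[^.]*$", "", x)

colnames(meth)   <- strip_suffix(colnames(meth))
colnames(unmeth) <- strip_suffix(colnames(unmeth))

colnames(meth)                             # "Sample_1" "Sample_2" "Sample_3"
identical(colnames(meth), colnames(unmeth))  # TRUE
```

Renaming after subsetting keeps the methylated and unmethylated columns paired by position, so it is worth confirming that `uindex` and `mindex` list the samples in the same order before relying on this.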

kasperdanielhansen commented 6 years ago

This is fixed in minfi 1.25.1. Reproducible example:

library(GEOquery)
library(minfi)  # provides readGEORawFile
getGEOSuppFiles("GSE29290")
gmset=readGEORawFile("GSE29290/GSE29290_Matrix_Signal.txt",Uname="Signal_A",Mname="Signal_B",sep="\t")

The fix has been pushed to GitHub and submitted to the Bioconductor build.

alexvnesta commented 6 years ago

Hello all,

I have a very similar/the same issue with the latest version:

mSetSq <- getGenomicRatioSetFromGEO(GSE = "GSE30654")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30654/matrix/
OK
Found 3 file(s)
GSE30654-GPL13534_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30654/matrix/GSE30654-GPL13534_series_matrix.txt.gz'
Content type 'application/x-gzip' length 309790493 bytes (295.4 MB)
==================================================
downloaded 295.4 MB

File stored at: 
/var/folders/14/zdpg3tzd1kq5_w7qybng_49r0000gn/T//RtmpQ6jdxd/GPL13534.soft
GSE30654-GPL6947_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30654/matrix/GSE30654-GPL6947_series_matrix.txt.gz'
Content type 'application/x-gzip' length 38255623 bytes (36.5 MB)
==================================================
downloaded 36.5 MB

File stored at: 
/var/folders/14/zdpg3tzd1kq5_w7qybng_49r0000gn/T//RtmpQ6jdxd/GPL6947.soft
GSE30654-GPL8490_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30654/matrix/GSE30654-GPL8490_series_matrix.txt.gz'
Content type 'application/x-gzip' length 33013631 bytes (31.5 MB)
==================================================
downloaded 31.5 MB

File stored at: 
/var/folders/14/zdpg3tzd1kq5_w7qybng_49r0000gn/T//RtmpQ6jdxd/GPL8490.soft
Error in .local(assays, ...) : 
  unused argument (pData = list(title = c(16, 17, 7, 2, 3, 1, 13, 11, 12, 9, 10, 4, 5, 6, 8, 15, 14, 18, 19, 23, 25, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 37, 36, 49, 48, 51, 50, 53, 52, 55, 54, 56, 58, 57, 39, 38, 40, 41, 43, 42, 45, 44, 47, 46, 60, 59, 33, 34, 32, 35, 63, 64, 66, 61, 62, 65, 67, 68, 69, 70, 71, 72, 90, 92, 80, 81, 75, 78, 79, 85, 82, 89, 88, 86, 87, 74, 83, 76, 77, 84, 91, 93, 73, 101, 132, 131, 147, 144, 145, 142, 146, 143, 103, 104, 114, 111, 112, 113, 110, 151, 150, 119, 116, 117, 118, 
115, 123, 121, 122, 120, 109, 106, 107, 108, 105, 134, 133, 149, 148, 141, 139, 140, 137, 138, 153, 152, 130, 126, 127, 128, 124, 129, 125, 100, 98, 96, 99, 97, 136, 135, 95, 94, 102), geo_accession = 1:153, status = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  not all columns named in 'colClasses' exist
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  not all columns named in 'colClasses' exist
3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  not all columns named in 'colClasses' exist
4: In getGenomicRatioSetFromGEO(GSE = "GSE30654") :
  More than one ExpressionSet found:
GSE30654-GPL13534_series_matrix.txt.gzGSE30654-GPL6947_series_matrix.txt.gzGSE30654-GPL8490_series_matrix.txt.gz
Using entry 1