sartorlab / methylSig

R package for DNA methylation analysis
17 stars 5 forks source link

methylSig not working for test data #49

Closed mapostolides closed 3 years ago

mapostolides commented 3 years ago

Hello,

I have tried running the following code which is directly from the readme on this github

files = c(
    system.file('extdata', 'bis_cov1.cov', package='methylSig'),
    system.file('extdata', 'bis_cov2.cov', package='methylSig')
)

 bs = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c('test1','test2')),
    rmZeroCov = FALSE,
    strandCollapse = FALSE
)

bs = filter_loci_by_coverage(bs, min_count = 5, max_count = 500)

pData(bsseq_stranded)
windowed_bs = tile_by_windows(bs = bs, win_size = 10000)

diff_gr = diff_methylsig(
    bs = windowed_bs,
    group_column = 'Type',
    comparison_groups = c('case' = 'cancer', 'control' = 'normal'),
    disp_groups = c('case' = TRUE, 'control' = TRUE),
    local_window_size = 0,
    t_approx = TRUE,
    n_cores = 1)

diff_gr

However, I am getting the following error upon running the diff_methylsig function:

Error in diff_methylsig(bs = windowed_bs, group_column = "Type", comparison_groups = c(case = "cancer", : group_column: Type not in column names of pData(bs):

I am not sure what this "group_column" is or what "Type" is, as there seems to be no existing "Type" column in the data. Am I missing something?

Thanks, Michael

rcavalcante commented 3 years ago

Hi Michael,

I'm following along from the vignette on Bioconductor. I think the part that's missing from your code is here. We load in a different test dataset than what was used in the previous vignette sections.

# Load data for use in the rest of the vignette
data(BS.cancer.ex, package = 'bsseqData')
bs = BS.cancer.ex[1:10000]

bs = filter_loci_by_coverage(bs, min_count = 5, max_count = 500)

In particular, now there is a Type column in the colData of the bs object:

> pData(bs)
DataFrame with 6 rows and 2 columns
          Type        Pair
   <character> <character>
C1      cancer       pair1
C2      cancer       pair2
C3      cancer       pair3
N1      normal       pair1
N2      normal       pair2
N3      normal       pair3

I apologize if that was unclear, and thanks for your interest in methylSig.

Raymond

mapostolides commented 3 years ago

Thanks Raymond !

I did indeed miss those lines. I've added them, and now I see the following:

pData(bs)

DataFrame with 6 rows and 2 columns
Type        Pair
<character> <character>
C1      cancer       pair1
C2      cancer       pair2
C3      cancer       pair3
N1      normal       pair1
N2      normal       pair2
N3      normal       pair3

It seems that the BSSeq object is being taken from a pre-loaded test dataset BS.cancer.ex. It seems this meta-data of Type is required for methylSig to work.

files = c(
    system.file('extdata', 'bis_cov1.cov', package='methylSig'),
    system.file('extdata', 'bis_cov2.cov', package='methylSig')
)

When I check their contents, this is what I see:

pData(bs)

DataFrame with 2 rows and 0 columns

How might I add this metadata and get methylSig working when starting with files instead of a pre-loaded dataset?

Thanks!

rcavalcante commented 3 years ago

Hi Michael,

Let's begin at the top. For the purposes of demonstrating the bsseq::read.bismark function in the vignette, I use small data that doesn't have any associated phenotype data. It doesn't need any phenotype data because I'm not going to use it for any of the other functions in the package. To demonstrate the other functions I use the BS.cancer.ex data that came prepackaged. So here is that small data:

files = c(
    system.file('extdata', 'bis_cov1.cov', package='methylSig'),
    system.file('extdata', 'bis_cov2.cov', package='methylSig')
)

bsseq_stranded = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c('test1','test2')),
    rmZeroCov = FALSE,
    strandCollapse = FALSE
)

If one were to do pData(bsseq_stranded), you would see:

DataFrame with 2 rows and 0 columns

This is because in the bsseq::read.bismark call I assigned

colData = data.frame(row.names = c('test1','test2'))

This is an empty data.frame. To assign phenotype data to this example we could just as easily have said:

files = c(
    system.file('extdata', 'bis_cov1.cov', package='methylSig'),
    system.file('extdata', 'bis_cov2.cov', package='methylSig')
)

bsseq_stranded = bsseq::read.bismark(
    files = files,
    colData = data.frame(condition = c('case', 'control'), sex = c('Male', 'Female'), row.names = c('test1','test2')),
    rmZeroCov = FALSE,
    strandCollapse = FALSE
)

Then

> pData(bsseq_stranded)
DataFrame with 2 rows and 2 columns
        condition         sex
      <character> <character>
test1        case        Male
test2     control      Female

How might I add this metadata and get methylSig working when starting with files instead of a pre-loaded dataset?

You provide a data.frame to the colData parameter of bsseq::read.bismark that describes the files you're reading. Please read the documentation by doing ?bsseq::read.bismark. The rows of colData correspond to the columns of the BSseq object and its matrices (down to the names). You can then use the column names in pData(object) to tell the differential methylation test functions how to group the data. In the BS.cancer.ex example, Type was particular to that data, but every dataset may be different.

Hope that helps, Raymond

mapostolides commented 3 years ago

Thanks Raymond, that's helpful! I really appreciate the thorough reply.

mapostolides commented 3 years ago

Hi Raymond,

I now have my code set up as follows:

    system.file('extdata', 'bis_cov1.cov', package='methylSig'),
    system.file('extdata', 'bis_cov2.cov', package='methylSig')
)

 bs = bsseq::read.bismark(
    files = files,
    colData = data.frame(condition = c('case', 'control'), sex = c('Male', 'Female'), row.names = c('test1','test2')),
    rmZeroCov = FALSE,
    strandCollapse = FALSE
)

bs = filter_loci_by_coverage(bs, min_count = 5, max_count = 500)

windowed_bs = tile_by_windows(bs = bs, win_size = 10000)

pData(windowed_bs)

diff_gr = diff_methylsig(
    bs = windowed_bs,
    group_column = 'condition',
    comparison_groups = c('case' = 'case', 'control' = 'control'),
    disp_groups = c('case' = TRUE, 'control' = TRUE),
    local_window_size = 0,
    t_approx = TRUE,
    n_cores = 1)

As expected, my metadata is now as follows:

DataFrame with 2 rows and 2 columns
        condition         sex
      <character> <character>
test1        case        Male
test2     control      Female

However, now I am getting a different error when running the diff_methylsig function:

Error in base::rowSums(meth_mat[, case_idx]) : 'x' must be an array of at least two dimensions

Any idea why this might be happening?

Thanks!

rcavalcante commented 3 years ago

Hi Michael,

Again, the data you're using is not intended to be used for differential methylation testing. I made up the conditions to illustrate to you how you'd add phenotype data to your own data. Moreover, that data doesn't have replicates per groups, which will cause the internal calculations of the test for differential methylation to fail.

Please use the BS.cancer.ex if you want to try out the tests for differential methylation. That's what that dataset was intended for.

Raymond