hansenlab / bsseq

Devel repository for bsseq

Represent a CpG with single row during import #128

Closed demis001 closed 1 year ago

demis001 commented 1 year ago

@hansenlab

I have been using bsmap -> MOABS for a long time and am recently interested in doing multivariate analysis in R. I did paired-end reads for 100 cases and controls. Whenever I import a coverage file from Bismark (ADB_102_S45.cov.gz) into R using bsseq, I get twice the number of CpGs in the human genome, i.e., a single CpG is represented by two rows in the coverage file. I expected at most ~29 million rows in the GRanges object, but I got 58896438. Positions 10469 and 10470 are the same CpG. Any idea how to fix this?

    > granges(bs.black)
    GRanges object with 58896438 ranges and 0 metadata columns:
                 seqnames    ranges strand
           [1]          1     10469      *
           [2]          1     10470      *
           [3]          1     10471      *
           [4]          1     10472      *
           [5]          1     10484      *
           ...        ...       ...    ...
    [58896434]         MT     16496      *
    [58896435]         MT     16542      *
    [58896436]         MT     16543      *
    [58896437]         MT     16565      *
    [58896438]         MT     16566      *

@demis001

kasperdanielhansen commented 1 year ago

This sounds like a strand issue. Many tools report 2 (meth, cov) counts for each CpG coming from the forward and reverse strand, and we usually want to pool those values.
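
To illustrate the pooling: a CpG's forward-strand C sits at position p and its reverse-strand C at p + 1, so the two rows can be collapsed into one by summing their counts. A minimal Python sketch (the positions and counts below are made up for illustration):

```python
# Hypothetical coverage rows: (chrom, pos, methylated_count, unmethylated_count)
rows = [
    ("1", 10469, 5, 2),  # forward-strand C of a CpG
    ("1", 10470, 4, 3),  # reverse-strand C of the same CpG
    ("1", 10484, 6, 1),  # a CpG observed on one strand only
]

def pool_strands(rows):
    """Sum (meth, unmeth) counts when two adjacent rows are the two strands of one CpG."""
    pooled = []
    i = 0
    while i < len(rows):
        chrom, pos, meth, unmeth = rows[i]
        if i + 1 < len(rows) and rows[i + 1][0] == chrom and rows[i + 1][1] == pos + 1:
            meth += rows[i + 1][2]      # add the reverse-strand counts
            unmeth += rows[i + 1][3]
            i += 2
        else:
            i += 1
        pooled.append((chrom, pos, meth, unmeth))  # report at the forward-strand position
    return pooled

print(pool_strands(rows))  # → [('1', 10469, 9, 5), ('1', 10484, 6, 1)]
```

This is roughly what strand collapsing does conceptually; bsseq's import is meant to take care of it for you, so you should not need to do it by hand.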

This usually works, so I am going to need some more detail on what exactly you're doing, to help you.

Best, Kasper


demis001 commented 1 year ago

Thanks, Kasper, for the prompt reply. I tried to collapse consecutive rows into a single row in each replicate file using a simple Python script. Even though each of my 100 files now has fewer than 29 million rows, the merged GRanges object was still larger than I anticipated (above 29 million). Is there a better approach to go from Bismark to bsseq that resolves this issue?

""" import sys

def merge_consecutive_lines(lines): merged_lines = []

def process_and_append(start_line, end_line):
    max_col4 = max(float(line.split("\t")[3]) for line in lines[start_line:end_line + 1])
    sum_col5 = sum(int(line.split("\t")[4]) for line in lines[start_line:end_line + 1])
    sum_col6 = sum(int(line.split("\t")[5]) for line in lines[start_line:end_line + 1])
    merged_lines.append("\t".join([lines[start_line].split("\t")[0], lines[start_line].split("\t")[1], lines[end_line].split("\t")[2], str(max_col4), str(sum_col5), str(sum_col6)]))

start_line = 0

for i in range(1, len(lines)):
    current_line = lines[i].split("\t")
    prev_line = lines[i - 1].split("\t")

    if int(current_line[1]) != int(prev_line[2]) + 1:
        process_and_append(start_line, i - 1)
        start_line = i

process_and_append(start_line, len(lines) - 1)

return merged_lines

if name == "main":

if len(sys.argv) != 2:

    #print("Usage: python merge_lines.py input.txt")
    #sys.exit(1)

#input_file = sys.argv[1]

#with open(input_file, 'r') as file:
    #lines = file.read().splitlines()
lines = sys.stdin.read().splitlines()

merged_lines = merge_consecutive_lines(lines)

for line in merged_lines:
    # Print columns 1, 2, 4, 5, and 6
    columns = line.split("\t")
    print("\t".join([columns[0], columns[1], columns[3], columns[4], columns[5]]))

"""

Best, Dereje

demis001 commented 1 year ago

I am sorry! Are you saying the way the data is loaded into bsseq is correct? I used the DSS package, which is built on top of bsseq, and got erroneous results, so I went back to bsseq to see how the merging is handled there.

kasperdanielhansen commented 1 year ago

So you're processing the bismark files in python? I would guess that's the issue. We have a function (and a manpage) which reads in the raw bismark output. I suggest you do that.

Best, Kasper


demis001 commented 1 year ago

No, I didn't! The manual of the DSS package suggested extracting four columns from the *.cov.gz Bismark output file, and I processed that following their guideline. In the end, I found the number of DMCs reported was higher than what exists in the human genome. I went back to fix the input, merged the reads from both strands using the Python script, and redid the procedure, but I got the same issue after the complete DMC and DMR analysis.
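
For reference, the four columns DSS wants per site are chromosome, position, total read count (N), and methylated read count (X). A hedged Python sketch of deriving them from one Bismark .cov row (column meanings taken from the Bismark docs: chrom, start, end, methylation %, methylated count, unmethylated count; the sample row is invented):

```python
def cov_to_dss(row):
    """Turn one Bismark .cov line into (chr, pos, N, X) as used by DSS."""
    chrom, start, _end, _pct, meth, unmeth = row.rstrip("\n").split("\t")
    meth, unmeth = int(meth), int(unmeth)
    return (chrom, int(start), meth + unmeth, meth)  # N = meth + unmeth, X = meth

print(cov_to_dss("1\t10469\t10469\t71.4\t5\t2\n"))  # → ('1', 10469, 7, 5)
```

Note that this conversion alone does not pool the two strands, which is why per-strand rows inflate the site count downstream.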

I then decided to use bsseq without DSS to see how the bsseq handles the import and Bismark mapping. That is where I am right now.

Dereje

kasperdanielhansen commented 1 year ago

Yeah, so see if ?read.bismark helps you. It should contain clear statements on how to get from Bismark to bsseq, and if it doesn't, I am happy to amend the instructions. But from my POV you should never have to edit the raw output files. However, bismark can output stuff in a few different formats.

Best, Kasper


demis001 commented 1 year ago

I am talking to @FelixKrueger, the developer of Bismark. He suggested running coverage2cytosine with the --merge_CpG option. I don't know if this is the intended input for bsseq downstream analysis. I am running it right now to see what the coverage looks like. I will look into that.
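
A quick sanity check on the merged output is to count same-chromosome positions that sit exactly 1 bp apart; after a successful --merge_CpG run that count should drop to (near) zero. A small sketch, assuming a sorted list of (chrom, pos) tuples extracted from the coverage file:

```python
def count_adjacent(sites):
    """Count consecutive sites 1 bp apart on the same chromosome."""
    return sum(
        1
        for (chrom_a, pos_a), (chrom_b, pos_b) in zip(sites, sites[1:])
        if chrom_a == chrom_b and pos_b == pos_a + 1
    )

# Before merging, the two strands of one CpG show up as adjacent rows:
print(count_adjacent([("1", 10469), ("1", 10470), ("1", 10484)]))  # → 1
```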

Dereje

demis001 commented 1 year ago

It would be nice if you wrote a complete vignette including Bismark for both single-end and paired-end reads.

kasperdanielhansen commented 1 year ago

Sure, I mean, you could start by following my advice and read the man page and see if that clarifies things for you.

I do see the User's Guide section on reading data is in dire need of updating.

Best, Kasper


demis001 commented 1 year ago

Yes, I looked into read.bismark(). For files:, it says: "The path to the files created by running Bismark's methylation extractor, one sample per file. Files ending in .gz or .bz2 will be automatically decompressed to tempfile(). We strongly recommend you use the 'genome wide cytosine report' output files. See section 'File formats' for further details." That is what I read in bsseq. I provided the *.cov.gz file from the Bismark methylation extractor. This represents each "C" in the genome, not the aggregate CpG from both strands.
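
To make the difference concrete, here is a hedged Python sketch of pooling a genome-wide cytosine report down to one row per CpG (column layout assumed from the Bismark documentation: chrom, pos, strand, methylated count, unmethylated count, context, trinucleotide; the sample rows are invented):

```python
def pool_report(lines):
    """Pool CpG-context rows so each CpG is keyed by its forward-strand position."""
    out = {}
    for line in lines:
        chrom, pos, strand, meth, unmeth, context, _tri = line.split("\t")
        if context != "CG":
            continue  # skip CHG/CHH cytosines
        pos = int(pos)
        key = (chrom, pos - 1 if strand == "-" else pos)  # anchor at the + strand C
        m, u = out.get(key, (0, 0))
        out[key] = (m + int(meth), u + int(unmeth))
    return out

report = [
    "1\t10469\t+\t5\t2\tCG\tCGG",
    "1\t10470\t-\t4\t3\tCG\tCGA",
]
print(pool_report(report))  # → {('1', 10469): (9, 5)}
```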

demis001 commented 1 year ago

In the MOABS pipeline, I always get a single row for the C and G of one CpG, and likewise one row per site in the differentially methylated "C" (DMC) file. In your case, I am getting one row for the "C" and one row for the "G" after processing. Here is a sample output from MOABS mcall:

    #chrom  start  end    totalC_0  nominalRatio_0  ratioCI_0    totalC_1  nominalRatio_1  ratioCI_1    nominalDif_1-0  credibleDif_1-0  difCI_1-0      p_sim_1_v_0  p_fet_1_v_0  class
    1       10468  10470  24        0.583           0.398,0.769  20        0.75            0.562,0.938  0.167           0                -0.0722,0.365  0.0309       0.342        hypo
    1       10470  10472  25        0.68            0.504,0.856  23        0.739           0.562,0.916  0.0591          0                -0.154,0.258   0.0576       0.756        hypo
    1       10483  10485  27        0.815           0.66,0.969   32        0.906           0.781,1      0.0914          0                -0.0583,0.244  0.0556       0.45         hypo