Closed demis001 closed 1 year ago
This sounds like a strand issue. Many tools report 2 (meth, cov) counts for each CpG coming from the forward and reverse strand, and we usually want to pool those values.
This usually works, so I am going to need some more detail on what exactly you're doing, to help you.
Best, Kasper
On Fri, Oct 27, 2023 at 10:55 AM Dereje Jima @.***> wrote:
@hansenlab https://github.com/hansenlab
I have been using bsmap -> MOABS for a long time and am recently interested in doing multivariate analysis in R. I did paired-end reads for 100 cases and control. Whenever I import the coverage file from Bismark (ADB_102_S45.cov.gz) to R using bsseq, I get twice the CpG in the human genome. i.e., a single CpG represented by two rows in the coverage file. I expected 29 million rows in the GRange object at most, but I got 58896438. 10469 and 10470 are the same CpG. Any idea how to fix this?
granges(bs.black) GRanges object with 58896438 ranges and 0 metadata columns: usernames ranges strand
[1] 1 10469 [2] 1 10470 [3] 1 10471 [4] 1 10472 [5] 1 10484 ... ... ... ... [58896434] MT 16496 [58896435] MT 16542 [58896436] MT 16543 [58896437] MT 16565 [58896438] MT 16566
@demis001 https://github.com/demis001
— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/128, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH64VAIE4PKTLXV4FC3YBPDM3AVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE3DKNRZHA4DOOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>
-- Best, Kasper
Thanks, Kasper, for the prompt reply. I tried to represent consecutive rows with a single row in each replicate file using a simple Python script. Even though my 100 files have less than 29 million rows, the merge GRange object differed from what I anticipated (still higher than 29 million). Is there a better approach to go from Bismark to bsseq to resolve this issue?
""" import sys
def merge_consecutive_lines(lines): merged_lines = []
def process_and_append(start_line, end_line):
max_col4 = max(float(line.split("\t")[3]) for line in lines[start_line:end_line + 1])
sum_col5 = sum(int(line.split("\t")[4]) for line in lines[start_line:end_line + 1])
sum_col6 = sum(int(line.split("\t")[5]) for line in lines[start_line:end_line + 1])
merged_lines.append("\t".join([lines[start_line].split("\t")[0], lines[start_line].split("\t")[1], lines[end_line].split("\t")[2], str(max_col4), str(sum_col5), str(sum_col6)]))
start_line = 0
for i in range(1, len(lines)):
current_line = lines[i].split("\t")
prev_line = lines[i - 1].split("\t")
if int(current_line[1]) != int(prev_line[2]) + 1:
process_and_append(start_line, i - 1)
start_line = i
process_and_append(start_line, len(lines) - 1)
return merged_lines
if name == "main":
#print("Usage: python merge_lines.py input.txt")
#sys.exit(1)
#input_file = sys.argv[1]
#with open(input_file, 'r') as file:
#lines = file.read().splitlines()
lines = sys.stdin.read().splitlines()
merged_lines = merge_consecutive_lines(lines)
for line in merged_lines:
# Print columns 1, 2, 4, 5, and 6
columns = line.split("\t")
print("\t".join([columns[0], columns[1], columns[3], columns[4], columns[5]]))
"""
Best, Dereje
I am sorry! Are you saying the way data is loaded into bsseq is correct? I used the DSS package that was built based on bsseq and got erroneous results. and went back to using bsseq and saw how the merging is handled in bsseq.
So you're processing the bismark files in python? I would guess that's the issue. We have a function (and a manpage) which reads in the raw bismark output. I suggest you do that.
Best, Kasper
On Fri, Oct 27, 2023 at 12:27 PM Dereje Jima @.***> wrote:
I am sorry! Are you saying the way data is loaded into bsseq is correct? I used the DSS package that was built based on bsseq and got erroneous results. and went back to using bsseq and saw how the merging is handled in bsseq.
— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/128#issuecomment-1783190770, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH4JMDCACFD4MWQKATDYBPOGPAVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBTGE4TANZXGA . You are receiving this because you commented.Message ID: @.***>
-- Best, Kasper
No, I didn't! The manual of the DSS package suggested extracting four columns of the input file from the *.cov.gz" Bistmark output file. I then processed that using their guideline. In the end, I found the number of DMC reporter was off than what exist in the human genome. I then went back again to fix the input and redid the procedure after merging the reads from both strand using the Python script, again got the same issue after the complete DMC and DMR analysis.
I then decided to use bsseq without DSS to see how the bsseq handles the import and Bismark mapping. That is where I am right now.
Dereje
Yeah, so see if ?read.bismark helps you. It should contain clear statements on how to get from Bismark to bsseq, and if it doesn't, I am happy to amend the instructions. But from my POV you should never have to edit the raw output files. However, bismark can output stuff in a few different formats.
Best, Kasper
On Fri, Oct 27, 2023 at 1:11 PM Dereje Jima @.***> wrote:
No, I didn't! The manual of the DSS package suggested extracting four columns of the input file from the *.cov.gz" Bistmark output file. I then processed that using their guideline. In the end, I found the number of DMC reporter was off than what exist in the human genome. I then went back again to fix the input and redid the procedure after merging the reads from both strand using the Python script, again got the same issue after the complete DMC and DMR analysis.
I then decided to use bsseq without DSS to see how the bsseq handles the import and Bismark mapping. That is where I am right now.
Dereje
— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/128#issuecomment-1783246127, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH4V3Z5AMN7THCTQAUDYBPTLJAVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBTGI2DMMJSG4 . You are receiving this because you commented.Message ID: @.***>
-- Best, Kasper
I am talking to @FelixKrueger, a developer of Bismark. He suggested running the "coverage2cytosine --merge_CpG" option. I don't know if this is the intended input for bsseq downstream analysis. I am running it right now to see how the coverage looks like. I will look into that.
Dereje
It would be nice if you wrote a complete vignette including Bismark for both single-end and paired-end reads.
Sure, I mean, you could start by following my advice and read the man page and see if that clarifies things for you.
I do see the User's Guide section on reading data is in dire need of updating.
Best, Kasper
On Fri, Oct 27, 2023 at 2:07 PM Dereje Jima @.***> wrote:
It would be nice if you wrote a complete vignette including Bismark for both single-end and paired-end reads.
— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/128#issuecomment-1783311420, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHYIAHGHR4SQCF5O27TYBPZ6FAVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBTGMYTCNBSGA . You are receiving this because you commented.Message ID: @.***>
-- Best, Kasper
Yes, I looked into the read.bismark(). files: It says, The path to the files created by running Bismark's methylation extractor, one sample per file. Files ending in .gz or .bz2 will be automatically decompressed to [tempfile](http://127.0.0.1:14081/help/library/bsseq/help/tempfile)(). We strongly recommend you use the 'genome wide cytosine report' output files. See section 'File formats' for further details.
That is what I read in bsseq. I provided the *.cov.gz
file from Bismark methylation extractor. This represents each "C" in the genome not the aggregate CpG. from both strand.
In the MOABS pipeline, I always get a single row for C and G in the single CpG. In the differentially methylated "C" (DMC) file, I always get one row in the DMC file. In your case, I am getting a row for "C" and a row for "G" after processing. Here is a sample output from MOABS mcall:
#chrom start end totalC_0 nominalRatio_0 ratioCI_0 totalC_1 nominalRatio_1 ratioCI_1 nominalDif_1-0 credibleDif_1-0 difCI_1-0 p_sim_1_v_0 p_fet_1_v_0 class
1 10468 10470 24 0.583 0.398,0.769 20 0.75 0.562,0.938 0.167 0 -0.0722,0.365 0.0309 0.342 hypo
1 10470 10472 25 0.68 0.504,0.856 23 0.739 0.562,0.916 0.0591 0 -0.154,0.258 0.0576 0.756 hypo
1 10483 10485 27 0.815 0.66,0.969 32 0.906 0.781,1 0.0914 0 -0.0583,0.244 0.0556 0.45 hypo
@hansenlab
I have been using bsmap -> MOABS for a long time and am recently interested in doing multivariate analysis in R. I did paired-end reads for 100 cases and control. Whenever I import the coverage file from Bismark (ADB_102_S45.cov.gz) to R using bsseq, I get twice the CpG in the human genome. i.e., a single CpG represented by two rows in the coverage file. I expected 29 million rows in the GRange object at most, but I got 58896438. 10469 and 10470 are the same CpG. Any idea how to fix this?
@demis001