harbourlab / SparK

Publication quality NGS track plotting
102 stars 20 forks source link

advice for plotting methylation data #15

Closed katiearacena closed 3 years ago

katiearacena commented 3 years ago

Hello again,

I am trying to visualize methylation data using sparK. I have bedgraph files with values ranging from 0 to 1. However when I plot the region it shows almost no signal though there are values ranging from 0 to 1. I have tried no smoothing and smoothing. Do you have any advice on getting this information to translate well onto the plot. My plot and code I used are below. Thank you!

sparK_plot

python SparK.py \
-pr chr4:103410000-103450000 \
-cf ${INPUT_DIR}/WGBS_EU_Flu_avg.bw.bedGraph \
-tf ${INPUT_DIR}/WGBS_EU_NI_avg.bw.bedGraph \
-tg 1 \
-cg 1 \
-gl WGBS \
-l Flu NI \
-gtf /ref/Homo_sapiens.GRCh37.Ensembl75.gtf \
-dg NFKB1 \
-cs 1.0 \
-o spark.NFKB1.WGBS
harbourlab commented 3 years ago

Hi! Have you tried plotting it without the „-cs“ option? If yes how does the output look like?

Sent from my iPhone

On Feb 2, 2021, at 4:03 PM, Katie Aracena notifications@github.com wrote:

 CAUTION: This email originated from outside the organization. DO NOT CLICK ON LINKS or OPEN ATTACHMENTS unless you know and trust the sender.

Hello again,

I am trying to visualize methylation data using sparK. I have bedgraph files with values ranging from 0 to 1. However when I plot the region it shows almost no signal though there are values ranging from 0 to 1. I have tried no smoothing and smoothing. Do you have any advice on getting this information to translate well onto the plot. My plot and code I used are below. Thank you!

[sparK_plot]https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F42899544%2F106662054-89086e00-6567-11eb-90d6-2e7e0ee44729.png&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7C72c88b9ab2594e0e0dfe08d8c7bdf76b%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637478966002313535%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HDD0gMWt1qfF%2BJPHcdfX3k%2BpTwSWNCSvXc4UqOoRCJ8%3D&reserved=0

python SparK.py \ -pr chr4:103410000-103450000 \ -cf ${INPUT_DIR}/WGBS_EU_Flu_avg.bw.bedGraph \ -tf ${INPUT_DIR}/WGBS_EU_NI_avg.bw.bedGraph \ -tg 1 \ -cg 1 \ -gl WGBS \ -l Flu NI \ -gtf /ref/Homo_sapiens.GRCh37.Ensembl75.gtf \ -dg NFKB1 \ -cs 1.0 \ -o spark.NFKB1.WGBS

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fharbourlab%2FSparK%2Fissues%2F15&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7C72c88b9ab2594e0e0dfe08d8c7bdf76b%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637478966002323528%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=amF7UGh5QQ8eC%2BeFWS77%2F08EiIe53aMtd6BsiUfg6fc%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHANDYQQBVA2WPZ2FM3R2G3S5BSBNANCNFSM4W7SZZ3A&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7C72c88b9ab2594e0e0dfe08d8c7bdf76b%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637478966002323528%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=uCM1ZhMUMkq8zN6eQgtI%2F%2FwipRMTine5BgZOKXr2RF8%3D&reserved=0.

katiearacena commented 3 years ago

yes! Here is the output. I have looked at the bedgraph file in that region and i know that there are values greater than 0.2 in this region.

spark_no_cs_flag

I wonder if the problem is the format of my bedgraph.

chr4 100000275 100000276 0.977952380952381 chr4 100000482 100000483 0.971 chr4 100000656 100000657 0.969238095238095 chr4 100000807 100000808 0.960619047619048 chr4 100000997 100000998 0.963095238095238 chr4 100001015 100001016 0.954047619047619

Perhaps the gaps between cpg sites are eliminating the signal?

Thanks!

harbourlab commented 3 years ago

Would you mind sharing the bedgraph file of that region? Just that region is enough, doesn’t have to be the entire one.

Sent from my iPhone

On Feb 2, 2021, at 6:25 PM, Katie Aracena notifications@github.com wrote:

 CAUTION: This email originated from outside the organization. DO NOT CLICK ON LINKS or OPEN ATTACHMENTS unless you know and trust the sender.

yes! Here is the output. I have looked at the bedgraph file in that region and i know that there are values greater than 0.2 in this region. [spark_no_cs_flag]https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F42899544%2F106675292-83685380-657a-11eb-9cb5-a3fab70ea476.png&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Ce92d8bda26134b889bed08d8c7d1c97b%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637479051163036306%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=7RsBqJtSXxV6IXWIuKkqmq9feR1dtTBcVzG2yLbaFsI%3D&reserved=0

I wonder if the problem is the format of my bedgraph.

chr4 100000275 100000276 0.977952380952381 chr4 100000482 100000483 0.971 chr4 100000656 100000657 0.969238095238095 chr4 100000807 100000808 0.960619047619048 chr4 100000997 100000998 0.963095238095238 chr4 100001015 100001016 0.954047619047619

Perhaps the gaps between cpg sites are eliminating the signal?

Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fharbourlab%2FSparK%2Fissues%2F15%23issuecomment-772084937&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Ce92d8bda26134b889bed08d8c7d1c97b%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637479051163036306%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=1aFKW1kKHJv7lw0kgsrnOdWSLKs3Th1y%2BEz3fOvg%2BQ4%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHANDYQ3NXCLYN6BNNMZXQ3S5CCVPANCNFSM4W7SZZ3A&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Ce92d8bda26134b889bed08d8c7d1c97b%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637479051163046302%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=hbHt2zjWktSVCvqHo4oYAOYqe6wKwXxGx8TrX0%2BDyio%3D&reserved=0.

katiearacena commented 3 years ago

Here they are: chr4_sub.WGBS_EU_NI.bedGraph.txt chr4_sub.WGBS_EU_Flu.bedGraph.txt Thank you for your help!

harbourlab commented 3 years ago

Hi! No problem. This is actually a very interesting case. This is a problem of the data type. Methylation data is basically a number of sharp peaks, every one only one base pair long. If you plot a large region like 100 kb, then you have way more data points than pixels in your plot. There are different ways of dealing with that, and SparK will calculate the average methylation per pixel of the plot. Say you have to squeeze 10 data points into one pixel of the plot, and you only have one methylation site with a value of 1.0, then you will end up with an average value of 0.1. This is what you are experiencing here. Your data is there, just with lower values. However, I would actually consider this the better way of plotting things! See the attached example. I have plotted a larger region with IGV, you can see that the data ranges to 1.0, which is what you were expecting. However, the data is so highly clustered that it really doesn't tell you anything at all. See for instance the red square (see 1). If we zoom in once, we suddenly see that what looked as one massive block of methylation is actually two distinct sites (see 2), that appear to have the same amount of methylated sites. If we zoom in again, we notice that those two sites are not the same at all, but in fact very different in amount of methylation that is present (see 3). So even zooming in once doesn't give us meaningful data. Now look at the SparK plot as a comparison. While the values don't go up to 1, SparK plots meaningful data, even with the most zoomed out version. You can make out the peaks, and immediately see that this region contains a massive peak, aka accumulation of individual methylation sites, which you can not see with the IGV plot. (I would love to use this picture as an example on the SparK page, is that ok?)

Screen Shot 2021-02-03 at 9 56 41 PM
katiearacena commented 3 years ago

Hi, thanks for looking into what is happening - very interesting! Yes, you may use the plot on the page. Thanks for the help.

harbourlab commented 3 years ago

Thanks! Let me know if you need more help! :)

ifantasy commented 2 years ago

This function of Spark is meaning for representing the type of data, especially for regions with intensive or clustered points. However, this function can not show real distribution for our data. We have another type of methylation. It occurs sparsely for some region, so it is different from WGBS. See the example below:

image

The raw bedgraph data related to the eight points is listed as follows: chr4 55527803 55527804 100 chr4 55527872 55527873 97.6374 chr4 55528238 55528239 88.1444 chr4 55530128 55530129 85.4848 chr4 55530492 55530493 93.8322 chr4 55530552 55530553 77.9618 chr4 55530798 55530799 88.2967 chr4 55532045 55532046 90.9048

You can see that the 1st and 2nd point have significantly distinct height, but 2nd and 7th points have similar height. Also think about the 8th point. So, I wonder if there is a way just to show the raw data?

My command is python $(which SparK.py) -pr chr4:55527304-55532546 -cf test.bedgraph -ps all -gtf gencode.vM25.annotation.gtf -o test

harbourlab commented 2 years ago

Hi!

Yes, this is indeed a problem, as your data is only 1 bp wide, I will look into it.

Cheers


Stefan Kurtenbach, Ph.D. Research Assistant Professor Department of Ophthalmology, Bascom Palmer Eye Institute University of Miami Miller School of Medicine Sylvester Comprehensive Cancer Center Director, ISCI Pluripotent Stem Cell Laboratory

Biomedical Research Building Room 913 1501 NW 10th Avenue, Miami, FL 33136 305-243-3892 office, 786-252-4379 cell @.**@.>

On Jan 27, 2022, at 2:53 PM, ifantasy @.**@.>> wrote:

CAUTION: This email originated from outside the organization. DO NOT CLICK ON LINKS or OPEN ATTACHMENTS unless you know and trust the sender.

This function of Spark is meaning for representing the type of data, especially for regions with intensive or clustered points. However, this function can not show real distribution for our data. We have another type of methylation. It occurs sparsely for some region, so it is different from WGBS. See the example below:

[image]https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F8848913%2F151431500-b0bd59ac-9557-4e1c-a4f9-6933a7c3d44b.png&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Cca0eba1ac8fe43bc9d9208d9e1cebe7f%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637789100371621061%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=kbMh1Q4llA%2FiMTZjj41SL3%2Byiv0s5B2mcWzv6kGIroc%3D&reserved=0

The raw bedgraph data related to the eight points is listed as follows: chr4 55527803 55527804 100 chr4 55527872 55527873 97.6374 chr4 55528238 55528239 88.1444 chr4 55530128 55530129 85.4848 chr4 55530492 55530493 93.8322 chr4 55530552 55530553 77.9618 chr4 55530798 55530799 88.2967 chr4 55532045 55532046 90.9048

You can see that the 1st and 2nd point have significantly distinct height, but 2nd and 7th points have similar height. Also think about the 8th point. So, I wonder if there is a way just to show the raw data?

My command is python $(which SparK.py) -pr chr4:55527304-55532546 -cf test.bedgraph -ps all -gtf gencode.vM25.annotation.gtf -o test

— Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fharbourlab%2FSparK%2Fissues%2F15%23issuecomment-1023585157&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Cca0eba1ac8fe43bc9d9208d9e1cebe7f%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637789100371621061%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=57Uf%2BcH8x5GDekwnnW55CjLXzZoIYJoye8mqyiwH5D0%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHANDYWDVDBAQVLBPDJ7MNDUYGPE7ANCNFSM4W7SZZ3A&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Cca0eba1ac8fe43bc9d9208d9e1cebe7f%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637789100371621061%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=oww8wOaeTsyK%2F4o8KonGqd7mTgRULO%2FJqEI7e79WgvA%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Cca0eba1ac8fe43bc9d9208d9e1cebe7f%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637789100371621061%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=1pMQo6zCz2dqjHKIxX9azQJFcbdRNF881Iaw40Yd9og%3D&reserved=0 or Androidhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7CStefan.Kurtenbach%40med.miami.edu%7Cca0eba1ac8fe43bc9d9208d9e1cebe7f%7C2a144b72f23942d48c0e6f0f17c48e33%7C0%7C0%7C637789100371621061%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=k%2FjqIToF1m1d439XnJHgsrTRYGCVvgiNzunZw%2BUmsMQ%3D&reserved=0. You are receiving this because you modified the open/close state.Message ID: @.***>