rr1859 / R.4Cker

MIT License
16 stars 15 forks source link

About step 3. Removing the self-ligated and undigested fragments and a suggestion to pipelines #40

Open rvidal404 opened 6 years ago

rvidal404 commented 6 years ago

Dear Raviram,

I have a doubt about this step 3. Even removing the self-ligated and undigested fragments (using the command grep -i -B 1 "CTTCTATCTGCAGAGA" mm10_hindiii_flanking_sequences_25_unique_2.fa -but, replacing for my reduced genome and my primer) I still have a huge enrichment in the region adjacent to the bait in all samples. Therefore my question is do we need to get the coordinates just for one primer indeed (the sequencing one) or for both primers for each bait and remove both from our samples?

Also, I suggest to include a 3' removal step to exclude any eventual restriction site (for the 2nd enzyme) that may appear in the reads and decrease mapping efficacy.

Thank your for developing this amazing tool.

Vidal

rr1859 commented 6 years ago

Hi, that command actually only gives you the bait coordinate. This will give you the coordinates for the self-ligated and undigested fragments - grep -E -A 1 -B 1 'chr1.100025183.100025208' mm10_hindiii_flanking_sites_25_unique_2.bed

rvidal404 commented 6 years ago

Hi, Thank you for your prompt reply I have followed all steps demonstrated in this third step and I still have a huge enrichment close to the bait.

Changing subjects a little bit, I have noticed that when the function differentialAnalysis is called the output has a logFC - however, it is not possible to identify anymore which interactions those logFC belong to. I looked into your source code, but I couldn't manage to keep a track of interaction region (start and end or median positions). Would be possible to do this? If yes, can you just point where I should change the script to have the desired output (logFC linked to start and end of each interaction).

this the head of my nb_results: norm_counts_avg.Coord norm_counts_avg.Count norm_counts_avg.Condition window_counts.chr window_counts.start window_counts.end window_counts.distance window_counts.sample_1 window_counts.sample_2 1 18381104 5.57304436815389 condx chrx 15378333 15383875 998588 4 1 2 18383874.5 5.57304436815389 condx chrx 15381027 15386722 995817.5 4 1 3 18386722 16.7191331044617 condx chrx 15383798 15389646 992970 12 1 4 18389646 25.0786996566925 condx chrx 15386647 15392645 990046 18 136 5 18392644.5 1.39326109203847 condx chrx 15389577 15395712 987047.5 1 189 6 18395712 1.39326109203847 condx chrx 15392585 15398839 983980 1 27 7 18398839.5 2.78652218407694 condx chrx 15395666 15402013 980852.5 2 1 8 18402013 5.57304436815389 condx chrx 15398810 15405216 977679 4 1 9 18405216 12.5393498283463 condx chrx 15402004 15408428 974476 9 1 10 18408428 11.1460887363078 condx chrx 15405227 15411629 971264 8 1

this the head of my res_df (output of differentialAnalysis): baseMean log2FoldChange lfcSE stat pvalue padj 1 3.18198051533946 2.99997307721916 2.60644985159456 1.15098054749983 0.249740238369143 0.651206492827546 2 3.18198051533946 2.99997307721916 2.60644985159456 1.15098054749983 0.249740238369143 0.651206492827546 3 8.83883476483184 4.58491797794814 2.70290059341347 1.69629544982929 0.0898299417713957 0.507773636194102 4 60.8111831820431 -1.91753917117779 1.62121247101045 -1.18278091580597 0.236895985894916 0.651206492827546 5 67.5286976033153 -6.56220434109035 2.47287855319178 -2.65367028745525 0.00796215843389775 0.453459612242359 6 10.2530483272049 -3.75486433276936 2.5987441478122 -1.44487649387511 0.148492579962369 0.578338913383225 7 1.76776695296637 1.99998302436278 2.54093854962406 0.787104050453595 0.431220958489546 0.784988939822331 8 3.18198051533946 2.99997307721916 2.60644985159456 1.15098054749983 0.249740238369143 0.651206492827546 9 6.7175144212722 4.16988470653317 2.69730047893175 1.54594741635334 0.122117267750995 0.554177637092992 10 6.01040764008565 3.99996157414918 2.69012179887044 1.48690723811417 0.137039312315054 0.575922360323174

And my desired output would be "res_df" with 3 more columns :
c(nb_results$window_counts.chr, nb_results$window_counts.start, nb_results$window_counts.end )

rr1859 commented 6 years ago

Hi, I think you should visualize your data on IGV or some other browser first instead of looking at the R plot. Its quite normal to see a strong enrichment around the bait region in 4C-Seq even after removing the self-ligated and undigested fragments.

I altered line #84 in the differentalAnalysis.R script to now return the chr, strart,end as well. Please let me know if this is what you wanted.

rvidal404 commented 6 years ago

Hi, Indeed it was close to the bait - but it wasn't self-ligation. Thanks for that.

About the differentalAnalysis.R script, it is giving me the desired output now. Many thanks! However, it is not clear in this new script if the rows in res_df are lower to the pval that we specify when calling differentalAnalysis. There is a pval column where we are able to find pval higher than the pval specified in differentalAnalysis.

In the res_df below you can see that some values are higher than 0.05 -that was the pval that I set when I called the differentalAnalysis

chr start end baseMean log2FoldChange lfcSE stat pvalue padj
chr9 18378932 18381614 4.5 2.99996962304221 2.71245369909174 1.10599846332741 0.268727184493565 0.651256984484406
chr9 18380276 18382952 27 -2.52355767387697 2.12001193717084 -1.1903506907818 0.233908584951238 0.613841026403827
chr9 18381617 18384288 49.5 -6.61466141478202 2.70579269418708 -2.44462978593758 0.0145000899115541 0.422163869802272
chr9 18382955 18385621 88.5 -0.146844596276023 1.02445465495146 -0.143339283555679 0.886022232350914 1
chr9 18384291 18386951 64 -0.689074163686413 1.23183790916876 -0.559387041555976 0.575897603228711 0.882823156151917
chr9 18385624 18388279 18 -5.12923542319818 3.04608452861375 -1.68387822958165 0.0922052044895847 0.479292672568496
chr9 18386954 18389603 52 2.25632558551901 1.62603856108628 1.38762120377493 0.165252421292831 0.541247190858857
chr9 18388282 18390924 59.5 1.38186201662667 1.3794369638109 1.00175800190903 0.316460485698894 0.689203522178066
chr9 18389606 18392242 15.5 -4.90684440254354 3.06817782873561 -1.59926988474643 0.109760648013646 0.486720839969786
chr9 18390928 18393557 7.5 -3.8073212186489 2.97523186066119 -1.27967210521966 0.200660479019246 0.574165811386597
chr9 18392246 18394868 30.5 -5.90684152062457 2.88234602527432 -2.04931728141919 0.0404310998848282 0.456525381861643
chr9 18393561 18396176 115 -3.24100607993002 1.47766177154572 -2.19333418671293 0.0282833102630962 0.438972288261129
chr9 18394872 18397480 93.5 -0.807357493420376 1.06392451668188 -0.758848471636246 0.447943206111298 0.791308231807507
chr9 18396180 18398779 56.5 -0.437409210001731 1.28123725632261 -0.341395949769034 0.732805523776201 0.978464612415184
chr9 18397484 18400074 33.5 -0.935873588063228 1.69909634648425 -0.550806662611998 0.581766216382134 0.887343289903295
chr9 18398784 18401365 14.5 1.39229919402981 2.27532015077835 0.611913533817879 0.540594966863611 0.862396441871187
chr9 18400079 18402651 19 -1.68805574844929 2.19243916126686 -0.769944169157185 0.441333011667457 0.787418127793151
chr9 18401370 18403932 28 5.78129197677393 2.91410397857736 1.9839003753038 0.0472669470873451 0.456525381861643
chr9 18402656 18405207 53.5 0.0809149840911005 1.2994292098878 0.0622696361413077 0.950348108385323 1
chr9 18403937 18406477 59.5 0.365644236082994 1.24444979146833 0.293819998677143 0.768895452520913 0.997049102762333
chr9 18405213 18407741 27.5 5.7548197367131 2.92079140121269 1.9702946723014 0.0488046086826223 0.456525381861643
chr9 18406483 18408998 23.5 -5.52351303875995 2.97589092817281 -1.85608719273572 0.0634411309417172 0.467818796361246

differentalAnalysis specifications:

res_df = differentialAnalysis(obj=my_obj,
                              norm_counts_avg=nb_results$norm_counts_avg,
                              windows=nb_results$window_counts,
                              conditions=c("1","2"),
                              region="nearbait",
                              coordinates=NULL,
                              pval=0.05)

What the pval in the res_df is referring to then (as there are pval > 0.05)? Thanks,

rr1859 commented 6 years ago

It the stats for all the windows used in the comparison. You will have to filter this for the p < 0.05 or alter accordingly.