skrakau / PureCLIP

Capturing protein-RNA interaction footprints from single-nucleotide CLIP-seq data
GNU General Public License v3.0

Results file description #9

Closed by anmej 4 years ago

anmej commented 5 years ago

Hello. Can you please explain what the data in the BED results file mean? In a line like this:

chr20 73636 73637 3 11.54 + [score_CL=25.1365;score_E=11.54;score_B=36.6765;score_UC=11.54]

What is the meaning of column 5 (11.54) and the numbers between the square brackets ([score_CL=25.1365;score_E=11.54;score_B=36.6765;score_UC=11.54])?

Thank you.

skrakau commented 4 years ago

Hi,

Column 5 holds the default score at position t, i.e. the log posterior probability ratio of the first and second most likely states, given the observed data D (fragment densities and read start counts):

score_UC (unconditional) = ln(P(S_t = enriched + crosslink | D=d) / P(S_t = 2nd most likely state | D=d))

Column 7 (the part in square brackets) lists score_UC again together with three alternative scores:

score_E (enrichment focused) = ln(P(S_t = enriched + crosslink | D=d) / P(S_t = non-enriched + crosslink | D=d))

score_CL (crosslink focused) = ln(P(S_t = enriched + crosslink | D=d) / P(S_t = enriched + non-crosslink | D=d))

score_B (balanced) = ln(P(S_t = enriched | D=d) / P(S_t = non-enriched | D=d)) + ln(P(S_t = crosslink | D=d) / P(S_t = non-crosslink | D=d))
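
To make the four definitions concrete, here is a minimal Python sketch that evaluates them for one position. It is not PureCLIP code. It assumes the hidden states are the four combinations of enriched/non-enriched and crosslink/non-crosslink, that enriched + crosslink is the most likely state at the position, and it uses made-up posterior probabilities. The function name `scores_from_posteriors` is hypothetical.

```python
import math

def scores_from_posteriors(post):
    """Evaluate the four scores for a single position.

    post maps (enriched, crosslink) -> posterior probability P(S_t = state | D=d).
    Assumption: the states are the four enriched/crosslink combinations and
    (True, True), i.e. enriched + crosslink, is the most likely one.
    """
    target = post[(True, True)]
    # Second most likely state: the largest posterior among the other states.
    second = max(p for state, p in post.items() if state != (True, True))

    # Marginal probabilities needed for the balanced score.
    p_enr = post[(True, True)] + post[(True, False)]
    p_non_enr = post[(False, True)] + post[(False, False)]
    p_cl = post[(True, True)] + post[(False, True)]
    p_non_cl = post[(True, False)] + post[(False, False)]

    return {
        "score_UC": math.log(target / second),
        "score_E": math.log(target / post[(False, True)]),
        "score_CL": math.log(target / post[(True, False)]),
        "score_B": math.log(p_enr / p_non_enr) + math.log(p_cl / p_non_cl),
    }

# Illustrative posteriors only, not taken from a real PureCLIP run.
example = {
    (True, True): 0.90,
    (True, False): 0.05,
    (False, True): 0.04,
    (False, False): 0.01,
}
print(scores_from_posteriors(example))
```

With these illustrative numbers the second most likely state is enriched + non-crosslink, so score_UC coincides with score_CL.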

Which score works best depends on the binding characteristics of the protein. If PureCLIP is run in a mode that corrects for biases (input control and CL-motifs) and nothing is known in advance about the binding characteristics, the balanced score (score_B) might be a good choice. Without bias correction, I would use the default score (score_UC, column 5).
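
If you want to pick one of these scores programmatically, a small parser along the following lines should do. This is only a sketch based on the example line from the question: the function name `parse_pureclip_bed_line` and the dictionary keys are my own choices, and column 4 is simply kept as a string because its meaning is not discussed in this thread.

```python
def parse_pureclip_bed_line(line):
    """Split one results line into its BED fields and the bracketed scores."""
    fields = line.split()  # works for tab- or space-separated columns
    chrom, start, end, col4, score, strand = fields[:6]

    # Column 7 looks like "[score_CL=25.1365;score_E=11.54;...]".
    extra = fields[6].strip("[]")
    scores = {}
    for item in extra.split(";"):
        key, value = item.split("=")
        scores[key] = float(value)

    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "col4": col4,            # meaning not covered here, kept as-is
        "score": float(score),   # column 5, the default score (score_UC)
        "strand": strand,
        "scores": scores,        # all scores from column 7
    }

line = ("chr20 73636 73637 3 11.54 + "
        "[score_CL=25.1365;score_E=11.54;score_B=36.6765;score_UC=11.54]")
record = parse_pureclip_bed_line(line)
print(record["score"], record["scores"]["score_B"])
```

In the example line, column 5 equals score_UC, as expected. score_UC also equals score_E there, which by the definitions above suggests that the second most likely state at this position is non-enriched + crosslink.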