Format of clip_data from TSV

JSLJ23 commented 2 years ago

Hi Zhang lab,

I would like to better understand the format of the TSV CLIP data available at https://zhanglabnet.oss-cn-beijing.aliyuncs.com/prismnet/data/clip_data.tgz. The first column seems to be the Ensembl ID, followed by the nucleotide sequence. The next column is slightly ambiguous, so I am guessing it is the icSHAPE data? Two more columns follow, each containing a single number. May I know what these last two columns are? Also, where is the eCLIP binding data actually stored? Is it a per-nucleotide binding score or an overall binary binding label of either 1 or 0?

Thank you.

kuixu commented 2 years ago

Yes, you are right. These two columns are the binding score (from POSTAR) and the binding label (1 or 0).
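
For readers who want to load the file, here is a minimal sketch of parsing rows in the layout discussed above (ID, sequence, icSHAPE values, binding score, binding label). It assumes exactly five tab-separated columns and a comma-separated icSHAPE field; both are assumptions that should be checked against the actual files in clip_data.tgz. The sample rows and the name `parse_clip_rows` are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample in the five-column layout described in this
# thread. The IDs and values are made up, and the comma-separated icSHAPE
# encoding is an assumption, not something confirmed by the authors.
SAMPLE = (
    "ENST00000000001\tACGUA\t0.1,0.5,0.3,0.9,0.0\t0.82\t1\n"
    "ENST00000000002\tGGCAU\t0.4,0.2,0.3,0.2,0.7\t0.10\t0\n"
)


def parse_clip_rows(handle):
    """Yield one dict per TSV row: id, sequence, icSHAPE list, score, label."""
    for row in csv.reader(handle, delimiter="\t"):
        transcript_id, seq, icshape, score, label = row
        yield {
            "id": transcript_id,
            "seq": seq,
            "icshape": [float(x) for x in icshape.split(",")],
            "score": float(score),  # binding score (from POSTAR)
            "label": int(label),    # binding label, 1 or 0
        }


records = list(parse_clip_rows(io.StringIO(SAMPLE)))
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper. If the real files have a different number of columns, the unpacking line raises a `ValueError`, which is a quick way to notice that the assumed layout is wrong.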

JSLJ23 commented 2 years ago

Thank you for getting back to me so promptly. So, from what I understand, the binding score is aggregated over the entire 100 bp sequence rather than any specific region? If so, how was the analysis in Figure 4, which depicts region-specific binding, performed? Are the scores in those diagrams binding probabilities for the entire sequence? Also, the data-generation shell script has an option to generate the data in a binary or non-binary manner. For most of the paper the scores are continuous, so do I need to change the option to is_bin == "0" to reproduce the results in the study?

kuixu commented 2 years ago

That's right! You could change the option to is_bin = "0" to test the continuous case.

The binding regions collected from the POSTAR database are mostly ~30 bp long (each with a binding score); we extend them to 101 bp in this study. Below I copy the binding site data collection and processing strategy directly from the PrismNet paper.


RBP binding site data collection and processing
RBP binding sites from CLIP-seq were collected from POSTAR and ENCODE (eCLIP). In total, we collected 269 CLIP-seq datasets for 56 RBPs from POSTAR as well as 392 eCLIP datasets for 134 RBPs from ENCODE. To ensure that the CLIP datasets used in our study are of high quality and consistent, we downloaded the binding sites from the ENCODE project and a published database (POSTAR), in which the binding sites have been generated using a uniform pipeline; that is, we did not use the binding site data from the original publications (as called by different labs using various tools, pipelines, and parameters).
For an RBP with CLIP experiments in different cell lines, we constructed a PrismNet model for each cell line separately. For an RBP with multiple CLIP experiments in one cell line, we only chose the one experiment of the highest quality (first filtered by the number of experiment replicates, then ranked by average sequencing depth among replicates). For any CLIP experiment with more than one replicate, we combined the overlapping binding sites from all replicates. Specifically, we performed replicate normalizations, summing up and then merging them to use all the information of each replicate. The scores of each peak were normalized to [0, 1]. Overlapping binding peaks (by at least 1 nt) were then merged with the summed peak signals to yield a single peak.
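
The merging step described in the paragraph above can be sketched roughly as follows. This is only an illustration, not the pipeline used in the paper: it assumes half-open [start, end) coordinates, peaks that all lie on the same chromosome and strand, and signals that have already been normalized. The function name `merge_peaks` is hypothetical.

```python
def merge_peaks(peaks):
    """Merge peaks that overlap by at least 1 nt and sum their signals.

    peaks: iterable of (start, end, signal) tuples with half-open
    coordinates [start, end), all on one chromosome and strand
    (assumptions of this sketch).
    """
    merged = []
    for start, end, signal in sorted(peaks):
        # With half-open coordinates, start < previous end means the two
        # peaks share at least one nucleotide.
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_signal = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_signal + signal)
        else:
            merged.append((start, end, signal))
    return merged


example = [(100, 130, 0.25), (120, 150, 0.5), (150, 180, 0.125)]
merged = merge_peaks(example)
```

In this example the first two peaks overlap and collapse into one peak with the summed signal, while the third only touches the merged peak at its boundary and therefore stays separate.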

We defined each resulting peak as a binding site. The length of a binding site was unified to 101 nt, where a region shorter than 101 nt was expanded from the middle to both sides and a longer region was cut from both sides to the middle. Finally, the top 5000 binding sites with the highest signals and at least 40% icSHAPE score coverage were kept for the training and testing of PrismNet, as positive samples.
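
As a rough sketch of the length unification and filtering described above: both expanding and trimming amount to taking a 101 nt window centered on the midpoint of the peak. The code below assumes half-open coordinates, ignores chromosome boundaries, represents a missing icSHAPE value as `None`, and applies the coverage filter before the top-N cut. All of these are assumptions, and the function names are hypothetical rather than taken from the PrismNet code.

```python
TARGET_LEN = 101


def resize_site(start, end, target_len=TARGET_LEN):
    """Return a [start, end) window of target_len centered on the midpoint.

    Shorter regions are expanded and longer ones trimmed, symmetrically
    around the middle. Chromosome boundaries are not checked here.
    """
    mid = (start + end) // 2
    new_start = mid - target_len // 2
    return new_start, new_start + target_len


def icshape_coverage(values):
    """Fraction of positions with an icSHAPE value (None marks a missing one)."""
    return sum(v is not None for v in values) / len(values)


def select_top_sites(sites, top_n=5000, min_coverage=0.4):
    """Keep sites with enough icSHAPE coverage, then take the top_n by signal."""
    kept = [s for s in sites if icshape_coverage(s["icshape"]) >= min_coverage]
    return sorted(kept, key=lambda s: s["signal"], reverse=True)[:top_n]


short_site = resize_site(100, 130)    # 30 nt peak expanded to 101 nt
long_site = resize_site(1000, 1300)   # 300 nt peak trimmed to 101 nt
```

Because the window is recomputed from the midpoint, the same function handles both the ~30 bp POSTAR regions mentioned earlier and peaks longer than 101 nt.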

JSLJ23 commented 2 years ago

I see, thank you for the really detailed explanation; it makes much more sense to me now. One final question: the study also showed that PrismNet trained on IGF2BP1 binding in K562 CLIP data was able to generalize and accurately predict peaks on the EIF3F transcript in HepG2 cells. From a broader perspective, is this model able to generalize and accurately predict binding in other, more distant cell lines such as SHSY5Y or RD, given icSHAPE data experimentally obtained in those cell lines? My rationale for asking is to get a better understanding of how far PrismNet can generalize the learned relationships between RBP binding and RNA structure, and to what extent inferences can be made. I am not really thinking of a clear-cut objective threshold, just a general sense...

kuixu / PrismNet
