liu-bioinfo-lab / EPCOT

What is "label embedding"? #4

Open zhichunlizzx opened 4 months ago

zhichunlizzx commented 4 months ago

Hello, after reading your article I still have some questions about what "label embedding" is. The paper says little about it. Is it another kind of sequencing data besides DNase-seq or ATAC-seq?

zhichunlizzx commented 4 months ago

The "label embedding" in Figure 1b is shown as CTCF, RAD21, or histone modification ChIP-seq. Is this data derived from a publicly available dataset or from the model's previous predictions?

zzh24zzh commented 4 months ago

The "label embedding" in Figure 1b is shown as CTCF, RAD21, or histone modification ChIP-seq. Is this data derived from a publicly available dataset or from the model's previous predictions?

Hello,

The label embeddings in our model are initialized as random parameters. These embeddings are designed to undergo updates during the training process. To understand this better, you can refer to the following line of code:

self.query_embed = nn.Embedding(num_class, hidden_dim)

In this context, num_class represents the total number of epigenomic features we aim to predict. The order of epigenomic features, such as CTCF, RAD21, etc., in the figure denotes their respective indices within the embedding list.
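As a rough illustration of what "randomly initialized and updated during training" means, here is a dependency-free sketch in plain Python that stands in for the embedding table. It is not the EPCOT code: the feature list, HIDDEN_DIM, lookup, and sgd_update are hypothetical names chosen for the example, and the gradient is made up.

```python
import random

# Hypothetical feature order; per the reply above, a feature's position
# in this order is its index into the embedding table.
FEATURES = ["CTCF", "RAD21", "H3K4me3"]
HIDDEN_DIM = 4  # hypothetical embedding size

rng = random.Random(0)
# One random vector per feature. torch.nn.Embedding likewise draws its
# initial weights from a standard normal distribution.
label_embed = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
               for _ in FEATURES]


def lookup(feature):
    """Return the embedding row for a feature, selected by its index."""
    return label_embed[FEATURES.index(feature)]


def sgd_update(feature, grad, lr=0.1):
    """Apply one gradient step to a single row, as an optimizer would."""
    row = lookup(feature)
    for k in range(HIDDEN_DIM):
        row[k] -= lr * grad[k]


before = list(lookup("RAD21"))
sgd_update("RAD21", grad=[1.0, 0.0, 0.0, 0.0])  # made-up gradient
after = lookup("RAD21")
```

The point of the sketch is that no ChIP-seq data enters the table: each row is a set of parameters tied to one output feature, and only training changes its values.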

I hope this answers your questions.

zhichunlizzx commented 4 months ago

Thanks for your reply. I understand it much better now.

zhichunlizzx commented 3 months ago

Hi, how are the ROC curve and AUC of two bigWig signals evaluated in this paper? Are the two bigWig signals passed directly as input to sklearn.metrics.roc_curve?

zzh24zzh commented 3 months ago

Hi, how are the ROC curve and AUC of two bigWig signals evaluated in this paper? Are the two bigWig signals passed directly as input to sklearn.metrics.roc_curve?

Sorry, I didn’t fully understand your question. Which figure in the manuscript are you referring to? If you are evaluating how well the predicted signals capture ChIP-seq peaks, we use the predicted signals and the binary peak data as inputs.
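To make the "predicted signals plus binary peak data" setup concrete, here is a minimal plain-Python sketch of an AUC computed from per-bin peak labels and per-bin predicted signal. The data and the roc_auc helper are hypothetical and this is not the evaluation script from the paper. sklearn.metrics.roc_curve takes the same two kinds of input: binary labels as y_true and continuous scores as y_score.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a peak bin outscores a non-peak bin.

    Ties between a peak bin and a non-peak bin count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one peak bin and one non-peak bin")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: one entry per genomic bin.
peak_labels = [0, 0, 1, 1, 0, 1, 0, 0]  # 1 = bin overlaps a called peak
predicted_signal = [0.1, 0.4, 2.3, 1.8, 0.2, 0.3, 0.05, 0.9]

auc = roc_auc(peak_labels, predicted_signal)
```

The pairwise loop is quadratic in the number of bins, so it only suits small examples. For genome-scale data a rank-based implementation such as the one in scikit-learn would be the practical choice.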

zhichunlizzx commented 3 months ago

like Figure 5A, B, C, E, F

zhichunlizzx commented 3 months ago

I have a question about the evaluation metric "mse1imp" used in Figure 2B: the description of "mse1imp" says that the top 1% of positions in the predicted data are evaluated. Were EPCOT or Avocado predictions used in determining these positions?

zzh24zzh commented 3 months ago

like Figure 5A, B, C, E, F

In the enhancer activity prediction task, we predict the binary STARR-seq peaks instead of the signals, so the model outputs the probability of a peak.

zzh24zzh commented 3 months ago

I have a question about the evaluation metric "mse1imp" used in Figure 2B: the description of "mse1imp" says that the top 1% of positions in the predicted data are evaluated. Were EPCOT or Avocado predictions used in determining these positions?

The genomic positions used in the 'mse1imp' evaluation metric are determined by the predicted signals. This metric is defined in the paper at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02915-y.
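As a sketch of that idea, the snippet below ranks positions by the predicted signal, keeps the top 1%, and computes the mean squared error against the observed signal at only those positions. The function name, the floor-based rounding of the 1% cutoff, and the toy data are assumptions for illustration. The linked paper remains the authoritative definition.

```python
def mse_top_predicted(observed, predicted, fraction=0.01):
    """MSE restricted to the positions with the highest predicted signal."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    # Assumption: round the cutoff down, but always keep at least one position.
    n_top = max(1, int(fraction * len(predicted)))
    # Indices of the n_top largest predicted values.
    top = sorted(range(len(predicted)),
                 key=lambda i: predicted[i], reverse=True)[:n_top]
    return sum((observed[i] - predicted[i]) ** 2 for i in top) / n_top


# Toy data: 200 positions, so the top 1% is the two highest predictions.
predicted = [float(i) for i in range(200)]
observed = list(predicted)
observed[198] = 196.0  # error of 2 at one of the two top-ranked positions
observed[0] = 100.0    # large error outside the top 1%, so it is ignored

score = mse_top_predicted(observed, predicted)
```

Because the positions are chosen from the predictions, each model is scored on its own set of top-ranked positions.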

zhichunlizzx commented 3 months ago

Thanks for your reply. I found the calculation method for this evaluation metric.