orca233 / DeepSS2GO


q #2

Open JNU-luyi opened 4 weeks ago

JNU-luyi commented 4 weeks ago

I followed your method to generate secondary structure sequences for proteins in cafa3 and then performed one-hot encoding on them to create features. However, the prediction performance is not satisfactory. Do you have any insights into what might be causing this?

orca233 commented 3 weeks ago

The best performance comes from combining three components: ss8 + aa + Diamond. Could you provide more details so I can help?

JNU-luyi commented 3 weeks ago

I am trying to replicate the results from your paper, where you compared your model with others on the CAFA3 benchmark. The general pipeline is: first predict the secondary structure with a deep learning model, then one-hot encode both the amino-acid sequence and the secondary structure, and finally combine the predictions from these two with the sequence-alignment (Diamond) results via weighting. Following the code you provided, I generated secondary structures for the CAFA3 sequences. After one-hot encoding them, I found the performance to be unsatisfactory, with an MF fmax of only 0.45. I am not sure what the reason is, and I wonder if you could make the CAFA3 secondary-structure data public.
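For reference, the one-hot encoding step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the 8-state alphabet ordering and the 1024 length cap are assumptions based on this thread, and may differ from the paper's exact settings.

```python
import numpy as np

# Hypothetical DSSP 8-state (ss8) alphabet; the ordering used in
# DeepSS2GO may differ from this.
SS8 = "HGIEBTSC"
SS8_IDX = {c: i for i, c in enumerate(SS8)}

def one_hot_ss8(seq: str, max_len: int = 1024) -> np.ndarray:
    """One-hot encode an ss8 string into a (max_len, 8) matrix.

    Positions beyond len(seq) stay all-zero (padding); characters
    outside the alphabet are also left as zeros.
    """
    mat = np.zeros((max_len, len(SS8)), dtype=np.float32)
    for i, c in enumerate(seq[:max_len]):
        j = SS8_IDX.get(c)
        if j is not None:
            mat[i, j] = 1.0
    return mat

x = one_hot_ss8("HHHEECCT")  # shape (1024, 8), eight ones total
```

The same pattern applies to the amino-acid (aa) sequence with a 20-letter alphabet.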

orca233 commented 3 weeks ago

CAFA3 Secondary structure: https://huggingface.co/orca233/DeepSS2GO/tree/main/CAFA3_data

JNU-luyi commented 3 weeks ago

I also followed the code you provided, used the CAFA3 data to generate the secondary structure, and then performed one-hot encoding, but the results were very poor. I am puzzled by this.

orca233 commented 3 weeks ago

You need to modify the script a bit for CAFA3 data preparation. After step 2, you will get CAFA3_train/test_data_clean_aa.pkl. Then carry out steps 3-8 for train and test separately, and you will get the secondary-structure files shown in the link from my last message.

JNU-luyi commented 3 weeks ago

Here is how I proceeded: I just checked the CAFA3 ss8 data you sent me, and it is basically the same as what I generated myself. The only difference is that you chose to delete sequences longer than 1024 residues, while I chose to truncate them, so my training set has a few more samples than yours. I don't think these differences would greatly affect the results. In your training runs on the CAFA3 dataset, what was the approximate fmax when using only ss8? With ss8 alone, my fmax is only around 0.45.
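To make the difference between the two preprocessing choices concrete, here is a minimal sketch. The 1024 cutoff comes from this thread; the function names and record layout are mine, purely for illustration.

```python
def drop_long(records, max_len=1024):
    """Deletion strategy: discard any sequence longer than max_len."""
    return [(pid, seq) for pid, seq in records if len(seq) <= max_len]

def truncate_long(records, max_len=1024):
    """Truncation strategy: keep every sequence, cut at max_len."""
    return [(pid, seq[:max_len]) for pid, seq in records]

data = [("P1", "H" * 500), ("P2", "E" * 2000)]
# drop_long keeps only P1; truncate_long keeps both but shortens P2,
# which is why the truncated training set has a few more samples.
```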

orca233 commented 3 weeks ago

Using the ss8 feature alone may not perform well, as shown in Table 3 of the paper. You can try different combinations (aa/ss8/Diamond).

JNU-luyi commented 3 weeks ago

I know that ss8 alone will definitely not match the combination of aa/ss8/Diamond, but the performance is still too poor, with the MF fmax barely at 0.45. My point is that ss8 alone should not perform this badly. Do you remember how ss8 alone performed on the CAFA3 data in your experiments? The ablation study you mentioned in Table 3 was probably not conducted on the CAFA3 dataset.
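For anyone comparing fmax numbers across setups, the protein-centric Fmax used in CAFA-style evaluation can be sketched as below. This is a simplified version for illustration only: the official CAFA assessment tool additionally handles GO-term propagation, taxon splits, and partial annotations.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=np.linspace(0.01, 1.0, 100)):
    """Simplified protein-centric Fmax.

    y_true:  (n_proteins, n_terms) binary ground-truth matrix.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1].
    """
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & (y_true == 1)).sum(axis=1)
        has_pred = pred.sum(axis=1) > 0
        if not has_pred.any():
            continue
        # Precision is averaged only over proteins with >= 1 prediction.
        prec = (tp[has_pred] / pred.sum(axis=1)[has_pred]).mean()
        # Recall is averaged over all annotated proteins.
        rec = (tp / y_true.sum(axis=1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```

With a perfectly separating score matrix this returns 1.0; differences in threshold grids or averaging conventions can shift reported fmax values slightly between implementations.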