Predicted .ct output file is in an incorrect format for few sequences

ml4bio / e2efold

pytorch implementation for "RNA Secondary Structure Prediction By Learning Unrolled Algorithms"

MIT License


Predicted .ct output file is in an incorrect format for few sequences #7

Closed: jaswindersingh2 closed this issue 4 years ago

jaswindersingh2 commented 4 years ago

Hi,

I was using E2Efold (productive) for prediction, and it works well except for a few sequences where the predicted output in the .ct file is not in the correct format. In the predictions for these sequences, some bases are paired with more than one nucleotide. For example, in the attached 5ddp_A.ct file, nucleotide 48 is paired with nucleotides 11, 12, 20, and 21.

It is possible for one nucleotide to pair with more than one other nucleotide in a sequence, but the .ct file format cannot correctly represent such a prediction.
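For anyone who wants to screen their own outputs for this problem, here is a minimal sketch of a checker. It assumes the usual connectivity-table layout (a header line, then one line per nucleotide with the index in the first column and the partner index in the fifth column, 0 meaning unpaired). The function names and the toy input are hypothetical and are not part of E2Efold; the toy input is not the attached 5ddp_A.ct file.

```python
from collections import defaultdict

# Hypothetical toy CT text, NOT the attached 5ddp_A.ct file.
# Nucleotides 1 and 2 both list nucleotide 6 as their partner.
TOY_CT = """6 toy
1 G 0 2 6 1
2 G 1 3 6 2
3 A 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 C 5 0 1 6
"""

def parse_ct(text):
    """Return {index: partner} from CT text (partner 0 means unpaired)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    pairs = {}
    for ln in lines[1:]:  # skip the header line
        fields = ln.split()
        pairs[int(fields[0])] = int(fields[4])
    return pairs

def find_multi_paired(pairs):
    """Return {j: [i, ...]} for each nucleotide j listed as partner by 2+ nucleotides."""
    claimed = defaultdict(list)
    for i, j in pairs.items():
        if j:
            claimed[j].append(i)
    return {j: sorted(v) for j, v in claimed.items() if len(v) > 1}

def find_asymmetric(pairs):
    """Return (i, j) pairs where i lists j but j does not list i back."""
    return [(i, j) for i, j in sorted(pairs.items()) if j and pairs.get(j) != i]

if __name__ == "__main__":
    table = parse_ct(TOY_CT)
    print(find_multi_paired(table))
    print(find_asymmetric(table))
```

An empty result from both functions means every nucleotide has at most one partner and every pair is listed consistently in both directions.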

Can you please have a look at this issue?

5ddp_A.seq.txt 5ddp_A.ct.txt

Thank you

liyu95 commented 4 years ago

Dear Jaswinder:

Thank you very much for your interest in our tool!

That's a very insightful question, and it relates to the algorithm design. As we discuss in the paper, we solve this problem with an unrolled gradient descent (GD) algorithm inside the deep learning model. In this project we fixed the number of GD iterations in the model, so for a few sequences on which the model is not very confident, that number of iterations may not be enough to enforce the constraints and output a valid structure. We did not implement a dynamic stopping algorithm here, which could resolve the issue, but we do have an extension project on it: we learn when to stop the algorithm while learning the model. Feel free to refer to our most recent paper:

Learning to Stop While Learning to Predict, ICML-2020, https://arxiv.org/pdf/2006.05082.pdf
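To make the failure mode concrete, here is a small sketch of how an unenforced constraint shows up after binarization. It assumes the constraint in question is that each base pairs with at most one other base, which is what the reported symptom suggests. The score matrix, the 0.5 threshold, and the function names are illustrative assumptions, not values or code from E2Efold.

```python
# Hypothetical symmetric pairing scores for a 4-base toy sequence.
SCORES = [
    [0.0, 0.1, 0.6, 0.9],
    [0.1, 0.0, 0.2, 0.7],
    [0.6, 0.2, 0.0, 0.1],
    [0.9, 0.7, 0.1, 0.0],
]

def binarize(scores, threshold=0.5):
    """Turn scores into a 0/1 contact matrix (threshold is an assumption)."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]

def violating_rows(contact):
    """Return 0-based indices of bases that have more than one partner."""
    return [i for i, row in enumerate(contact) if sum(row) > 1]

if __name__ == "__main__":
    contact = binarize(SCORES)
    print(violating_rows(contact))
```

If the scores have not converged to a state where each row has at most one large entry, thresholding leaves several 1s in a row, and that row cannot be written as a single partner in a .ct file.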

jaswindersingh2 commented 4 years ago

Thanks for the explanation. It now makes complete sense why some outputs look like that.

Just one suggestion: if possible, could E2Efold (productive) provide the predicted base-pair probabilities for a given input RNA? It would be helpful to see the confidence of each predicted base pair.

Thank you

liyu95 commented 4 years ago

> Just one suggestion: if possible, could E2Efold (productive) provide the predicted base-pair probabilities for a given input RNA? It would be helpful to see the confidence of each predicted base pair.

Yes, great suggestion! Our method can output that, but we currently binarize the final results. We will change the API to accommodate this in the future.
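Until such an API exists, a user with access to a pre-binarization probability matrix could report confidences and resolve conflicts along the following lines. This is only a sketch: the greedy selection is a generic post-processing heuristic, not the E2Efold algorithm, and the matrix, the threshold, and the function names are hypothetical.

```python
# Hypothetical symmetric base-pair probabilities for a 4-base toy sequence.
PROBS = [
    [0.0, 0.1, 0.6, 0.9],
    [0.1, 0.0, 0.2, 0.7],
    [0.6, 0.2, 0.0, 0.1],
    [0.9, 0.7, 0.1, 0.0],
]

def ranked_pairs(probs, threshold=0.5):
    """List (prob, i, j) with i < j and prob above threshold, highest first."""
    n = len(probs)
    pairs = [
        (probs[i][j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if probs[i][j] > threshold
    ]
    return sorted(pairs, reverse=True)

def greedy_one_partner(probs, threshold=0.5):
    """Keep the most probable pairs so that each base has at most one partner."""
    used = set()
    kept = []
    for p, i, j in ranked_pairs(probs, threshold):
        if i not in used and j not in used:
            kept.append((i, j, p))
            used.update((i, j))
    return kept

if __name__ == "__main__":
    for i, j, p in greedy_one_partner(PROBS):
        print(f"base {i} pairs with base {j} (p={p})")
```

The output of `greedy_one_partner` can always be written to a .ct file, since no base appears in more than one kept pair, while `ranked_pairs` still exposes the confidence of every candidate pair.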