uci-cbcl / UFold

original dataset for process_data_newdataset.py #5

Closed L0-zhang closed 2 years ago

L0-zhang commented 2 years ago

Hello. From the source code, it looks like the original dataset you use as input for process_data_newdataset.py is different from the ct-format dataset used in e2efold (https://drive.google.com/open?id=19KPRYJjjMJh1qdMhtmUoYA_ncw3ocAHc). Could you tell me what your original data format is? An example would be helpful. Thank you very much.

L0-zhang commented 2 years ago

Does 'self.data_y' have the same meaning as in e2efold, where it is derived by converting dot-bracket notation to labels?

self.data_y = np.array([instance[1] for instance in self.data])

label_dict = { '.': np.array([1,0,0]), '(': np.array([0,1,0]), ')': np.array([0,0,1]) }

sperfu commented 2 years ago

Hi there,

Since our datasets come from multiple sources, in addition to the ct-format files retrieved from e2efold, we also collected datasets from MXfold2 and SPOT-RNA (the bpRNA dataset), which mainly use bpseq files as input. The bpseq format is simple, with three columns (position, base, and index of the paired position), as shown below:

1 A 37
2 U 36
3 C 35
4 U 34
5 C 33
6 A 32
7 C 31

We use these files to generate the downstream files in our work.
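For readers who want to load such a file themselves, here is a minimal sketch of a bpseq reader. It is not the code from process_data_newdataset.py; the function name `parse_bpseq` is hypothetical, and it assumes the common bpseq convention that a partner index of 0 means the base is unpaired and that comment lines start with `#`.

```python
from typing import List, Tuple


def parse_bpseq(text: str) -> Tuple[str, List[int]]:
    """Parse bpseq-formatted text (hypothetical helper, not from UFold).

    Returns the sequence and a list of 1-based partner indices,
    where 0 is assumed to mean "unpaired".
    """
    bases: List[str] = []
    partners: List[int] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        _index, base, partner = line.split()
        bases.append(base)
        partners.append(int(partner))
    return "".join(bases), partners


# First three rows of the excerpt from this thread (a truncated file).
example = "1 A 37\n2 U 36\n3 C 35\n"
seq, partners = parse_bpseq(example)
print(seq, partners)
```

Because the excerpt in this thread is only the first few rows of a longer file, the partner indices point past the end of the parsed fragment; a complete file would contain the matching rows.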

As for your second question: yes, 'self.data_y' has the same meaning as in e2efold. However, self.data_y is derived from bpseq or ct files rather than from a dot-bracket file. As the code in process_data_newdataset.py shows (https://github.com/uci-cbcl/UFold/blob/528533143e194854e264fcfd9802252c95f2f6b7/process_data_newdataset.py#L102), we first convert the pairs into a pair list, then use that pair information to generate the label dict.
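The two steps described here (pairs to pair list, then pair list to labels) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code at the linked line: the helper names `partners_to_pair_list` and `pair_list_to_labels` are hypothetical, the partner column is assumed to be 1-based with 0 meaning unpaired, and the one-hot encoding reuses the label dict quoted earlier in this thread.

```python
import numpy as np

# One-hot encoding quoted earlier in this thread.
LABEL_DICT = {
    ".": np.array([1, 0, 0]),
    "(": np.array([0, 1, 0]),
    ")": np.array([0, 0, 1]),
}


def partners_to_pair_list(partners):
    """Turn a 1-based partner column into 0-based (i, j) pairs with i < j."""
    pairs = []
    for i, p in enumerate(partners):
        j = p - 1  # convert to 0-based; p == 0 is assumed to mean unpaired
        if p > 0 and i < j:
            pairs.append((i, j))
    return pairs


def pair_list_to_labels(pairs, length):
    """Build an (length, 3) one-hot label array from a pair list."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i] = "("  # upstream partner opens the pair
        symbols[j] = ")"  # downstream partner closes it
    return np.stack([LABEL_DICT[s] for s in symbols])


# Toy 7-base hairpin: positions 1-7 and 2-6 are paired.
partners = [7, 6, 0, 0, 0, 2, 1]
pairs = partners_to_pair_list(partners)
labels = pair_list_to_labels(pairs, len(partners))
print(pairs)
print(labels.shape)
```

Note that this three-class encoding only records whether a base opens or closes a pair, so crossing pairs (pseudoknots) are not distinguished from nested ones.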

Thanks

L0-zhang commented 2 years ago

OK, got it. Thank you very much for the detailed reply.