uci-cbcl / UFold

length RNA #10

Open mtinti opened 2 years ago

mtinti commented 2 years ago

Hi, I'm getting an error when I try to predict RNA sequences longer than 600 bases:

Here is the error when I input sequences of 700 bases:

Welcome using UFold prediction tool!!!
Traceback (most recent call last):
  File "/cluster/majf_lab/mtinti/UFold/ufold_predict.py", line 328, in <module>
    main()
  File "/cluster/majf_lab/mtinti/UFold/ufold_predict.py", line 302, in main
    test_data = RNASSDataGenerator_input('data/', 'input')
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 217, in __init__
    self.load_data()
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 229, in load_data
    self.data_x = np.array([self.one_hot_600(item) for item in self.seq])
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 229, in <listcomp>
    self.data_x = np.array([self.one_hot_600(item) for item in self.seq])
  File "/cluster/majf_lab/mtinti/UFold/ufold/data_generator.py", line 244, in one_hot_600
    one_hot_matrix_600[:len(seq_item),] = feat
ValueError: could not broadcast input array from shape (700,4) into shape (600,4)

Is this expected? I thought I could go up to 1600bp...
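
For anyone reading along, the failure in the traceback can be reproduced in isolation with NumPy. This is only a minimal sketch: `one_hot_fixed`, `MAX_LEN`, and the `"AUCG"` column order are hypothetical stand-ins, not UFold's actual code. It assumes the encoder pads into a fixed 600-row buffer, as the shapes in the error message suggest.

```python
import numpy as np

MAX_LEN = 600  # assumption: fixed buffer length implied by the error message


def one_hot_fixed(seq, max_len=MAX_LEN):
    """Hypothetical fixed-size one-hot encoder (not UFold's implementation)."""
    alphabet = "AUCG"  # assumed column order, for illustration only
    feat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        if base in alphabet:
            feat[i, alphabet.index(base)] = 1.0
    padded = np.zeros((max_len, 4))
    # The slice is clipped to max_len rows, so a longer `feat` cannot fit.
    padded[:len(seq), ] = feat
    return padded


try:
    one_hot_fixed("A" * 700)
except ValueError as err:
    message = str(err)
    print(message)
```

Sequences of 600 bases or fewer are padded with zeros and pass through; anything longer raises the same kind of `ValueError` as above.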

Cheers Michele

sperfu commented 2 years ago

Hi Michele,

Thanks for reaching out. UFold can go up to 1600 bp. However, as the sequence gets longer, memory usage and computation time inevitably grow during training and testing, and very long inputs can cause severe out-of-memory errors, especially on our backend server. To keep the backend from crashing, we have deliberately limited the sequence length to 600 bp, which gives the best efficiency and accuracy. Thank you for understanding.
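
To give a rough feel for why longer inputs get expensive, here is a back-of-the-envelope sketch. It assumes the model works on a pairwise length-by-length tensor; the function name `pairwise_tensor_bytes`, the channel count, and the 4-byte float size are illustrative assumptions, not measurements of UFold.

```python
def pairwise_tensor_bytes(length, channels=1, bytes_per_value=4):
    """Bytes needed for one hypothetical (channels, length, length) float tensor."""
    return length * length * channels * bytes_per_value


# Compare the default limit with longer inputs (illustrative lengths only).
for n in (600, 1000, 1600):
    mib = pairwise_tensor_bytes(n) / 2**20
    print(f"{n} nt: {mib:.1f} MiB per channel")
```

Under that assumption, doubling the sequence length quadruples the size of every such tensor, before counting intermediate activations.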

Nevertheless, we have also added a commented-out line in the data_generator.py file (line 244), as shown here: https://github.com/uci-cbcl/UFold/blob/174437f48167073dac4a7f794bc24b2623aed077/ufold/data_generator.py#L244. You may replace this line with line 243 to get features for the whole sequence length. As I mentioned earlier, though, this may result in a high computational cost, so we still recommend keeping sequences within 900 to 1000 nt (ideally within 600 bp). You may also cut a long sequence into multiple shorter ones for prediction.
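
Cutting a sequence into shorter pieces could look roughly like the sketch below. The helper `split_sequence` is hypothetical and not part of UFold; it assumes simple, non-overlapping chunks of at most 600 nt.

```python
def split_sequence(seq, max_len=600):
    """Split `seq` into consecutive, non-overlapping chunks of at most `max_len`."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


# Example: a 700-nt placeholder sequence becomes two chunks.
chunks = split_sequence("A" * 700)
print([len(chunk) for chunk in chunks])
```

Note that base pairs spanning a chunk boundary cannot be predicted this way, so the cut points are worth choosing with care.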

Thanks

mtinti commented 2 years ago

Thanks for the speedy response! I'll try your suggestions.

cheers Michele