moss-lab / ScanFold

ScanFold is an RNA sequence scanning pipeline which attempts to identify potentially functional RNA secondary structures. This is done by first scanning a single input sequence to identify regions which generate negative thermodynamic z-scores (a hallmark of functional RNA sequences), and subsequently identifying the specific base pairs which were responsible for generating the low z-scores.
MIT License

RunTimeERROR: #16

Open Zjianglin opened 2 years ago

Zjianglin commented 2 years ago

Hello, Developer, I'm trying to use ScanFold to predict the 2D structure of an RNA sequence. However, I encountered the runtime error below:

Making output folder named:FLAVs_UTR5_ScanFoldOut_08-16-2021-22.35.59_669
Output name=MN661082_UTR5.win_120.stp_1.rnd_200.shfl_di
Scanning input sequence: MN661082_UTR5
Elapsed time: 32.95s
Determining best base pairs...
Elapsed time: 33.07s
Detecting competing pairs...
Trying to write CT files with -c option
Elapsed time: 34.18s
Writing CT files
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
No constraints will be applied!
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
No constraints will be applied!
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
No constraints will be applied!
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
No constraints will be applied!
WARNING: vrna_fold_compound@data_structures.c: sequence length must be greater 0
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/", line 1735, in <module>
  File "/home/zhoujl/bin/ViennaRNA/lib/python3.8/site-packages/RNA/", line 3286, in hc_add_from_db
    return _RNA.fold_compound_hc_add_from_db(self, *args, **kwargs)
TypeError: in method 'fold_compound_hc_add_from_db', argument 1 of type 'vrna_fold_compound_t *'

My Conda environment is

conda list | egrep "rna|numpy|biopython"
biopython                 1.78             py38h7b6447c_0
infernal                  1.1.2                h516909a_3    bioconda
numpy                     1.18.5           py38ha1c710e_0    defaults
numpy-base                1.18.5           py38hde5b4d6_0
numpydoc                  1.1.0                      py_0
rnastructure              6.1                  he1b5a44_1    bioconda
tornado                   6.0.4            py38h7b6447c_1

Would you please help me? btw, if my sequence is shorter than 70 bp, for example 30 bp or 50 bp, what is the best value for the stepwise window size? Thank you.

rjandr commented 2 years ago

This error looks to be related to the "Structure Extract" process. The script is looking for structures in the dot-bracket file but has found some that are "unbalanced". I will need to work on a fix for this, update, and let you know. Thanks for pointing it out!
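The "Unbalanced brackets in constraint string" warnings in the log can be checked for before a structure is passed on as a constraint. The following is a minimal sketch of such a check, not ScanFold's own code; the function name `is_usable_constraint` is hypothetical, and it only considers plain `(`, `)` and `.` notation.

```python
def is_usable_constraint(dot_bracket):
    """Return True if the string is non-empty and its parentheses are balanced.

    Hypothetical helper for illustration; not part of ScanFold or ViennaRNA.
    """
    if not dot_bracket:
        # An empty string would also trigger the "sequence length" warning.
        return False
    depth = 0
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                # A closing bracket appeared before its opening partner.
                return False
    return depth == 0


print(is_usable_constraint("((..))"))  # balanced
print(is_usable_constraint("((..)"))   # one '(' is never closed
```

Structures that fail this check could be skipped or reported instead of being handed to the folding step.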

rjandr commented 2 years ago
"btw, if my sequence is shorter than 70 bp, for example 30 bp or 50 bp, what is the best value for the stepwise window size? Thank you."

I would not recommend using ScanFold for short sequences. However, you could set the window size to 1 nt shorter than the sequence length and set the step to 1. This would create just one or two windows to analyze the sequence.
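To see how few windows that setting produces, here is a small sketch of generic sliding-window arithmetic. It assumes windows start at position 0 and must fit entirely inside the sequence; the function name `window_starts` is hypothetical and ScanFold's own loop may differ in detail.

```python
def window_starts(seq_len, window, step=1):
    """Return the 0-based start positions of all full-length windows.

    Generic sliding-window arithmetic; an assumption, not ScanFold's code.
    """
    if window > seq_len:
        return []
    return list(range(0, seq_len - window + 1, step))


# A 50 nt sequence with a 49 nt window and a step of 1:
print(window_starts(50, 49, 1))        # [0, 1]
# A 300 nt sequence with the 120 nt window used elsewhere in this thread:
print(len(window_starts(300, 120, 1)))  # 181
```

With only two windows, each nucleotide is covered by very few z-score measurements, which is consistent with the advice against using ScanFold on short sequences.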

Zjianglin commented 2 years ago

Thanks for your reply. I will consider another tool for my short sequences.

When I run ScanFold on some 300 bp+ sequences, I repeatedly encounter the runtime error below:

Scanning for k-mer: 538 to 557
Scanning for k-mer: 539 to 558
Scanning for k-mer: 540 to 559
Scanning for k-mer: 541 to 560
Scanning for k-mer: 542 to 561
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/", line 1028, in <module>
    meanz = float(statistics.mean(zscore_total))
  File "/home/zhoujl/anaconda3/lib/python3.8/", line 315, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

It seems that zscore_total is empty. However, I ran ScanFold with -w 120 -r 200 -t 37 --type di --lri --by_ed, so there should be 200 randomizations. Why does the error occur? Does ScanFold have any requirements for the input sequences, for example length, base content, or percentage of N bases?
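The traceback itself only says that `statistics.mean` received an empty collection. A minimal sketch of guarding that call is shown below; the helper name `mean_or_none` is hypothetical, and it does not explain why the list was empty in this run.

```python
import statistics


def mean_or_none(values):
    """Return the mean as a float, or None when there are no data points.

    Hypothetical helper for illustration; not part of ScanFold.
    """
    values = list(values)
    if not values:
        # statistics.mean([]) would raise StatisticsError here.
        return None
    return float(statistics.mean(values))


print(mean_or_none([-1.0, -3.0]))  # -2.0
print(mean_or_none([]))            # None
```

A guard like this turns the crash into a value the caller can test for, so an empty set of z-scores can be reported with a clearer message.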

rjandr commented 2 years ago

I would recommend not using --lri or --by_ed. Those were features I was testing and wanted to share with others. They don't look stable, so I will have to remove them for now.

rjandr commented 2 years ago

Also, you may be interested in using IGV-ScanFold. We have been putting some work into creating a stable build of IGV which can run ScanFold directly on genomes, or you can load any fasta file as a genome and run ScanFold on it!

Zjianglin commented 2 years ago

Okay, I will try IGV-ScanFold later.

So what causes the "statistics.StatisticsError: mean requires at least one data point" error? Should I just remove the --lri and --by_ed options and rerun?

rjandr commented 2 years ago

Yes, remove those options and rerun.

Zjianglin commented 2 years ago

Okay, thanks for your patience.

By the way, it seems ScanFold cannot process degenerate bases, for example M=A+C, K=G+T. When I run ScanFold with a sequence containing degenerate bases as input, it raises a KeyError.

scanfold --name demo --fold --out_name demoout -w 120 -r 200 -t 37 --type di --global_refold test.fa 
Making output folder named:demoout
Output name=KC181923_3'UTR.win_120.stp_1.rnd_200.shfl_di
Scanning input sequence: KC181923_3'UTR
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/", line 549, in <module>
    scrambled_sequences = scramble(frag, randomizations, type)
  File "/home/zhoujl/packages/ScanFold/", line 875, in scramble
    result = dinuclShuffle(frag)
  File "/home/zhoujl/packages/ScanFold/", line 264, in dinuclShuffle
    ok,edgeList,nuclList,lastCh = eulerian(s)
  File "/home/zhoujl/packages/ScanFold/", line 231, in eulerian
    nuclCnt,dinuclCnt,List = computeCountAndLists(s)
  File "/home/zhoujl/packages/ScanFold/", line 188, in computeCountAndLists
    nuclCnt[y] += 1; nuclTotal  += 1
KeyError: 'K'

less test.fa | grep "K"
>KC181923_3'UTR Aedes flavivirus|Aedes_flavivirus|[10120:11079](+)

There is a K in my sequence, but ScanFold failed to run. How should I handle this? After all, many sequences contain degenerate bases or N bases.

rjandr commented 2 years ago

See my answer about degenerate bases here. Briefly, the dinucleotide shuffling algorithm does not allow degenerate bases (more sophisticated methods can probably be found elsewhere), but you can use "N" nucleotides with ScanFold by using mononucleotide shuffling (--type mono; default).
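Before choosing a shuffling type, it can help to list which non-canonical characters a FASTA file actually contains. Note that running grep on the whole file also matches header lines (the accession KC181923 itself contains a K). The sketch below skips headers and counts anything other than A, C, G, T or U; the function name `non_canonical_bases` is hypothetical and this is not part of ScanFold.

```python
from collections import Counter

CANONICAL = set("ACGTU")


def non_canonical_bases(fasta_text):
    """Count sequence characters outside A/C/G/T/U, ignoring header lines.

    Hypothetical helper for illustration; not part of ScanFold.
    """
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # header lines may contain letters such as K
        counts.update(line.strip().upper())
    return {base: n for base, n in counts.items() if base not in CANONICAL}


example = ">KC_demo example header\nACGKT\nNNAC\n"
print(non_canonical_bases(example))  # {'K': 1, 'N': 2}
```

If the result contains only N, mononucleotide shuffling may be an option as described above; other degenerate codes such as K or M would need to be handled before dinucleotide shuffling.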

Zjianglin commented 2 years ago

Thank you.

rjandr commented 2 years ago

No problem, please let me know if you have any more issues!