moss-lab / ScanFold

ScanFold is an RNA sequence scanning pipeline which attempts to identify potentially functional RNA secondary structures. This is done by first scanning a single input sequence to identify regions which generate negative thermodynamic z-scores (a hallmark of functional RNA sequences), and subsequently identifying the specific base pairs which were responsible for generating the low z-scores.
https://peerj.com/articles/6136/
MIT License

RunTimeERROR: #16

Open Zjianglin opened 2 years ago

Zjianglin commented 2 years ago

Hello, Developer. I'm trying to use ScanFold to predict the secondary (2D) structure of an RNA sequence. However, I encounter the runtime error below:

Making output folder named:FLAVs_UTR5_ScanFoldOut_08-16-2021-22.35.59_669
Output name=MN661082_UTR5.win_120.stp_1.rnd_200.shfl_di
Scanning input sequence: MN661082_UTR5
Elapsed time: 32.95s
Determining best base pairs...
Elapsed time: 33.07s
Detecting competing pairs...
Trying to write CT files with -c option
Elapsed time: 34.18s
Writing CT files
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
<<<.......<...................).......<.........................................................((((((......))))))
No constraints will be applied!
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
<<.......<...................).......<.........................................................((((((......)))))).>
No constraints will be applied!
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
<.......<...................).......<.........................................................((((((......)))))).>>
No constraints will be applied!
WARNING: vrna_hc_add_from_db: Unbalanced brackets in constraint string
<...................).......<.........................................................((((((......)))))).>>>
No constraints will be applied!
WARNING: vrna_fold_compound@data_structures.c: sequence length must be greater 0
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/ScanFold.py", line 1735, in <module>
    fc.hc_add_from_db(str(es.structure))
  File "/home/zhoujl/bin/ViennaRNA/lib/python3.8/site-packages/RNA/__init__.py", line 3286, in hc_add_from_db
    return _RNA.fold_compound_hc_add_from_db(self, *args, **kwargs)
TypeError: in method 'fold_compound_hc_add_from_db', argument 1 of type 'vrna_fold_compound_t *'

My Conda environment is:

conda list | egrep "rna|numpy|biopython"
biopython                 1.78             py38h7b6447c_0    https://mirrors.ustc.edu.cn/anaconda/pkgs/main
infernal                  1.1.2                h516909a_3    bioconda
numpy                     1.18.5           py38ha1c710e_0    defaults
numpy-base                1.18.5           py38hde5b4d6_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
numpydoc                  1.1.0                      py_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
rnastructure              6.1                  he1b5a44_1    bioconda
tornado                   6.0.4            py38h7b6447c_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

Would you please help me? By the way, if my sequence is shorter than 70 bp, for example 30 bp or 50 bp, what is the best value for the stepwise window size? Thank you.

rjandr commented 2 years ago

This error looks to be related to the "Structure Extract" process. The script is looking for structures in the dot-bracket file but has found some that are "unbalanced". I will need to work on a fix for this, update, and let you know. Thanks for pointing it out!
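To illustrate what "unbalanced" means here: the first constraint string in the log above contains a ")" before any "(" has been opened. The following is a minimal sketch of a balance check, not ScanFold's or ViennaRNA's own code. The function name `parens_balanced` is hypothetical, and the sketch assumes that only round brackets need to pair up, so every other symbol (".", "<", ">") is ignored.

```python
def parens_balanced(structure):
    """Return True if every ')' closes an earlier '(' and none stay open.

    Assumption: only round brackets are checked; all other characters
    in the string are skipped.
    """
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no '(' left to match it.
                return False
    # Any remaining open '(' also makes the string unbalanced.
    return depth == 0


# A shortened string with the same problem as the one in the log.
print(parens_balanced("<<<....<......)....<....((((((......))))))"))
print(parens_balanced("....((((((......))))))...."))
```

A check like this could be run on each extracted structure before it is passed on as a constraint, so that problem strings are reported or skipped instead of reaching the folding step.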

rjandr commented 2 years ago
"btw, If my sequence was less than 70bp, for example 30 bp or 50bp, what is the best value for the stepwise window size? Thank you."

I would not recommend using ScanFold for short sequences. However, you could set the window size to 1 nt shorter than the sequence length and set the step to 1. This would create only one or two windows to analyze the sequence.
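The window arithmetic can be sketched as follows. This assumes windows are tiled in the usual sliding-window way (start at position 0, advance by the step, stop when a full window no longer fits). That assumption has not been checked against the ScanFold source, and `window_starts` is a hypothetical helper.

```python
from typing import List


def window_starts(seq_len: int, window: int, step: int = 1) -> List[int]:
    """Return the 0-based start positions of all full windows."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if window > seq_len:
        # No full window fits in the sequence.
        return []
    return list(range(0, seq_len - window + 1, step))


# A 50 nt sequence with a 49 nt window and a step of 1.
print(window_starts(50, 49, 1))
# The same sequence with a 120 nt window.
print(window_starts(50, 120, 1))
```

Under this assumption, a window 1 nt shorter than the sequence gives two overlapping windows, while a window longer than the sequence gives none.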

Zjianglin commented 2 years ago

Thanks for your reply. I will consider another tool for my short sequences.

When I run ScanFold on some sequences longer than 300 bp, I repeatedly encounter the runtime error below:

Scanning for k-mer: 538 to 557

Scanning for k-mer: 539 to 558

Scanning for k-mer: 540 to 559

Scanning for k-mer: 541 to 560

Scanning for k-mer: 542 to 561
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/ScanFold.py", line 1028, in <module>
    meanz = float(statistics.mean(zscore_total))
  File "/home/zhoujl/anaconda3/lib/python3.8/statistics.py", line 315, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

It seems that zscore_total is empty. However, I ran ScanFold with -w 120 -r 200 -t 37 --type di --lri --by_ed, so there should be 200 randomizations. Why does the error occur? Does ScanFold have any requirements for the input sequences, for example length, base content, or percentage of N bases?

rjandr commented 2 years ago

I would recommend not using --lri or --by_ed. Those were features I was testing and wanted to share with others. They don't look stable, so I will have to remove them for now.
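For reference, the StatisticsError in the traceback above is what Python's statistics.mean raises whenever it receives an empty sequence. The sketch below shows one generic way to guard against that. It is not a patch for ScanFold and does not explain why the list ended up empty. `mean_or_none` is a hypothetical helper name.

```python
import statistics


def mean_or_none(values):
    """Return the mean as a float, or None when there are no values."""
    values = list(values)
    if not values:
        # statistics.mean would raise StatisticsError on empty input.
        return None
    return float(statistics.mean(values))


print(mean_or_none([-1.0, -3.0]))
print(mean_or_none([]))
```

Returning None leaves the caller to decide whether an empty list of z-scores should stop the run or only produce a warning.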

rjandr commented 2 years ago

Also, you may be interested in using IGV-ScanFold. https://github.com/ResearchIT/IGV-ScanFold We have been putting some work into creating a stable build of IGV which can run ScanFold directly on genomes, or you can load any fasta file as a genome and run ScanFold on it!

Zjianglin commented 2 years ago

Okay, I will try IGV-ScanFold later.

So what causes the "statistics.StatisticsError: mean requires at least one data point" error? Should I just remove the --lri and --by_ed options and rerun?

rjandr commented 2 years ago

Yes, remove those options and rerun.

Zjianglin commented 2 years ago

Okay, thanks for your patience.

By the way, it seems ScanFold cannot process degenerate bases, for example M=A+C or K=G+T. When I run ScanFold on a sequence containing degenerate bases, it raises a KeyError.

scanfold --name demo --fold --out_name demoout -w 120 -r 200 -t 37 --type di --global_refold test.fa 
/Storage/p3/test
Making output folder named:demoout
Output name=KC181923_3'UTR.win_120.stp_1.rnd_200.shfl_di
Scanning input sequence: KC181923_3'UTR
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/ScanFold.py", line 549, in <module>
    scrambled_sequences = scramble(frag, randomizations, type)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 875, in scramble
    result = dinuclShuffle(frag)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 264, in dinuclShuffle
    ok,edgeList,nuclList,lastCh = eulerian(s)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 231, in eulerian
    nuclCnt,dinuclCnt,List = computeCountAndLists(s)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 188, in computeCountAndLists
    nuclCnt[y] += 1; nuclTotal  += 1
KeyError: 'K'

less test.fa | grep "K"
>KC181923_3'UTR Aedes flavivirus|Aedes_flavivirus|[10120:11079](+)
TTAGGGAGTTTGGAATACCTTTTCTATACCATAGATGCGC**K**GAAGCTTTAAAAATCGGG

There is a K in my sequence, and ScanFold fails to run. How should I handle this? After all, many sequences contain degenerate bases or N bases.

rjandr commented 2 years ago

My answer about degenerate bases is in issue #17 (https://github.com/moss-lab/ScanFold/issues/17). Briefly, the dinucleotide shuffling algorithm does not accept degenerate bases (more sophisticated methods can probably be found elsewhere), but you can use "N" nucleotides with ScanFold by using mononucleotide shuffling (--type mono; default).
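As a rough illustration of the point above, the sketch below lists any characters outside A, C, G, T, and U in a sequence and shows a generic mononucleotide shuffle. Both helpers (`non_canonical_bases`, `mono_shuffle`) are hypothetical and are not taken from ScanFold. The shuffle is a plain random permutation, which preserves single-base composition but not dinucleotide frequencies.

```python
import random

CANONICAL = set("ACGTU")


def non_canonical_bases(seq):
    """Return the sorted set of characters that are not A, C, G, T, or U."""
    return sorted({ch for ch in seq.upper() if ch not in CANONICAL})


def mono_shuffle(seq, rng=None):
    """Return a random permutation of seq (mononucleotide shuffle)."""
    rng = rng or random.Random()
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


fragment = "GATGCGCKGAAGCTTT"
print(non_canonical_bases(fragment))
print(mono_shuffle(fragment, random.Random(0)))
```

Running a check like `non_canonical_bases` on the input before choosing the shuffle type would show in advance whether a dinucleotide shuffle is likely to hit the KeyError reported earlier in this thread.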

Zjianglin commented 2 years ago

Thank you.

rjandr commented 2 years ago

No problem, please let me know if you have any more issues!