Open Zjianglin opened 2 years ago
This error looks to be related to the "Structure Extract" process. The script is looking for structures in the dot-bracket file but has found some that are "unbalanced". I will need to work on a fix for this, update, and let you know. Thanks for pointing it out!
By the way, if my sequence is shorter than 70 bp, for example 30 bp or 50 bp, what is the best value for the stepwise window size? Thank you.
I would not recommend using ScanFold for short sequences. But you could set the window size to be 1 nt shorter than the sequence length and set the step to 1. This would create just one or two windows to analyze the sequence.
Thanks for your reply. I will consider another tool for my short sequences.
When I run ScanFold on some 300 bp+ sequences, I repeatedly encounter the runtime error below:
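To see why a window 1 nt shorter than the sequence gives so few windows, here is a minimal sketch of the arithmetic. It assumes plain fixed-size sliding windows; `window_starts` is a hypothetical helper for illustration and is not taken from the ScanFold code.

```python
from typing import List


def window_starts(seq_len: int, window: int, step: int = 1) -> List[int]:
    """Return the 0-based start position of every full-length window."""
    if window > seq_len:
        # No full-length window fits in the sequence.
        return []
    return list(range(0, seq_len - window + 1, step))


# A 50 nt sequence scanned with a 49 nt window and a step of 1.
seq_len = 50
starts = window_starts(seq_len, window=seq_len - 1, step=1)
print(len(starts), starts)
```

Under that assumption, the number of windows is (sequence length - window size) // step + 1, so a window of length - 1 with step 1 yields two windows.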
Scanning for k-mer: 538 to 557
Scanning for k-mer: 539 to 558
Scanning for k-mer: 540 to 559
Scanning for k-mer: 541 to 560
Scanning for k-mer: 542 to 561
Traceback (most recent call last):
File "/home/zhoujl/packages/ScanFold/ScanFold.py", line 1028, in <module>
meanz = float(statistics.mean(zscore_total))
File "/home/zhoujl/anaconda3/lib/python3.8/statistics.py", line 315, in mean
raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point
It seems that zscore_total is empty. However, I ran ScanFold with -w 120 -r 200 -t 37 --type di --lri --by_ed, so there should be 200 randomizations. Why does the error occur? Does ScanFold have any requirements for the input sequences, for example length, base content, or percentage of N bases?
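For reference, the traceback comes from Python's `statistics.mean`, which raises `StatisticsError` whenever it is given an empty sequence. The sketch below reproduces that behavior and shows one way to guard against it. `mean_or_none` is a hypothetical helper, not part of ScanFold, and it does not explain why the list ended up empty in the first place.

```python
import statistics


def mean_or_none(values):
    """Return the mean of values as a float, or None when values is empty."""
    try:
        return float(statistics.mean(values))
    except statistics.StatisticsError:
        # statistics.mean raises this when it receives no data points.
        return None


zscore_total = []  # stand-in for a run in which no z-scores were collected
meanz = mean_or_none(zscore_total)
if meanz is None:
    print("No z-scores were collected; check the input sequence and options.")
else:
    print(f"mean z-score: {meanz:.2f}")
```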
I would recommend not using --lri or --by_ed. Those were features I was testing and wanted to share with others. They don't look stable, so I will have to remove them for now.
Also, you may be interested in using IGV-ScanFold (https://github.com/ResearchIT/IGV-ScanFold). We have been putting some work into creating a stable build of IGV that can run ScanFold directly on genomes, or you can load any FASTA file as a genome and run ScanFold on it!
Okay, I will try IGV-ScanFold later.
So what causes the statistics.StatisticsError: mean requires at least one data point error? Should I just remove the --lri --by_ed options and rerun?
Yes, remove those options and rerun.
Okay, Thanks for your patience.
By the way, it seems ScanFold cannot process degenerate bases, for example M=A+C, K=G+T. When I run ScanFold with a sequence containing degenerate bases as input, it raises a KeyError.
scanfold --name demo --fold --out_name demoout -w 120 -r 200 -t 37 --type di --global_refold test.fa
/Storage/p3/test
Making output folder named:demoout
Output name=KC181923_3'UTR.win_120.stp_1.rnd_200.shfl_di
Scanning input sequence: KC181923_3'UTR
Traceback (most recent call last):
File "/home/zhoujl/packages/ScanFold/ScanFold.py", line 549, in <module>
scrambled_sequences = scramble(frag, randomizations, type)
File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 875, in scramble
result = dinuclShuffle(frag)
File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 264, in dinuclShuffle
ok,edgeList,nuclList,lastCh = eulerian(s)
File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 231, in eulerian
nuclCnt,dinuclCnt,List = computeCountAndLists(s)
File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 188, in computeCountAndLists
nuclCnt[y] += 1; nuclTotal += 1
KeyError: 'K'
less test.fa | grep "K"
>KC181923_3'UTR Aedes flavivirus|Aedes_flavivirus|[10120:11079](+)
TTAGGGAGTTTGGAATACCTTTTCTATACCATAGATGCGC**K**GAAGCTTTAAAAATCGGG
There is a K in my sequence, and ScanFold fails to run. What should I do about this? After all, many sequences contain degenerate bases or N bases.
See my answer about degenerate bases here (https://github.com/moss-lab/ScanFold/issues/17). Briefly, the dinucleotide shuffling algorithm does not allow degenerate bases (more sophisticated methods can probably be found elsewhere), but you can use "N" nucleotides with ScanFold using mononucleotide shuffling (--type mono; default).
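Before choosing a shuffling type, it can help to check the input for anything other than the standard bases. The following is a minimal sketch of such a pre-check. `find_noncanonical` and its `allowed` parameter are hypothetical names for illustration and are not part of ScanFold.

```python
def find_noncanonical(seq, allowed="ACGTU"):
    """Return (1-based position, character) for every base not in `allowed`."""
    return [
        (pos, base)
        for pos, base in enumerate(seq.upper(), start=1)
        if base not in allowed
    ]


# The sequence line from the report above, with the degenerate base K.
seq = "TTAGGGAGTTTGGAATACCTTTTCTATACCATAGATGCGCKGAAGCTTTAAAAATCGGG"
hits = find_noncanonical(seq)
if hits:
    print("Non-standard bases found:", hits)
else:
    print("Only standard bases found.")
```

If the check reports degenerate bases, dinucleotide shuffling will fail as in the traceback above. Passing `allowed="ACGTUN"` would tolerate N, which matches the advice to use mononucleotide shuffling for sequences containing N.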
Thank you.
No problem, please let me know if you have any more issues!
Hello, Developer. I'm trying to use ScanFold to predict the 2D structure of an RNA sequence. However, I encounter the runtime error below:
My Conda environment is
Would you please help me? By the way, if my sequence is shorter than 70 bp, for example 30 bp or 50 bp, what is the best value for the stepwise window size? Thank you.