moss-lab / ScanFold

ScanFold is an RNA sequence scanning pipeline that attempts to identify potentially functional RNA secondary structures. It first scans a single input sequence to identify regions that generate negative thermodynamic z-scores (a hallmark of functional RNA sequences), then identifies the specific base pairs responsible for the low z-scores.
https://peerj.com/articles/6136/
MIT License

Degenerate Bases #17

Closed rjandr closed 3 years ago

rjandr commented 3 years ago

From @Zjianglin :

ScanFold cannot process degenerate bases (for example, M = A or C, K = G or T). When I run ScanFold on a sequence containing degenerate bases, it raises a KeyError.

scanfold --name demo --fold --out_name demoout -w 120 -r 200 -t 37 --type di --global_refold test.fa 
/Storage/p3/test
Making output folder named:demoout
Output name=KC181923_3'UTR.win_120.stp_1.rnd_200.shfl_di
Scanning input sequence: KC181923_3'UTR
Traceback (most recent call last):
  File "/home/zhoujl/packages/ScanFold/ScanFold.py", line 549, in <module>
    scrambled_sequences = scramble(frag, randomizations, type)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 875, in scramble
    result = dinuclShuffle(frag)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 264, in dinuclShuffle
    ok,edgeList,nuclList,lastCh = eulerian(s)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 231, in eulerian
    nuclCnt,dinuclCnt,List = computeCountAndLists(s)
  File "/home/zhoujl/packages/ScanFold/ScanFoldFunctions.py", line 188, in computeCountAndLists
    nuclCnt[y] += 1; nuclTotal  += 1
KeyError: 'K'

less test.fa | grep "K"
>KC181923_3'UTR Aedes flavivirus|Aedes_flavivirus|[10120:11079](+)
TTAGGGAGTTTGGAATACCTTTTCTATACCATAGATGCGC**K**GAAGCTTTAAAAATCGGG

There is a K in my sequence, and ScanFold failed to run. What should I do about this? After all, many sequences contain degenerate bases or N bases.

Originally posted by @Zjianglin in https://github.com/moss-lab/ScanFold/issues/16#issuecomment-900720685

rjandr commented 3 years ago

@Zjianglin Moving this into a separate issue.

ScanFold accepts "N", but the underlying algorithm effectively ignores that nucleotide as a potential pairing partner. You can replace degenerate nucleotides with "N"; ScanFold will then run, but ONLY with mononucleotide shuffling (--type mono; on by default). Currently, the dinucleotide shuffling algorithm cannot handle degenerate bases.
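For anyone who wants to script the replacement, here is a minimal sketch. It assumes a plain-text FASTA input and treats the IUPAC ambiguity codes R, Y, S, W, K, M, B, D, H and V as "degenerate". The function name `mask_degenerate` is hypothetical and is not part of ScanFold.

```python
# Hypothetical helper, not part of ScanFold: mask degenerate bases with N.
# Assumption: the IUPAC ambiguity codes other than N are the ones to replace.
DEGENERATE = set("RYSWKMBDHV")


def mask_degenerate(fasta_text):
    """Return (masked_text, count); header lines are left untouched."""
    out_lines = []
    replaced = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Headers may legitimately contain letters such as K or M.
            out_lines.append(line)
            continue
        masked = []
        for ch in line:
            if ch.upper() in DEGENERATE:
                # Preserve the case of the original character.
                masked.append("N" if ch.isupper() else "n")
                replaced += 1
            else:
                masked.append(ch)
        out_lines.append("".join(masked))
    return "\n".join(out_lines), replaced


if __name__ == "__main__":
    example = ">demo\nTTAGGKAGCM\n"
    masked, count = mask_degenerate(example)
    print(masked)
    print(count, "degenerate bases replaced")
```

After writing the masked text to a new FASTA file, run ScanFold on that file with mononucleotide shuffling (--type mono) rather than --type di.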