kad-ecoli / rna3db

maintain local copy of RNA structure database

Poor performance with mxfold2 #4

Closed marc-harary closed 3 years ago

marc-harary commented 3 years ago

I have evaluated mxfold2 on the PDB dataset, but the performance has been abysmal, especially in comparison with SPOT-RNA. I was wondering if something is wrong with my code. Here is the function I have been using to call mxfold2 and parse its output. Note that mxfold2 prints its prediction in dot-bracket format.


```python
import re
import subprocess
from collections import deque


def mxfoldPredict(input_path):
    """ Calls mxfold2 function and parses output. Returns the
    sequence, list of contacts represented as tuples between the ith
    and jth bases, and the number of predicted base pairs. """

    commandLst = ["mxfold2", "predict", input_path]
    output = subprocess.check_output(commandLst)
    outputStr = output.decode("utf-8")

    # First run of nucleotide letters and first run of dot-bracket characters
    sequence = re.search("[AUCG]{2,}", outputStr).group()
    ct_str = re.search("[(.)]+", outputStr).group()

    # Match each ")" with the most recent unmatched "("; indices are 0-based
    stack = deque()
    ct_list = []
    for j, char in enumerate(ct_str):
        if char == "(":
            stack.append(j)
        if char == ")":
            i = stack.pop()
            ct_list.append((i, j))

    return sequence, ct_list, len(ct_list)
```

kad-ecoli commented 3 years ago

I am not entirely sure what the issue is, as you did not include the full code that generates the final output. However, one possible issue is that (i, j) in your code is 0-indexed, while the label file provided in the SPOT-RNA PDB dataset is 1-indexed.
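
If the indexing is indeed the mismatch, the fix could be as small as the sketch below. It assumes the label file stores each pair as 1-indexed (i, j) positions; `to_one_indexed` is a hypothetical helper name used only for illustration, not something from mxfold2 or the dataset.

```python
def to_one_indexed(pairs):
    """Shift 0-indexed (i, j) base pairs to 1-indexed positions."""
    return [(i + 1, j + 1) for i, j in pairs]


# Pairs that the stack-based parser yields for the dot-bracket string "((..))"
zero_indexed = [(1, 4), (0, 5)]
one_indexed = to_one_indexed(zero_indexed)
print(one_indexed)  # [(2, 5), (1, 6)]
```

Without the shift, every predicted pair is off by one position relative to the labels, so even a correct prediction would score as almost entirely wrong.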

kad-ecoli commented 3 years ago

Also, you should change `sequence = re.search("[AUCG]{2,}", outputStr).group()` to `sequence = re.search("[ATUCG]{2,}", outputStr).group()`, because apparently a small number of the input sequences also contain nucleotide type T.
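
A small sketch of why this matters, assuming mxfold2 echoes the input sequence unchanged on its own line. The output string here is made up for illustration (header, sequence and energy value are not real results).

```python
import re

# Hypothetical mxfold2-style output: header line, sequence line, structure line
example_output = ">seq1\nGGGATACCC\n(((...))) (-1.0)\n"

old_match = re.search("[AUCG]{2,}", example_output).group()
new_match = re.search("[ATUCG]{2,}", example_output).group()

print(old_match)  # GGGA (stops at the first T)
print(new_match)  # GGGATACCC (the full sequence)
```

With the original pattern the returned sequence is silently truncated at the first T, so its length no longer matches the dot-bracket string.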