Open ErikMarklund opened 2 years ago
Hi,
Sorry for the time it's taken me to get back to you (I've been on holiday and moved house). This is unlikely to be an issue with Perl. The Perl code just checks that the outputs of the various C programs are OK before it proceeds. If you look in the src/ directory you'll find assorted .c and .cpp files; the problem likely arises in one or more of them. If you can find some places where the sequence length is set to a fixed number, then you can likely fix this by monkey patching your disopred and recompiling the files.
For instance, I see that both disordcomb_pred.c and diso_neighb.c contain the line #define MAXSEQLEN 50000
I'd guess you can just change that to some other big number (e.g. 70000) and it should work. So track down as many similar sequence-length settings (maxseqlen, seqlen, sequence_length) as you can find, increase their sizes, and then just recompile with:
cd src
make clean
make
make install
And, fingers crossed, it should work.
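To track down those limits, something along these lines might help. It is only a sketch: the function name find_seqlen_limits is made up for illustration, and the search pattern is a guess based on the names mentioned above, so it may miss limits that are named differently.

```shell
# Hypothetical helper (not part of DISOPRED): list every line under a
# directory that looks like a sequence-length limit.
#   -r  recurse into subdirectories
#   -n  print line numbers
#   -i  ignore case (matches MAXSEQLEN as well as maxseqlen)
#   -I  skip binary files such as compiled objects
#   -E  extended regular expressions, needed for the | alternation
# The pattern itself is an assumption based on the names suggested above.
find_seqlen_limits() {
    grep -rniIE 'maxseqlen|seqlen|sequence_length' "${1:-src}"
}
```

Calling find_seqlen_limits src from the top of the DISOPRED directory should print the file name, line number and text of each candidate, which you can then review by hand before changing anything.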
C I know! Will try to change the macros to accommodate the titin sequences. Many thanks!
I will report back once I've tried it. Just need to wait a week or so until my current calculations end. Don't want to recompile mid-analysis.
Hi again,
I realised that the buffers were probably long enough, since they are 50000 by default and the sequences in question are about 32500 aa each. Because I reran my entire analysis using a larger reference database (uniref90), I also reran these long sequences under the same conditions. This time I get a different error:
ERROR: Different numbers of elements in the profile data structure and the array of disordered region lengths
which occurs for all three proteins.
It is not obvious to me what I can do to fix this, and I can live without these three proteins. But in case you are interested in digging deeper, the proteins in question have uniprot IDs A2ASS6, E9Q8K5, and E9Q8N1. I'd be happy to answer any questions about what I did to get this error, but I think it is pretty straightforward since I have not used anything unorthodox or modified anything.
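For anyone who wants to repeat the length check against the default limit, a rough sketch follows. The helper names fasta_length and fits_limit are made up for illustration, the input is assumed to be a FASTA file holding a single sequence, and treating the limit as exclusive is a cautious assumption rather than something taken from the DISOPRED source.

```shell
# Hypothetical helper: count the residues in a single-record FASTA file,
# ignoring the header line and any whitespace or line breaks.
fasta_length() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Hypothetical helper: succeed if the sequence is shorter than the limit.
# The default of 50000 mirrors the "#define MAXSEQLEN 50000" line quoted
# earlier in the thread; the strict comparison is an assumption.
fits_limit() {
    [ "$(fasta_length "$1")" -lt "${2:-50000}" ]
}
```

Running fasta_length on each of the three titin files and comparing the result with 50000 would confirm whether the buffers can be ruled out.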
run_disopred.pl fails for the three titin variants A2ASS6, E9Q8K5, and E9Q8N1 (uniprot accession codes). These are very long sequences, >30000 aa. No non-standard amino acids can be found in the sequences. See the output below.
My Perl skills are just too weak to figure out what goes wrong, but the sequence length seems like a likely culprit. All of the more than 55000 other proteins in my dataset worked fine.