Failing for long sequences

ErikMarklund commented 2 years ago

run_disopred.pl fails for the three titin variants A2ASS6, E9Q8K5, and E9Q8N1 (uniprot accession codes). These are very long sequences, >30000 aa. No non-standard amino acids can be found in the sequences. See the output below.

E9Q8N1
Running PSI-BLAST search ...

Generating PSSM ...

Predicting disorder with DISOPRED2 ...

/domus/h1/marklund/src/disopred3/disopred/bin/disopred2 /domus/h1/marklund/calc_disorder/wd/E9Q8N1 /domus/h1/marklund/calc_disorder/wd/E9Q8N1_26496_12accd0b.mtx /domus/h1/marklund/src/disopred3/disopred/data/ 5 
mv /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso2 
Running neural network classifier ...

/domus/h1/marklund/src/disopred3/disopred/bin/diso_neu_net /domus/h1/marklund/src/disopred3/disopred/data/weights.dat.nmr_nonpdb /domus/h1/marklund/calc_disorder/wd/E9Q8N1_26496_12accd0b.mtx > /domus/h1/marklund/calc_disorder/wd/E9Q8N1.nndiso 
Running nearest neighbour classifier ...

/domus/h1/marklund/src/disopred3/disopred/bin/diso_neighb /domus/h1/marklund/calc_disorder/wd/E9Q8N1_26496_12accd0b.mtx /domus/h1/marklund/src/disopred3/disopred/data/dso.lst > /domus/h1/marklund/calc_disorder/wd/E9Q8N1.dnb 
Combining disordered residue predictions ...

/domus/h1/marklund/src/disopred3/disopred/bin/combine /domus/h1/marklund/src/disopred3/disopred/data/weights_comb.dat /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso2 /domus/h1/marklund/calc_disorder/wd/E9Q8N1.nndiso /domus/h1/marklund/calc_disorder/wd/E9Q8N1.dnb > /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso 
[/home/marklund/src/disopred3/disopred/run_disopred.pl] ERROR: Different numbers of elements in the profile data structure and the array of disordered region lengths

My Perl skills are just too weak to figure out what goes wrong, but the sequence length seems a likely culprit. All >55000 other proteins in my dataset worked fine.

DanBuchan commented 2 years ago

Hi,

Sorry for the time it's taken me to get back to you (been on holiday and moved house). This is unlikely to be an issue with Perl. The Perl code is just checking that the outputs of the various C programs are OK before it proceeds. If you look in the src/ directory you'll find assorted .c and .cpp files; the problem likely arises in one or more of them. If you can find places where the sequence length is capped at a fixed number, you can likely fix this by monkey patching your disopred and recompiling the files.

For instance, I see that both disordcomb_pred.c and diso_neighb.c contain the line #define MAXSEQLEN 50000. I'd guess you can just change that to some other big number (e.g. 70000) and it should work. So track down as many similar sequence-length variables (maxseqlen, seqlen, sequence_length) as you can find, increase their sizes, and then recompile with:

cd src
make clean
make
make install

And, fingers crossed, it should work.
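As a rough sketch of the "track down the limits" step, a recursive grep can list the candidate macros before anything is edited. The helper name find_seqlen_limits and the search pattern are assumptions made for illustration; the pattern only guesses at likely macro names and will miss limits hard-coded as plain array sizes. It relies on grep's --include option (available in GNU and BSD grep).

```shell
# Hypothetical helper: list #define lines whose macro name contains SEQ or LEN
# and whose value is a plain integer (e.g. "#define MAXSEQLEN 50000").
# The name pattern is a guess, not a list taken from the DISOPRED sources.
find_seqlen_limits() {
    # $1: directory to search (defaults to the current directory)
    grep -rniE --include='*.c' --include='*.cpp' --include='*.h' \
        '#[[:space:]]*define[[:space:]]+[A-Za-z_]*(SEQ|LEN)[A-Za-z_]*[[:space:]]+[0-9]+' \
        "${1:-.}"
}
```

Calling find_seqlen_limits src from the disopred directory would print each match as file:line:text, which gives a checklist of values to raise before running make.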

ErikMarklund commented 2 years ago

C I know! Will try to change the macros to accommodate the titin sequences. Many thanks!

ErikMarklund commented 2 years ago

I will report back once I've tried it. Just need to wait a week or so until my current calculations end. Don't want to recompile mid-analysis.

ErikMarklund commented 2 years ago

Hi again,

I realised that the buffers were probably long enough, since they are 50000 by default and the sequences in question are about 32500 aa each. Because I reran my entire analysis using a larger reference database (UniRef90), I also reran these long sequences under the same conditions. This time I get another error: ERROR: Different numbers of elements in the profile data structure and the array of disordered region lengths, which occurs for all three proteins.
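The length estimate can be checked directly against the compiled-in limit. The following is a minimal sketch, assuming a single-record FASTA file; the function name fasta_length is hypothetical and not part of DISOPRED.

```shell
# Hypothetical helper: count residues in a single-record FASTA file.
# Header lines (starting with '>') are dropped and all whitespace is removed
# before counting characters.
fasta_length() {
    # $1: path to the FASTA file
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d ' '
}
```

If the number printed for a titin sequence is below the MAXSEQLEN value in the sources, the fixed buffers are unlikely to be the cause of the failure.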

It is not obvious to me what I can do to fix this, and I can live without these three proteins. But in case you are interested in digging deeper, the proteins in question have uniprot IDs A2ASS6, E9Q8K5, and E9Q8N1. I'd be happy to answer any questions about what I did to get this error, but I think it is pretty straightforward since I have not used anything unorthodox or modified anything.

psipred / disopred

Failing for long sequences #4