Closed lbwfff closed 1 year ago
Hi, you need to generate PSSM features with PSI-BLAST and intrinsic disorder features with IUPred from your own data.
Hi twopin, thank you for your reply. Are Peptide_Intrinsic_dict_v3 and Protein_Intrinsic_dict generated by the same process, with peptide and protein sequences as input respectively? I also ran into a problem while testing with a small file: after changing fasta_filename to my input FASTA file, I got the following error. How can I solve it?
(4, 16, 16)
(4, 0)
(4, 16, 16)
(4, 0)
Traceback (most recent call last):
File "step3_generate_features.py", line 57, in <module>
Intrinsic = raw_score_dict_long[key]
KeyError: '>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens OX=9606 GN=YWHAB PE=1 SV=3'
Thanks, LeeLee
My raw_score_dict seems to come out empty. Why is this happening? The following is one of the files output by my IUPred run. Is there any problem with this file?
# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS RES IUPRED2
1 M 0.9141
2 T 0.8713
3 M 0.8311
4 D 0.7458
5 K 0.6870
6 S 0.6650
7 E 0.6374
8 L 0.5711
9 V 0.5473
10 Q 0.5084
...........
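For checking a file like the one above, here is a minimal sketch of a reader for this output layout. It is not the parser used in step3_generate_features.py; the function name parse_iupred_text and the decision to skip every line that does not start with a position number are assumptions made for illustration.

```python
import numpy as np

def parse_iupred_text(text):
    """Parse IUPred-style output: '#' comment lines, then 'POS RES SCORE...' rows.

    Hypothetical helper for illustration, not part of the CAMP scripts.
    """
    residues, scores = [], []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header block.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Skip anything that is not a data row (e.g. a FASTA-style header line).
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        residues.append(fields[1])
        scores.append([float(x) for x in fields[2:]])
    return "".join(residues), np.array(scores)

sample = """# Prediction type: short
# POS RES IUPRED2
1 M 0.9141
2 T 0.8713
3 M 0.8311
"""
seq, scores = parse_iupred_text(sample)
print(seq, scores.shape)
```

If a reader like this returns rows for your file but the script's dictionary is still empty, the difference is more likely in how the script splits the file into per-sequence records than in the score lines themselves.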
Hi, here are two of the result files I got when I wrote this code one year ago (I'm not sure whether the output format has changed recently). The FASTA names for each sequence can be changed according to the input FASTA files. The original file names ended with ".result"; I changed them only for uploading, because GitHub doesn't recognize files ending with ".result". cm4_pep_long.txt cm4_pep_short.txt
Hi, thanks for your reply, that let me solve the problem. I still have a small question about this piece of code:
Intrinsic_score = {}
for seq in Intrinsic_score_short.keys():
    Intrinsic = Intrinsic_score_long[prot_seq][:,0]
    short_Intrinsic = Intrinsic_score_short[prot_seq]
    concat_Intrinsic = np.column_stack((long_Intrinsic,short_Intrinsic))
    Intrinsic_score[seq] = np.column_stack((long_Intrinsic,short_Intrinsic))
This raises NameError: name 'prot_seq' is not defined. Also, long_Intrinsic does not appear anywhere earlier in the code. I guess it should be Intrinsic?
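One possible reading of the snippet, offered as a guess and not as the author's confirmed fix: prot_seq was meant to be the loop variable seq, and the first assignment was meant to define long_Intrinsic. The function name combine_intrinsic, the dictionaries, and the array shapes below are toy assumptions used only to make the sketch runnable.

```python
import numpy as np

def combine_intrinsic(score_long, score_short):
    """Stack the first long-mode column next to the short-mode columns, per sequence."""
    combined = {}
    for seq in score_short:
        # Use the loop variable as the key in both dictionaries.
        long_intrinsic = score_long[seq][:, 0]
        short_intrinsic = score_short[seq]
        combined[seq] = np.column_stack((long_intrinsic, short_intrinsic))
    return combined

# Toy inputs (assumed shapes): one score row per residue.
seq = "MTMD"
score_long = {seq: np.array([[0.9], [0.8], [0.7], [0.6]])}
score_short = {seq: np.array([[0.91, 0.1], [0.87, 0.2], [0.83, 0.3], [0.75, 0.4]])}

combined = combine_intrinsic(score_long, score_short)
print(combined[seq].shape)
```

Both dictionaries need an entry for every sequence, with the same number of rows, or the lookup or the column_stack call will fail.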
Hi, I ran into another problem in preprocess_features.py. Here is the error:
(camp) leelee@ubuntu-PowerEdge-T440:~/tools/CAMP/testforadjust$ python -u preprocess_features.py test.tsv
test.tsv
num of peptides 3 pad_pep_len 50
seq_set 4 pad_prot_len 247
num of peptide ss 3 pad_pep_len 50
seq_ss_set 4 pad_prot_len 247
Traceback (most recent call last):
File "preprocess_features.py", line 137, in <module>
f = open(datafile)
NameError: name 'datafile' is not defined
I guess datafile here is equivalent to input_file, is that right? After changing datafile to input_file, I got the following error:
(camp) leelee@ubuntu-PowerEdge-T440:~/tools/CAMP/testforadjust$ python -u preprocess_features.py test.tsv
test.tsv
num of peptides 3 pad_pep_len 50
seq_set 4 pad_prot_len 247
num of peptide ss 3 pad_pep_len 50
seq_ss_set 4 pad_prot_len 247
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
Traceback (most recent call last):
File "preprocess_features.py", line 148, in <module>
feature = label_seq_ss(pep_ss, pad_pep_len, seq_ss_set)
File "preprocess_features.py", line 49, in label_seq_ss
X[i] = res_ind[res]
TypeError: 'set' object has no attribute '__getitem__'
How can I solve this problem? I look forward to your reply. Best wishes, LeeLee
It seems that the key 'MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN' is not in the dict? Actually I'm not sure about that, you can use the debug function to check.
So datafile refers to the test_filename in Data curation? That amino acid sequence is just something extra I printed. I think the error comes from this piece of code:
def label_seq_ss(line, pad_prot_len, res_ind):
    line = line.strip().split(',')
    X = np.zeros(pad_prot_len)
    for i ,res in enumerate(line[:pad_prot_len]):
        X[i] = res_ind[res]
    return X

if pep_ss not in peptide_ss_feature_dict:
    print(pep_ss)
    print(pad_pep_len)
    print(seq_ss_set)
    feature = label_seq_ss(pep_ss, pad_pep_len, seq_ss_set)
    peptide_ss_feature_dict[pep_ss] = feature
The following are my pep_ss, pad_pep_len and seq_ss_set:
"XC,YC,IE,QC,NC,CC,PC,LC,GC"
50
set(['"MC,VC,DC,RH,EH,QH,LH,VH,QH,KH,AH,RH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,AH,AH,MH,KH,NH,VH,TH,EC,LC,NC,EC,PC,LC,SC,NH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,SC,AC,DC,GC,NC,EH,KH,KH,IH,EH,MH,VH,RH,AH,YH,RH,EH,KH,IH,EH,KH,EH,LH,EH,AH,VH,CH,QH,DH,VH,LH,SH,LH,LH,DH,NH,YH,LH,IH,KH,NH,CC,SC,EC,TC,QC,YH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,YH,RH,YH,LH,AH,EH,VH,AC,TC,GC,EH,KH,RH,AH,TH,VH,VH,EH,SH,SH,EH,KH,AH,YH,SH,EH,AH,HH,EH,IH,SH,KH,EH,HH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,YH,SH,VH,FH,YH,YH,EH,IH,QC,NC,AC,PH,EH,QH,AH,CH,HH,LH,AH,KH,TH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,DC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,QC,QC,DC,DC,DC,GC,GC,EC,GC,NC,NC"', '"MC,GC,DC,RH,EH,QH,LH,LH,QH,RH,AH,RH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,SH,AH,MH,KH,AH,VH,TH,EH,LC,NC,EC,PC,LC,SC,NH,EH,DH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,MC,AC,DC,GC,NC,EH,KH,KH,LH,EH,KH,VH,KH,AH,YH,RH,EH,KH,IH,EH,KH,EH,LH,EH,TH,VH,CH,NH,DH,VH,LH,SH,LH,LH,DH,KH,FH,LH,IH,KC,NC,CC,NC,DC,FC,QC,YH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,YH,RH,YH,LH,AH,EH,VH,AC,SC,GC,EH,KH,KH,NH,SH,VH,VH,EH,AH,SH,EH,AH,AH,YH,KH,EH,AH,FH,EH,IH,SH,KH,EH,QH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,QC,NC,AC,PH,EH,QH,AH,CH,LH,LH,AH,KH,QH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,DC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,QC,QC,DC,EC,EC,AC,GC,EC,GC,NC"', 
'"MC,DC,DC,RH,EH,DH,LH,VH,YH,QH,AH,KH,LH,AH,EH,QH,AC,EC,RC,YH,DH,EH,MH,VH,EH,SH,MH,KH,KH,VH,AH,GC,MC,DC,VC,EC,LC,TC,VH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,IH,GH,AH,RH,RH,AH,SH,WH,RH,IH,IH,SH,SH,IH,EH,QH,KH,EH,EC,NC,KC,GC,GC,EH,DH,KH,LH,KH,MH,IH,RH,EH,YH,RH,QH,MH,VH,EH,TH,EH,LH,KH,LH,IH,CH,CH,DH,IH,LH,DH,VH,LH,DH,KH,HH,LH,IH,PH,AH,AC,NC,TC,GH,EH,SH,KH,VH,FH,YH,YH,KH,MH,KH,GH,DH,YH,HH,RH,YH,LH,AH,EH,FH,AC,TC,GC,NH,DH,RH,KH,EH,AH,AH,EH,NH,SH,LH,VH,AH,YH,KH,AH,AH,SH,DH,IH,AH,MH,TH,EH,LC,PC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,LC,NC,SC,PH,DH,RH,AH,CH,RH,LH,AH,KH,AH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,SC,EC,EC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,MC,QC,GC,DC,GC,EC,EC,QH,NC,KC,EH,AH,LH,QH,DC,VC,EC,DC,EC,NC,QC"', '"MC,TC,MC,DC,KH,SH,EH,LH,VH,QH,KH,AH,KH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,AH,AH,MH,KH,AH,VH,TH,EH,QC,GC,HC,EC,LC,SC,NH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,EC,RC,NC,EC,KH,KH,QH,QH,MH,GH,KH,EH,YH,RH,EH,KH,IH,EH,AH,EH,LH,QH,DH,IH,CH,NH,DH,VH,LH,EH,LH,LH,DH,KH,YH,LH,IH,PH,NH,AC,TC,QC,PH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,FH,RH,YH,LH,SH,EH,VC,AC,SC,GC,DH,NH,KH,QH,TH,TH,VH,SH,NH,SH,QH,QH,AH,YH,QH,EH,AH,FH,EH,IH,SH,KH,KH,EH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,LC,NC,SC,PH,EH,KH,AH,CH,SH,LH,AH,KH,TH,AH,FH,DH,EH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,EC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,EC,NC,QC,GC,DC,EC,GC,DC,AC,GC,EC,GC,EC,NC"'])
I think this bug is caused by the variable name seq_ss_set being used twice. I have just fixed the bug and revised the script.
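To illustrate the kind of name clash described here, this is a minimal sketch and not the revised script itself. It assumes label_seq_ss needs a dict that maps each residue/secondary-structure token to an integer, while the set of raw strings collected from the data is kept under a different name. The names ss_token_index and seen_protein_ss, and the vocabulary construction, are hypothetical. The sample pep_ss printed above carries literal double quotes, so the sketch strips them. Whether the real script needs that step is also an assumption.

```python
import numpy as np

def label_seq_ss(line, pad_len, res_ind):
    """Encode a comma-separated token string as a zero-padded vector of indices."""
    tokens = line.strip().strip('"').split(',')  # quote stripping is an assumption
    X = np.zeros(pad_len)
    for i, tok in enumerate(tokens[:pad_len]):
        X[i] = res_ind[tok]  # res_ind must support lookup by token, i.e. be a dict
    return X

# Hypothetical vocabulary: residue letter + secondary-structure class; 0 is padding.
ss_tokens = [aa + ss for aa in "ACDEFGHIKLMNPQRSTVWYX" for ss in "HEC"]
ss_token_index = {tok: i + 1 for i, tok in enumerate(ss_tokens)}

# The set of raw strings seen in the data lives under its own name.
seen_protein_ss = {'"MC,VC,DC,RH"', '"MC,GC,DC,RH"'}

pep_ss = '"XC,YC,IE,QC,NC,CC,PC,LC,GC"'
feature = label_seq_ss(pep_ss, 50, ss_token_index)
print(feature[:9])
```

Passing seen_protein_ss in place of ss_token_index reproduces the failure from the traceback. Python 2 reports it as 'set' object has no attribute '__getitem__', and Python 3 as 'set' object is not subscriptable.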
Hi, according to Data curation, I need to format the peptide-protein data as "protein sequence, peptide sequence, protein_ss, peptide_ss". In fact, if I understand correctly, preprocess_features.py also needs me to provide the Protein_pssm_dict, Protein_Intrinsic_dict and Peptide_Intrinsic_dict_v3 files. I can get the pssm_dict from step3_generate_features.py, but what are the other two files? Thanks, LeeLee