twopin / CAMP

predicting peptide-protein interactions

About Preprocess data #12

Closed lbwfff closed 1 year ago

lbwfff commented 2 years ago

Hi, According to the Data curation section, I need to format the peptide-protein data as "protein sequence, peptide sequence, protein_ss, peptide_ss". However, if I understand correctly, preprocess_features.py also needs the Protein_pssm_dict, Protein_Intrinsic_dict and Peptide_Intrinsic_dict_v3 files. I can generate pssm_dict with step3_generate_features.py, but how do I get the other two files? Thanks, LeeLee

twopin commented 2 years ago

Hi, you need to generate the PSSM features with PSI-BLAST and the intrinsic features with IUPred from your own data.

lbwfff commented 2 years ago

Hi twopin, thank you for your reply. Are Peptide_Intrinsic_dict_v3 and Protein_Intrinsic_dict generated by the same process, with peptide and protein sequences as the respective inputs? I also ran into a problem while testing with a small file: after changing fasta_filename to my input FASTA file, I got the following error. How can I solve it?

(4, 16, 16)
(4, 0)
(4, 16, 16)
(4, 0)
Traceback (most recent call last):
  File "step3_generate_features.py", line 57, in <module>
    Intrinsic = raw_score_dict_long[key]
KeyError: '>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens OX=9606 GN=YWHAB PE=1 SV=3'

Thanks, LeeLee

lbwfff commented 2 years ago

My raw_score_dict seems to come out empty. Why is this happening? Below is one of the files output by my IUPred run. Is there a problem with this file?

# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS   RES IUPRED2
1   M   0.9141
2   T   0.8713
3   M   0.8311
4   D   0.7458
5   K   0.6870
6   S   0.6650
7   E   0.6374
8   L   0.5711
9   V   0.5473
10  Q   0.5084
...........

twopin commented 2 years ago

https://github.com/twopin/CAMP/issues/12#issuecomment-992278927: Yes, Peptide_Intrinsic_dict_v3 and Protein_Intrinsic_dict are generated with the same code, just with different input files (one for proteins and one for peptides). The key in 'KeyError: '>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens OX=9606 GN=YWHAB PE=1 SV=3'' should be a FASTA name from your FASTA file. The error message indicates that your FASTA file contains a sequence whose FASTA name is not among the keys of raw_score_dict. I suspect that something is going wrong in the function 'extract_intrinsic_disorder'. You can run the function line by line and print the two dicts to check.
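
A quick way to run the check described above is to parse one IUPred result file by hand and compare the FASTA names against the dict keys. The sketch below is only an illustration and is not the repository's extract_intrinsic_disorder; the helper names parse_iupred_scores and missing_headers are hypothetical, and it assumes the IUPred output looks like the file pasted earlier in this thread ('#' comment lines followed by POS, RES, score columns).

```python
import numpy as np


def parse_iupred_scores(text):
    """Collect the score column from IUPred-style output text.

    Assumes '#' comment lines followed by whitespace-separated
    POS, RES, SCORE rows, as in the file pasted in this thread.
    """
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split()
        if len(fields) < 3:
            continue  # skip anything that is not a full data row
        scores.append(float(fields[2]))
    return np.array(scores)


def missing_headers(fasta_text, score_dict):
    """Return FASTA header lines (including '>') absent from score_dict."""
    headers = [
        line.strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]
    return [h for h in headers if h not in score_dict]
```

If missing_headers returns every header in the FASTA file, the score dict was most likely never filled, which would match the empty raw_score_dict reported above.
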
twopin commented 2 years ago

Hi, here are two of the result files I got when I wrote this code a year ago (I'm not sure whether the output format has changed recently). The FASTA names for each sequence will change according to the input FASTA files. The original file names ended with ".result"; I renamed them only for uploading, since GitHub doesn't recognize files ending with ".result". cm4_pep_long.txt cm4_pep_short.txt

lbwfff commented 2 years ago

Hi, thanks for your reply, it let me solve this problem. I still have a small question about this piece of code:

Intrinsic_score = {}
for seq in Intrinsic_score_short.keys():
    Intrinsic = Intrinsic_score_long[prot_seq][:,0]
    short_Intrinsic = Intrinsic_score_short[prot_seq]
    concat_Intrinsic = np.column_stack((long_Intrinsic,short_Intrinsic))
    Intrinsic_score[seq] = np.column_stack((long_Intrinsic,short_Intrinsic))

This raises NameError: name 'prot_seq' is not defined. Also, long_Intrinsic does not appear anywhere in the preceding code; I guess it should be Intrinsic?
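
For reference, here is a minimal sketch of what the loop might look like with the names made consistent. It assumes prot_seq was meant to be the loop variable seq and that Intrinsic was meant to be long_Intrinsic; this is a guess at the intent, not the repository's confirmed code, and the function name combine_intrinsic is hypothetical.

```python
import numpy as np


def combine_intrinsic(score_long, score_short):
    """Stack the first long-mode column next to the short-mode scores.

    Both arguments are dicts keyed by sequence; score_long values are
    assumed to be 2-D arrays, as implied by the [:, 0] indexing.
    """
    combined = {}
    for seq in score_short:
        long_intrinsic = score_long[seq][:, 0]  # first column only
        short_intrinsic = score_short[seq]
        combined[seq] = np.column_stack((long_intrinsic, short_intrinsic))
    return combined
```

Using a single loop variable for both lookups avoids the NameError, and naming the intermediate array long_intrinsic keeps the column_stack call consistent with the line above it.
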

lbwfff commented 2 years ago

Hi, I ran into another problem in preprocess_features.py. Here is the error:

(camp) leelee@ubuntu-PowerEdge-T440:~/tools/CAMP/testforadjust$ python -u preprocess_features.py test.tsv 
test.tsv
num of peptides 3 pad_pep_len 50
seq_set 4 pad_prot_len 247
num of peptide ss 3 pad_pep_len 50
seq_ss_set 4 pad_prot_len 247
Traceback (most recent call last):
  File "preprocess_features.py", line 137, in <module>
    f = open(datafile)
NameError: name 'datafile' is not defined

I guess datafile here is equivalent to input_file? After changing datafile to input_file, I got the following error:

(camp) leelee@ubuntu-PowerEdge-T440:~/tools/CAMP/testforadjust$ python -u preprocess_features.py test.tsv 
test.tsv
num of peptides 3 pad_pep_len 50
seq_set 4 pad_prot_len 247
num of peptide ss 3 pad_pep_len 50
seq_ss_set 4 pad_prot_len 247
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
Traceback (most recent call last):
  File "preprocess_features.py", line 148, in <module>
    feature = label_seq_ss(pep_ss, pad_pep_len, seq_ss_set)
  File "preprocess_features.py", line 49, in label_seq_ss
    X[i] = res_ind[res]
TypeError: 'set' object has no attribute '__getitem__'

How can I solve this problem? I look forward to your reply. Best wishes, LeeLee

twopin commented 2 years ago

It seems that the key 'MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN' is not in the dict? Actually I'm not sure about that, you can use the debug function to check.

twopin commented 2 years ago

https://github.com/twopin/CAMP/issues/12#issuecomment-993079930: Yes

lbwfff commented 2 years ago

So datafile refers to the test_filename in Data curation? The amino acid sequence is just something else I printed. I think the error comes from this piece of code:

def label_seq_ss(line, pad_prot_len, res_ind):
    line = line.strip().split(',')
    X = np.zeros(pad_prot_len)
    for i ,res in enumerate(line[:pad_prot_len]):
        X[i] = res_ind[res]
    return X

        if pep_ss not in peptide_ss_feature_dict:
            print(pep_ss)
            print(pad_pep_len)
            print(seq_ss_set)
            feature = label_seq_ss(pep_ss, pad_pep_len, seq_ss_set)
            peptide_ss_feature_dict[pep_ss] = feature

The following are my pep_ss, pad_pep_len and seq_ss_set:

"XC,YC,IE,QC,NC,CC,PC,LC,GC"
50
set(['"MC,VC,DC,RH,EH,QH,LH,VH,QH,KH,AH,RH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,AH,AH,MH,KH,NH,VH,TH,EC,LC,NC,EC,PC,LC,SC,NH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,SC,AC,DC,GC,NC,EH,KH,KH,IH,EH,MH,VH,RH,AH,YH,RH,EH,KH,IH,EH,KH,EH,LH,EH,AH,VH,CH,QH,DH,VH,LH,SH,LH,LH,DH,NH,YH,LH,IH,KH,NH,CC,SC,EC,TC,QC,YH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,YH,RH,YH,LH,AH,EH,VH,AC,TC,GC,EH,KH,RH,AH,TH,VH,VH,EH,SH,SH,EH,KH,AH,YH,SH,EH,AH,HH,EH,IH,SH,KH,EH,HH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,YH,SH,VH,FH,YH,YH,EH,IH,QC,NC,AC,PH,EH,QH,AH,CH,HH,LH,AH,KH,TH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,DC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,QC,QC,DC,DC,DC,GC,GC,EC,GC,NC,NC"', '"MC,GC,DC,RH,EH,QH,LH,LH,QH,RH,AH,RH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,SH,AH,MH,KH,AH,VH,TH,EH,LC,NC,EC,PC,LC,SC,NH,EH,DH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,MC,AC,DC,GC,NC,EH,KH,KH,LH,EH,KH,VH,KH,AH,YH,RH,EH,KH,IH,EH,KH,EH,LH,EH,TH,VH,CH,NH,DH,VH,LH,SH,LH,LH,DH,KH,FH,LH,IH,KC,NC,CC,NC,DC,FC,QC,YH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,YH,RH,YH,LH,AH,EH,VH,AC,SC,GC,EH,KH,KH,NH,SH,VH,VH,EH,AH,SH,EH,AH,AH,YH,KH,EH,AH,FH,EH,IH,SH,KH,EH,QH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,QC,NC,AC,PH,EH,QH,AH,CH,LH,LH,AH,KH,QH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,DC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,QC,QC,DC,EC,EC,AC,GC,EC,GC,NC"', 
'"MC,DC,DC,RH,EH,DH,LH,VH,YH,QH,AH,KH,LH,AH,EH,QH,AC,EC,RC,YH,DH,EH,MH,VH,EH,SH,MH,KH,KH,VH,AH,GC,MC,DC,VC,EC,LC,TC,VH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,IH,GH,AH,RH,RH,AH,SH,WH,RH,IH,IH,SH,SH,IH,EH,QH,KH,EH,EC,NC,KC,GC,GC,EH,DH,KH,LH,KH,MH,IH,RH,EH,YH,RH,QH,MH,VH,EH,TH,EH,LH,KH,LH,IH,CH,CH,DH,IH,LH,DH,VH,LH,DH,KH,HH,LH,IH,PH,AH,AC,NC,TC,GH,EH,SH,KH,VH,FH,YH,YH,KH,MH,KH,GH,DH,YH,HH,RH,YH,LH,AH,EH,FH,AC,TC,GC,NH,DH,RH,KH,EH,AH,AH,EH,NH,SH,LH,VH,AH,YH,KH,AH,AH,SH,DH,IH,AH,MH,TH,EH,LC,PC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,LC,NC,SC,PH,DH,RH,AH,CH,RH,LH,AH,KH,AH,AH,FH,DH,DH,AH,IH,AH,EC,LH,DH,TH,LC,SC,EC,EC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,DC,MC,QC,GC,DC,GC,EC,EC,QH,NC,KC,EH,AH,LH,QH,DC,VC,EC,DC,EC,NC,QC"', '"MC,TC,MC,DC,KH,SH,EH,LH,VH,QH,KH,AH,KH,LH,AH,EH,QH,AC,EC,RC,YH,DH,DH,MH,AH,AH,AH,MH,KH,AH,VH,TH,EH,QC,GC,HC,EC,LC,SC,NH,EH,EH,RH,NH,LH,LH,SH,VH,AH,YH,KH,NH,VH,VH,GH,AH,RH,RH,SH,SH,WH,RH,VH,IH,SH,SH,IH,EH,QH,KH,TC,EC,RC,NC,EC,KH,KH,QH,QH,MH,GH,KH,EH,YH,RH,EH,KH,IH,EH,AH,EH,LH,QH,DH,IH,CH,NH,DH,VH,LH,EH,LH,LH,DH,KH,YH,LH,IH,PH,NH,AC,TC,QC,PH,EH,SH,KH,VH,FH,YH,LH,KH,MH,KH,GH,DH,YH,FH,RH,YH,LH,SH,EH,VC,AC,SC,GC,DH,NH,KH,QH,TH,TH,VH,SH,NH,SH,QH,QH,AH,YH,QH,EH,AH,FH,EH,IH,SH,KH,KH,EH,MC,QC,PC,TC,HC,PH,IH,RH,LH,GH,LH,AH,LH,NH,FH,SH,VH,FH,YH,YH,EH,IH,LC,NC,SC,PH,EH,KH,AH,CH,SH,LH,AH,KH,TH,AH,FH,DH,EH,AH,IH,AH,EC,LH,DH,TH,LC,NC,EC,EC,SC,YH,KH,DH,SH,TH,LH,IH,MH,QH,LH,LH,RH,DH,NH,LH,TH,LH,WH,TC,SC,EC,NC,QC,GC,DC,EC,GC,DC,AC,GC,EC,GC,EC,NC"'])
twopin commented 1 year ago

I think this bug was caused by the variable name seq_ss_set being used twice. I have fixed the bug and revised the script.
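
To illustrate why the TypeError appears: label_seq_ss indexes res_ind by token, so res_ind has to be a mapping from tokens to integers, not a set of whole secondary-structure strings. The sketch below is an assumption-laden illustration and not the revised script. The helper build_token_index is hypothetical, the vocabulary is built from the data purely for demonstration (the repository's own index mapping may differ), and stripping the surrounding double quotes is an assumption based on the printed pep_ss above.

```python
import numpy as np


def build_token_index(ss_strings):
    """Map each comma-separated token to a positive integer (0 = padding)."""
    tokens = sorted(
        {tok for s in ss_strings for tok in s.strip().strip('"').split(",")}
    )
    return {tok: i + 1 for i, tok in enumerate(tokens)}


def label_seq_ss(line, pad_len, res_ind):
    """Encode one token string as a zero-padded vector of length pad_len."""
    tokens = line.strip().strip('"').split(",")
    X = np.zeros(pad_len)
    for i, tok in enumerate(tokens[:pad_len]):
        X[i] = res_ind[tok]  # res_ind must be a dict, not a set
    return X
```

Keeping the set of strings and the token-to-index dict under two distinct names makes it harder to pass the wrong object into label_seq_ss.
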