kexinhuang12345 / DeepPurpose

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)
https://doi.org/10.1093/bioinformatics/btaa1005
BSD 3-Clause "New" or "Revised" License
974 stars 272 forks

error in GetSequenceOrderCouplingNumber #27

Closed jchartove closed 4 years ago

jchartove commented 4 years ago

When using the Quasi-seq encoding on the BindingDB dataset, I ran into the following error:

Drug Target Interaction Prediction Mode...
in total: 1073803 drug-target pairs
encoding drug...
unique drugs: 549205
encoding protein...
unique target sequence: 5078

KeyError                                  Traceback (most recent call last)
in
      1 train, val, test = utils.data_process(X_drugs, X_targets, y,
      2                                 drug_encoding, target_encoding,
----> 3                                 split_method='cold_drug',frac=[0.7,0.1,0.2])

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in data_process(X_drug, X_target, y, drug_encoding, target_encoding, split_method, frac, random_seed, sample_frac, mode, X_drug_, X_target_)
    419     if DTI_flag:
    420         df_data = encode_drug(df_data, drug_encoding)
--> 421         df_data = encode_protein(df_data, target_encoding)
    422     elif DDI_flag:
    423         df_data = encode_drug(df_data, drug_encoding, 'SMILES 1', 'drug_encoding_1')

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in encode_protein(df_data, target_encoding, column_name, save_column_name)
    317         df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]
    318     elif target_encoding == 'Quasi-seq':
--> 319         AA = pd.Series(df_data[column_name].unique()).apply(GetQuasiSequenceOrder)
    320         AA_dict = dict(zip(df_data[column_name].unique(), AA))
    321         df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201
   4202             if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder(ProteinSequence, maxlag, weight)
   1908     """
   1909     result = dict()
-> 1910     result.update(GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, _Distance1))
   1911     result.update(GetQuasiSequenceOrder2SW(ProteinSequence, maxlag, weight, _Distance1))
   1912     result.update(

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, distancematrix)
   1794     for i in range(maxlag):
   1795         rightpart = rightpart + GetSequenceOrderCouplingNumber(
-> 1796             ProteinSequence, i + 1, distancematrix
   1797         )
   1798     AAC = GetAAComposition(ProteinSequence)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetSequenceOrderCouplingNumber(ProteinSequence, d, distancematrix)
   1601         temp1 = ProteinSequence[i]
   1602         temp2 = ProteinSequence[i + d]
-> 1603         tau = tau + math.pow(distancematrix[temp1 + temp2], 2)
   1604     return round(tau, 3)
   1605

KeyError: 'mg'

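The last frame fails on `distancematrix[temp1 + temp2]` with the key `'mg'`, which suggests that at least one target sequence contains the lowercase characters `m` and `g`. A rough sketch for locating such characters before encoding is below. It assumes the distance matrix is keyed only by pairs of the 20 standard uppercase one-letter amino-acid codes (not verified against `pybiomed_helper.py`), and `find_nonstandard_symbols` is a hypothetical helper, not part of DeepPurpose.

```python
from collections import Counter

# Assumption: only the 20 standard residues, as uppercase one-letter codes,
# are valid keys for the Quasi-seq distance matrix.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard_symbols(sequences):
    """Count every character that is not a standard uppercase residue."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in STANDARD_AA)
    return counts


# Hypothetical example sequences; the second mixes in lowercase letters and 'X'.
example = ["MKTAYIAKQR", "mgSLXQ"]
bad = find_nonstandard_symbols(example)
print(dict(bad))
```

Running the same check over the unique target sequences of the dataset would show which symbols the encoder is likely to reject.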
kexinhuang12345 commented 4 years ago

Hi Julia, good catch. I think that since Quasi-seq is a heuristic algorithm, it breaks when the input contains unexpected symbols. I just fixed it in the most recent commit.
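The thread does not show what the commit changed. For anyone on an older version, one possible workaround is to sanitize sequences before encoding. The sketch below assumes that lowercase letters are ordinary residues written in the wrong case and that any other symbol can simply be dropped, which may not be appropriate for every dataset; `clean_sequence` is a hypothetical helper, not a DeepPurpose function.

```python
# Assumption: the 20 standard residues, as uppercase one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(seq):
    """Uppercase the sequence and drop characters outside the standard residues."""
    return "".join(ch for ch in seq.upper() if ch in STANDARD_AA)


# Hypothetical input with lowercase letters, an 'X', and a trailing space.
cleaned = clean_sequence("mgSLXQ ")
print(cleaned)  # MGSLQ
```

It could be applied to the target list before calling `utils.data_process`, for example `X_targets = [clean_sequence(s) for s in X_targets]`. Note that dropping symbols shortens the sequence, so filtering out the affected drug-target pairs instead may be the safer choice.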

kexinhuang12345 commented 4 years ago

Closing for now! Please reopen if the issue persists.