incorrect sequence processing for PLATINUM database

jyaacoub commented 1 year ago

From commit https://github.com/jyaacoub/MutDTA/commit/0df4c92d9e5ff3274472b654da35bba169ee7fe2.

There is still an issue with how mutations are applied that needs to be addressed...

Instead of resetting the indices, we need to use the residue numbering found in the PDB file. Structures often have missing residues that are not included in the extracted sequence, and the mutation string is based on the PDB numbering, so reset indices can point at the wrong residue.
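A minimal sketch of the idea above, assuming the standard fixed-column layout of PDB ATOM records. The function name `residues_by_pdb_number` and the helper `_atom_line` are hypothetical and not part of MutDTA; insertion codes and alternate locations are ignored here.

```python
# One-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_by_pdb_number(pdb_lines, chain_id):
    """Map PDB residue number -> one-letter code for a single chain.

    Keeps the numbering from the file instead of re-indexing, so gaps
    left by unresolved residues remain visible. (Hypothetical helper.)
    """
    residues = {}
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        # Columns (0-based slices): atom name 12:16, residue name 17:20,
        # chain ID 21, residue sequence number 22:26.
        if line[21] != chain_id or line[12:16].strip() != "CA":
            continue
        res_num = int(line[22:26])
        residues[res_num] = THREE_TO_ONE.get(line[17:20], "X")
    return residues


def _atom_line(serial, res_name, chain, res_num):
    """Build a truncated CA ATOM record for the toy example below."""
    return f"ATOM  {serial:>5}  CA  {res_name} {chain}{res_num:>4}    "


# Toy chain A with residues 5, 6 and 9 resolved (7 and 8 are missing).
toy_pdb = [
    _atom_line(1, "MET", "A", 5),
    _atom_line(2, "LYS", "A", 6),
    _atom_line(3, "GLY", "A", 9),
]
residues = residues_by_pdb_number(toy_pdb, "A")

# With reset indices the glycine would sit at position 2 (0-based),
# but a mutation string written against the PDB refers to position 9.
print(residues)
```

Looking residues up by their PDB number means a mutation at position 9 still resolves to the glycine, even though two residues before it are absent from the structure.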

jyaacoub commented 1 year ago

Note that some PDB files do not match the mutation specified...

assert ref == ref_actual or ref_actual == mut, \
    'Reference does not match sequence at position ' + \
    f'{res_num}: {ref} != {ref_actual}'

I had to add "ref_actual == mut" to the above assertion to account for this. This means that the reference residue listed for PLATINUM might not be the residue actually present in the structure (the PDB file may already contain the mutated residue) -> verify that this is the case.
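One way to check this would be to tally how often each case occurs instead of letting the relaxed assertion pass silently. The sketch below is only an illustration: `classify_site` is a hypothetical helper, and the residue map and mutation tuples are toy data, not PLATINUM entries.

```python
from collections import Counter


def classify_site(residues, ref, res_num, mut):
    """Classify what the structure holds at a mutation site.

    `residues` maps PDB residue number -> one-letter code.
    Returns 'wildtype', 'already_mutated', 'missing' or 'mismatch'.
    (Hypothetical helper, not part of MutDTA.)
    """
    actual = residues.get(res_num)
    if actual is None:
        return "missing"          # residue not resolved in the structure
    if actual == ref:
        return "wildtype"         # structure agrees with the stated ref
    if actual == mut:
        return "already_mutated"  # the case the relaxed assertion lets through
    return "mismatch"             # neither ref nor mut: likely a numbering problem


# Toy data: (ref, res_num, mut) tuples checked against one residue map.
residues = {128: "A", 129: "G"}
mutations = [
    ("A", 128, "G"),  # matches ref
    ("T", 129, "G"),  # structure already has the mutant residue
    ("T", 130, "G"),  # position absent from the structure
    ("W", 128, "Y"),  # disagrees with both ref and mut
]
counts = Counter(classify_site(residues, *m) for m in mutations)
print(counts)
```

A large 'already_mutated' count would support the suspicion that the listed ref is not always the actual ref, while 'mismatch' entries would point back to numbering problems.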

Example of error if not accounted for:

jyaacoub commented 1 year ago

Downloading both the mutated and native structures doesn't help, since this just causes more headaches due to mismatching sequences across the different PDB files:

Code to download remaining mutated pdbs:

This code was used in the PlatinumDataset.download() fn.

        # Download the remaining PDBs from the PDB site (needs `os` and pandas as `pd`).
        # The missing PDBs are the non-wildtype ones, since wildtype is the default.
        # NOTE: some structures do not match the sequence length of the native structure;
        # to get around this, they are ignored and we just use the native structure.
        def is_missing_mutant(row):  # renamed to avoid shadowing the built-in filter()
            mt = row['mut.mt_pdb']
            return mt != 'NO' and not \
                os.path.isfile(f'../data/PlatinumDataset/raw/platinum_pdb/{mt}.pdb')
        df_raw = pd.read_csv(self.raw_paths[0])
        Downloader.download_PDBs(df_raw[df_raw.apply(is_missing_mutant, axis=1)]['mut.mt_pdb'],
                                 save_dir=self.raw_paths[1])
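To see how badly a native and a mutated structure disagree, one option is to compare their residue maps by PDB number and report every difference outside the intended mutation sites. This is a rough sketch with toy data; `unexpected_differences` is a hypothetical helper and assumes both structures share the same numbering scheme, which may not hold in practice.

```python
def unexpected_differences(native, mutant, mutated_positions):
    """List positions where two residue maps disagree outside the mutation sites.

    `native` and `mutant` map PDB residue number -> one-letter code.
    Returns (res_num, native_residue, mutant_residue) tuples, with None
    where a residue is absent from one structure. (Hypothetical helper.)
    """
    diffs = []
    for res_num in sorted(set(native) | set(mutant)):
        if res_num in mutated_positions:
            continue  # a difference here is expected
        native_res = native.get(res_num)
        mutant_res = mutant.get(res_num)
        if native_res != mutant_res:
            diffs.append((res_num, native_res, mutant_res))
    return diffs


# Toy example: the mutant lacks residue 1 and has an extra residue 5.
native = {1: "M", 2: "K", 3: "A", 4: "L"}
mutant = {2: "K", 3: "G", 4: "L", 5: "E"}
diffs = unexpected_differences(native, mutant, mutated_positions={3})
print(diffs)
```

An empty result would mean the two structures are interchangeable apart from the mutation itself; anything else is the kind of mismatch described above.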
