Improve davis and kiba MSA input files

A lot of the matching sequences contain >25% gaps, we should redo MSA with hhblits using >uniref30.

Code:

##############################################################################
##############################################################################
# MSA CLUSTERING ISSUES:
##############################################################################
##############################################################################
# %%
from src.utils.af_clust import AF_Clust
from pathlib import Path
import logging
logging.getLogger().setLevel(logging.INFO)

HOME = Path.home()

dir_p = f"{HOME}projects/colabfold"
pid = 'ABL1(E255K)'
pid2 = 'ABL1'
pid2 = 'ABL1_e250_gapd1'
pid2 = 'ABL1_e250_gapd1_pregap50'
pid2 = 'ABL1_e250_gapd1_pregap10'

their_p = f"{HOME}projects/data/davis/aln"
mine_p = f"{HOME}projects/misc/test_hhblits/"

#%%
print("DAVIS sample (MINE) -", pid2)
msa = f"{mine_p}/{pid2}.a3m"
af = AF_Clust(keyword="test-"+pid2, input_msa=msa, output_dir=f"{mine_p}/af_clust")

#%% davis (my alignments):
print("DAVIS sample (THEIRS) -", pid)
msa = f"{their_p}/{pid}.aln"
af = AF_Clust(keyword="test-"+pid, input_msa=msa, output_dir=f"{dir_p}/davis_a3m/test/theirs/")

#%%
print("DAVIS sample (THEIRS) -", pid2)
msa = f"{their_p}/{pid2}.aln"
af = AF_Clust(keyword="test-"+pid2, input_msa=msa, output_dir=f"{dir_p}/davis_a3m/test/theirs/")

#%% kiba
print("KIBA sample")
pid = 'O14920'
msa = f"{dir_p}/kiba_a3m/part1/{pid}.a3m"
af = AF_Clust(keyword="test-"+pid, input_msa=msa, output_dir=f"{dir_p}/davis_a3m/test/")

# %% PDBBind
print("PDBBIND sample")
pid = '1a1e'
msa = f"{dir_p}/pdbbind_a3m/{pid}.msa.a3m"
af = AF_Clust(keyword="test-"+pid, input_msa=msa, output_dir=f"{dir_p}/test_af_clust/")

# %%

Potential solution 0: try different parameters when running hhblits:

Outcome: FAIL Results shown below in comments next to files, but any parameter tweaking I did failed to help decrease the number of gaps in the sequences.

# %%
from src.utils.af_clust import AF_Clust
from pathlib import Path
import logging
logging.getLogger().setLevel(logging.DEBUG)

HOME = Path.home()
pid = 'ABL1'

tmp_p = f"{HOME}/projects/tmp/{pid}/"
                                                                                  # rem / total
msa_uniref30_2023     = f"{tmp_p}/{pid}.a3m"                                      #  2  / 4627
msa_uniref30_2023_v1  = f"{tmp_p}/{pid}_cov25.a3m"                                #  2  /  113
msa_uniref30_2023_v2  = f"{tmp_p}/{pid}_cov26_preeval200_pregap40.a3m"            #  2  /   95
msa_uniref30_2023_v3  = f"{tmp_p}/{pid}_cov26_preeval200_pregap80_maxfilt40k.a3m" #  2  /   74
msa_uniref30_2023_v4  = f"{tmp_p}/{pid}_cov40_preeval100_pregap160_maxfilt50k.a3m"#  1  /   29
msa_uniref30_2023_v5  = f"{tmp_p}/{pid}_preeval100_pregap160_maxfilt50k.a3m"      #  1  / 3461
msa_uniref30_2023_v6  = f"{tmp_p}/{pid}_preeval100_pregap160_maxfilt50k_qid25.a3m"#  1  / 3523
msa_uniref30_2023_v7  = f"{tmp_p}/{pid}_cov25_preeval50_pregap160_maxfilt50k.a3m" #  1  /   48
msa_uniref30_2023_v8  = f"{tmp_p}/{pid}_prepre_smax100.a3m"                       #  2  / 4772
msa_uniref30_2023_v9  = f"{tmp_p}/{pid}_preeval10.a3m"                            #  2  / 5359
msa_uniref30_2023_v10 = f"{tmp_p}/{pid}_preeval2k.a3m"                            #  2  / 4400 

af = AF_Clust(keyword="test-"+pid,
            input_msa=msa_uniref30_2023_v10,
            output_dir=f"{tmp_p}/af_clust")                                 

#%%
print("DAVIS sample -", pid)
print("UNICLUST30_2023:")
af = AF_Clust(keyword="test-"+pid, 
              input_msa=msa_uniref30_2023, 
              output_dir=f"{tmp_p}/af_clust")
#%%
print("UNICLUST30_2023_COV25:")
af = AF_Clust(keyword="test-"+pid, 
              input_msa=msa_uniref30_2023_v1, 
              output_dir=f"{tmp_p}/af_clust")
#%%
print("UNICLUST30_2023_COV26_PREEVAL200_PREGAP40:")
af = AF_Clust(keyword="test-"+pid,
            input_msa=msa_uniref30_2023_v2,
            output_dir=f"{tmp_p}/af_clust")

## CMD TO RUN HHBLITS ON /home/jean/projects/ABL1.fa
# hhblits -i /home/jean/projects/ABL1.fa -oa3m /home/jean/projects/ABL1.a3m -d /home/jean/projects/data/uniclust50_2018_08/uniclust50_2018_08 -n 3 

#%%  
from src.utils.seq_alignment import MSARunner
in_fp = f"{mine_p}/{pid}.fa"
out_fp = f"{mine_p}/{pid}.a3m"
MSARunner.hhblits(in_fp, out_fp,
                  dataset='/home/jean/projects/data/uniclust50_2018_08/uniclust50_2018_08',
                  return_cmd=True)
# %%

# Prefilter options
# -pre_gap_open is the gap opening penalty for the prefiltering step. (default is 20)
# filter option applied to query MSA:
# -cov is a threshold for the coverage of the alignment (default is 0)

# hhblits -i <FASTA> -oa3m <A3M> -d <DB> -cpu 8 -n 2 -pre_evalue_thresh 500 -gapd 0.5 -cov 25

Potential solution 1: Use larger sequencing database (e.g.: UniRef50)

Build my own UniRef50 cluster for hhblits?

No I will not build UniRef50 for hhblits
- Takes up too much time and memory to do so
- Too much effort to spend into learning how to do with little reward
  - Better to test how af_clust vs randomly sampled msa works with PDBbind and revisit this at a later time.
- Root problem is because AF_clust doesnt like gaps and MSAs for davis and kiba have a lot of gaps since they have a lot of matching sequences
If we really want a larger database the better option is to use alternatives like BFD which was created for AlphaFold and has data from multiple sources (we already have this installed on h4h)
- see https://bfd.mmseqs.com/
- Or we can update the UniClust30 database that we have
- UniClust30 (v2023_02) has 36,293,491 sequences in it previous version I was using (v2018_06) has only 25,985,124 sequences
- I already tested this dataset with a couple example proteins and didnt get any improvments when playing around with the parameters

jyaacoub / MutDTA

Improve davis and kiba MSA input files #78

Code:

Potential solution 0: try different parameters when running hhblits:

Potential solution 1: Use larger sequencing database (e.g.: UniRef50)

Build my own UniRef50 cluster for hhblits?