function extract_hmm_profile in alphafold_casp13

yamule commented 3 years ago

This is a bit old issue but in the hmm profile extraction function in alphafold_casp13/README.md, all of the values are calculated using the formula 2**(-float(t) / 1000.)

def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
  """Extracts information from the hmm file and replaces asterisks."""
  profile_part = hhm_file.split('#')[-1]
  profile_part = profile_part.split('\n')
  whole_profile = [i.split() for i in profile_part]
  # This part strips away the header and the footer.
  whole_profile = whole_profile[5:-2]
  gap_profile = np.zeros((len(sequence), 10))
  aa_profile = np.zeros((len(sequence), 20))
  count_aa = 0
  count_gap = 0
  for line_values in whole_profile:
    if len(line_values) == 23:
      # The first and the last values in line_values are metadata, skip them.
      for j, t in enumerate(line_values[2:-1]):
        aa_profile[count_aa, j] = (
            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
      count_aa += 1
    elif len(line_values) == 10:
      for j, t in enumerate(line_values):
        gap_profile[count_gap, j] = (
            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
      count_gap += 1
    elif not line_values:
      pass
    else:
      raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
                       'got %d'%(line_values, len(line_values)))
  hmm_profile = np.hstack([aa_profile, gap_profile])
  assert len(hmm_profile) == len(sequence)
  return hmm_profile

However, according to the hh-suite user guide https://github.com/soedinglab/hh-suite/releases/tag/userguide , "The three local diversities, Neff_M, Neff_I, and Neff_D are given in units of 0.001 (see next paragraph). "

Thus, I think it would be better to be as follows:

def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
  """Extracts information from the hmm file and replaces asterisks."""
  profile_part = hhm_file.split('#')[-1]
  profile_part = profile_part.split('\n')
  whole_profile = [i.split() for i in profile_part]
  # This part strips away the header and the footer.
  whole_profile = whole_profile[5:-2]
  gap_profile = np.zeros((len(sequence), 10))
  aa_profile = np.zeros((len(sequence), 20))
  count_aa = 0
  count_gap = 0
  for line_values in whole_profile:
    if len(line_values) == 23:
      # The first and the last values in line_values are metadata, skip them.
      for j, t in enumerate(line_values[2:-1]):
        aa_profile[count_aa, j] = (
            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
      count_aa += 1
    elif len(line_values) == 10:
      for j, t in enumerate(line_values):
        if j > 6:
          gap_profile[count_gap, j] = float(t) / 1000
        else:
          gap_profile[count_gap, j] = ( 
            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
      count_gap += 1 
    elif not line_values:
      pass
    else:
      raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
                       'got %d'%(line_values, len(line_values)))
  hmm_profile = np.hstack([aa_profile, gap_profile])
  assert len(hmm_profile) == len(sequence)
  return hmm_profile

Augustin-Zidek commented 3 years ago

Thank you, good catch! I will push a fix shortly.

I think the 3 Neff values are not being used so this shouldn't impact AlphaFold, but it is always good to have correctly implemented code snippets (that people might reuse).

yamule commented 3 years ago

Oh really? I think Neffs are quite important. They represent weather the profiles of those positions are reliable or not (in my opinion).

google-deepmind / deepmind-research

function extract_hmm_profile in alphafold_casp13 #200