scripts/pdbfilter.py sometimes keeps multiple chains of the same protein from one cluster.

Expected Behavior

When a PDB entry has multiple chains of the same protein, I expect the script to keep only one of the chains. For example, with the following input files:

cluster.tsv

4APC_A  4APC_A
4APC_A  4APC_B
4APC_A  4B9D_A
4APC_A  4B9D_B

pdb_filter.dat

#pdb_chain  resolution  r_free  completeness    method
4APC_A  2.1 0.248   0.837   X-RAY DIFFRACTION
4APC_B  2.1 0.248   0.846   X-RAY DIFFRACTION
4B9D_A  1.9 0.222   0.829   X-RAY DIFFRACTION
4B9D_B  1.9 0.222   0.843   X-RAY DIFFRACTION

I expected a single representative sequence in the output, but the script produced two: 4B9D_B and 4APC_B.

Current Behavior

For some clusters, the script outputs multiple chains instead of one representative.

Steps to Reproduce (for bugs)

pdbfilter_debug.zip

unzip pdbfilter_debug.zip
cd pdbfilter_debug
pdbfilter.py input.fas cluster.tsv pdb_filter.dat output.fas

Suggested debugging

Because all four chains are in the same cluster, I think they should yield a single representative sequence. I think this can be fixed by changing the selection logic to an if/elif/else chain, so that at most one entry is selected per cluster:

    if best_entry_res is not None:
        selected_sequences.add(best_entry_res)

        if DEBUG:
            print (' - Selected {n} (best resolution = {r}).'.format(
                n = best_entry_res,
                r = best_res))

    elif best_entry_rfr is not None:
        selected_sequences.add(best_entry_rfr)

        if DEBUG:
            print (' - Selected {n} (best R-free = {r}).'.format(
                n = best_entry_rfr,
                r = best_rfr))

    elif best_entry_comp is not None:
        selected_sequences.add(best_entry_comp)

        if DEBUG:
            print (' - Selected {n} (best completeness = {r}).'.format(
                n = best_entry_comp,
                r = best_comp))

    else:
        # Reached only when best_entry_res, best_entry_rfr and
        # best_entry_comp are all None.
        print ('! Warning: Did not find any representative entry for cluster {c}.'.format(
            c = cluster))
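
To make the expected behavior concrete, here is a minimal standalone sketch of "one representative per cluster", run on the example data above. It is not the actual pdbfilter.py code: the function names (`parse_clusters`, `parse_annotations`, `pick_representative`) are hypothetical, and the ranking order (lowest resolution, then lowest R-free, then highest completeness) is my assumption about what "best" should mean. It also assumes every numeric column parses as a float, which may not hold for real annotation files.

```python
from collections import defaultdict

CLUSTER_TSV = """\
4APC_A  4APC_A
4APC_A  4APC_B
4APC_A  4B9D_A
4APC_A  4B9D_B
"""

PDB_FILTER_DAT = """\
#pdb_chain  resolution  r_free  completeness    method
4APC_A  2.1 0.248   0.837   X-RAY DIFFRACTION
4APC_B  2.1 0.248   0.846   X-RAY DIFFRACTION
4B9D_A  1.9 0.222   0.829   X-RAY DIFFRACTION
4B9D_B  1.9 0.222   0.843   X-RAY DIFFRACTION
"""


def parse_clusters(text):
    """Map each cluster representative to the list of its member chains."""
    clusters = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        representative, member = line.split()
        clusters[representative].append(member)
    return clusters


def parse_annotations(text):
    """Map each chain to (resolution, r_free, completeness)."""
    annotations = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # The method column contains spaces, so split at most 4 times.
        chain, res, rfree, comp, _method = line.split(None, 4)
        annotations[chain] = (float(res), float(rfree), float(comp))
    return annotations


def pick_representative(members, annotations):
    """Return exactly one chain per cluster, or None if none is annotated.

    Assumed ranking: lower resolution first, then lower R-free,
    then higher completeness (negated so that min() prefers it).
    """
    candidates = [m for m in members if m in annotations]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda m: (annotations[m][0], annotations[m][1], -annotations[m][2]),
    )


clusters = parse_clusters(CLUSTER_TSV)
annotations = parse_annotations(PDB_FILTER_DAT)
selected = {
    cluster: pick_representative(members, annotations)
    for cluster, members in clusters.items()
}
print(selected)
```

With this ranking, the cluster 4APC_A yields only 4B9D_B: both 4B9D chains tie on resolution and R-free, and 4B9D_B has the higher completeness. Using a single sort key instead of three independent "best" variables means two different chains can never both be selected for the same cluster.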