API description

More details about the API are described in the individual commits.

Substitution matrix

load_substitution_matrix: load a substitution matrix from a file
matrix_to_single_digit: convert a substitution matrix expressed as similarity scores to a single-digit distance matrix

Pair comparison of sequences

compute_score_by_position: substitution score, position by position
substitution_score: overall substitution score between two sequences

Multiple sequence comparison

distance_matrix: distance matrix between a batch of sequences
compare_to_first_sequence: compare all the sequences to the first one

Clustering

hclust: hierarchical clustering using R

I/O

write_fasta_entry: write a fasta entry (header + sequence) to an already open fasta file

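To make the pair comparison entries more concrete, here is a minimal sketch of how a per-position score and an overall score could relate. It is not PBlib's implementation: the toy matrix values, the helper names (`pair_score`, `score_by_position`, `overall_score`), and the assumption that the overall score is the sum of the per-position scores are all made up for illustration.

```python
# Hypothetical illustration; this is NOT PBlib's implementation.
# Toy similarity scores for a three-letter alphabet (made-up values).
TOY_MATRIX = {
    ('a', 'a'): 2, ('a', 'b'): -1, ('a', 'c'): -2,
    ('b', 'b'): 3, ('b', 'c'): 0,
    ('c', 'c'): 1,
}


def pair_score(x, y, matrix):
    """Look up a symmetric score, whichever order the pair is stored in."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def score_by_position(seq1, seq2, matrix):
    """Return one substitution score per aligned position."""
    if len(seq1) != len(seq2):
        raise ValueError('sequences must have the same length')
    return [pair_score(x, y, matrix) for x, y in zip(seq1, seq2)]


def overall_score(seq1, seq2, matrix):
    """Sum the per-position scores (assumed definition of the overall score)."""
    return sum(score_by_position(seq1, seq2, matrix))


print(score_by_position('abc', 'abb', TOY_MATRIX))  # [2, 3, 0]
print(overall_score('abc', 'abb', TOY_MATRIX))      # 5
```

The actual functions take a substitution matrix loaded with load_substitution_matrix; see the individual commits for their real signatures.
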
Examples

This pull request makes it possible to do clustering directly from the PBlib module.

import PBlib
import PDBlib
# Assign PB for all models
pb_seq = []
pdb = PDBlib.PDB('demo1/2LFU.pdb')
for chain in pdb.get_chains():
    dihedrals = chain.get_phi_psi_angles()
    pb_seq.append(PBlib.assign(dihedrals, PBlib.REFERENCES))
# Build the distance matrix
substitution_matrix = PBlib.load_substitution_matrix(PBlib.SUBSTITUTION_MATRIX_NAME)
distances = PBlib.distance_matrix(pb_seq, substitution_matrix)
# Do the clustering
cluster_id, medoid_id = PBlib.hclust(distances, nclusters=3)
print(cluster_id)

# Display the distance matrix
from matplotlib import pyplot
image = pyplot.imshow(distances, interpolation='none')
pyplot.colorbar(image)
pyplot.show()
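
The example above gets a medoid per cluster back from hclust. As a rough sketch of what a medoid is, the snippet below picks, for each cluster, the member with the smallest summed distance to the other members. This definition, the function name `find_medoids`, and the toy data are assumptions for illustration; the R-based hclust wrapper may compute its medoids differently.

```python
import numpy as np


def find_medoids(distances, cluster_id):
    """Map each cluster label to the index of its medoid.

    The medoid is taken here as the member with the smallest summed
    distance to the other members of its cluster (an assumption).
    """
    distances = np.asarray(distances, dtype=float)
    cluster_id = np.asarray(cluster_id)
    medoids = {}
    for label in np.unique(cluster_id):
        members = np.flatnonzero(cluster_id == label)
        # Sub-matrix restricted to the members of this cluster
        within = distances[np.ix_(members, members)]
        medoids[int(label)] = int(members[within.sum(axis=1).argmin()])
    return medoids


# Made-up 4x4 distance matrix and labels, for illustration only
toy_distances = [[0, 1, 4, 9],
                 [1, 0, 2, 8],
                 [4, 2, 0, 7],
                 [9, 8, 7, 0]]
toy_labels = [1, 1, 1, 2]
print(find_medoids(toy_distances, toy_labels))  # {1: 1, 2: 3}
```
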

[Figure: distance matrix between the PB sequences, shown as an image with a color bar]

The pull request also facilitates the use of other clustering tools.

from matplotlib import pyplot
from scipy.cluster.hierarchy import ward, dendrogram, fcluster
# Compute the linkage from the distance matrix we computed earlier
data_link = ward(distances)
# Display the dendrogram
dendrogram(data_link, labels=list(range(len(pb_seq))))
pyplot.xlabel('Samples')
pyplot.ylabel('Distance')
pyplot.show()
# Print the cluster IDs
print(fcluster(data_link, 3, criterion='maxclust'))
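
One point worth keeping in mind with SciPy: its linkage functions read a 2-D array as a set of observation vectors, and expect precomputed distances in condensed 1-D form. Assuming the distance matrix is square, symmetric, and has a zero diagonal, a sketch of clustering on the precomputed distances could look like the following. The toy matrix and the choice of average linkage are illustrative assumptions, not part of this pull request.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric distance matrix with a zero diagonal
toy_distances = np.array([[0.0, 1.0, 6.0, 7.0],
                          [1.0, 0.0, 5.0, 6.0],
                          [6.0, 5.0, 0.0, 2.0],
                          [7.0, 6.0, 2.0, 0.0]])

# squareform turns the square matrix into the condensed 1-D form
condensed = squareform(toy_distances)
print(condensed.shape)  # (6,)

# Average linkage is used here only as an example of a method that
# accepts arbitrary precomputed distances
toy_link = linkage(condensed, method='average')
toy_clusters = fcluster(toy_link, 2, criterion='maxclust')
# Samples 0 and 1 end up together, as do samples 2 and 3
print(toy_clusters)
```
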

[Figure: dendrogram of the samples, with distance on the vertical axis]

pierrepo commented 9 years ago

Okay. Nice work so far ;-)

jbarnoud commented 9 years ago

I think the pull request is ready for review. Since it has grown bigger, the review should be extra careful. In particular, it is worth checking the commit messages in case I misunderstood anything.

Also, I updated the pull request description.

alexdb27 commented 9 years ago

Hello,

This is quite a lot of work, and it seems relevant. I would like to be sure of a few things. Does it give the same results as the previous R version? How does it behave on big datasets (the same problem we had with the R version)? If that is a problem, could we limit the number of snapshots used?

jbarnoud commented 9 years ago


I did not change the logic. It is still the R version and it behaves the same as before. Changing any behavior is a next step that requires the code to be more modular. I'll have a look at various hclust implementations (R, scipy, other Python modules) as soon as I have time.

I am not sure hierarchical clustering can deal with large datasets, as it needs the whole distance matrix. If you have more specific comments on that subject, could you open a new issue?

pierrepo / PBxplore

modularize PBclust #56
