Closed jbarnoud closed 9 years ago
Okay. Nice work so far ;-)
I think the pull request is ready for review. Since the pull request has grown quite large, the review should be extra careful. In particular, it is worth checking the commit messages in case I misunderstood anything.
Also, I updated the pull request description.
Hello,
This is quite a huge piece of work, and it seems quite relevant. I would like to be sure of two things. Does it provide the same results as the previous R version? How does it work on big datasets? If it has the same problem we had with the R version, could we have a limit on the number of snapshots used?
On 01/05/15 11:21, Alexandre G. de Brevern wrote:
> Hello,
> This is quite a huge piece of work, and it seems quite relevant. I would like to be sure of two things. Does it provide the same results as the previous R version? How does it work on big datasets? If it has the same problem we had with the R version, could we have a limit on the number of snapshots used?
I did not change the logic. It is still the R version and it behaves the same as before. Changing any behavior is a next step that requires the code to be more modular. I'll have a look at various hclust implementations (R, scipy, other Python modules) as soon as I have time.
I am not sure hierarchical clustering can deal with large datasets, as it needs the whole distance matrix. If you have more precise comments on that subject, could you open a new issue?
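To put rough numbers on the distance matrix concern: the matrix grows quadratically with the number of sequences. The sketch below is only a back-of-the-envelope estimate. The helper name `distance_matrix_bytes` and the default of 8 bytes per value (double precision) are illustrative assumptions, not something taken from `PBclust.py` or from R.

```python
def distance_matrix_bytes(n_sequences, bytes_per_value=8, condensed=False):
    """Estimate the memory needed to hold all pairwise distances.

    Assumes one value of `bytes_per_value` bytes per stored distance.
    With condensed=True only the upper triangle (n * (n - 1) / 2 values)
    is counted, otherwise the full n * n square matrix.
    """
    if condensed:
        n_values = n_sequences * (n_sequences - 1) // 2
    else:
        n_values = n_sequences * n_sequences
    return n_values * bytes_per_value


# Print the estimate for a few hypothetical dataset sizes.
for n in (1_000, 10_000, 100_000):
    gib = distance_matrix_bytes(n) / 1024 ** 3
    print(f"{n:>7} sequences: about {gib:.2f} GiB for the full matrix")
```

Under these assumptions, ten times more snapshots means a hundred times more memory, and 100,000 sequences already need 80 GB for the full square matrix.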
This pull request splits `PBclust.py` into functions and exposes an API. See #25.

API description

More details about the API are described in the individual commits.

Substitution matrix

- `load_substitution_matrix`: load a substitution matrix from a file
- `matrix_to_single_digit`: convert a substitution matrix expressed as similarity scores to a single-digit distance matrix

Pair comparison of sequences

- `compute_score_by_position`: substitution score, position by position
- `substitution_score`: overall substitution score between two sequences

Multiple sequence comparison

- `distance_matrix`: distance matrix between a batch of sequences
- `compare_to_first_sequence`: compare all the sequences to the first one

Clustering

- `hclust`: hierarchical clustering using R

I/O

- `write_fasta_entry`: write a fasta entry (header + sequence) in an already open fasta file

Examples
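A minimal, self-contained sketch of how the pair comparison and multiple sequence comparison steps relate to each other. It does not call PBlib: the function names mirror the API list, but the signatures, the toy three-letter alphabet, and the matrix values are assumptions made for illustration. It also assumes that the overall score is the sum of the per-position scores.

```python
# Toy substitution matrix expressed as distances (0 = identical letters).
# The alphabet and the values are made up for this illustration.
TOY_MATRIX = {
    ("a", "a"): 0, ("a", "b"): 1, ("a", "c"): 3,
    ("b", "a"): 1, ("b", "b"): 0, ("b", "c"): 2,
    ("c", "a"): 3, ("c", "b"): 2, ("c", "c"): 0,
}


def compute_score_by_position(matrix, seq1, seq2):
    """Return the substitution score at each position of two sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [matrix[(x, y)] for x, y in zip(seq1, seq2)]


def substitution_score(matrix, seq1, seq2):
    """Overall score, assumed here to be the sum over all positions."""
    return sum(compute_score_by_position(matrix, seq1, seq2))


def distance_matrix(matrix, sequences):
    """Square list of lists with the score of every pair of sequences."""
    return [[substitution_score(matrix, s1, s2) for s2 in sequences]
            for s1 in sequences]


def compare_to_first_sequence(matrix, sequences):
    """Score of every sequence against the first one."""
    reference = sequences[0]
    return [substitution_score(matrix, reference, seq) for seq in sequences]


sequences = ["abca", "abcc", "ccca"]
distances = distance_matrix(TOY_MATRIX, sequences)
for row in distances:
    print(row)
```

In this sketch, the first row of the distance matrix is the same list that `compare_to_first_sequence` returns, which is why the latter is the cheaper choice when only a reference comparison is needed.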
This pull request makes it possible to run clustering from the PBlib module.
It also makes it easier to use other clustering tools.
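To illustrate the second point: once the distances are available as plain data, any clustering routine that accepts a square distance matrix can consume them. The sketch below is a naive average-linkage agglomerative clustering written in pure Python. It is not the R `hclust` call used by this pull request, and the function name `average_linkage_clusters`, its signature, and the toy matrix are hypothetical.

```python
def average_linkage_clusters(distances, n_clusters):
    """Naive agglomerative clustering on a square distance matrix.

    Starts with one cluster per item and repeatedly merges the two
    clusters with the smallest average pairwise distance until
    `n_clusters` remain. Returns a list of sorted index lists.
    """
    clusters = [[i] for i in range(len(distances))]
    if not 1 <= n_clusters <= len(clusters):
        raise ValueError("n_clusters must be between 1 and the number of items")

    def average_distance(c1, c2):
        total = sum(distances[i][j] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a; b > a, so index a stays valid.
        clusters[a] = sorted(clusters[a] + clusters.pop(b))
    return clusters


# Toy distance matrix for four items, made up for the example.
toy_distances = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
]
print(average_linkage_clusters(toy_distances, 2))
```

This version recomputes every cluster-to-cluster distance at each merge, so it is only meant to show the data flow. A real run would rely on an optimized implementation.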