poke1024 / pyalign

Fast and Versatile Alignments for Python
MIT License
47 stars, 6 forks

Using vectors instead of characters #3

Closed: pseudo-rnd-thoughts closed this issue 2 years ago

pseudo-rnd-thoughts commented 2 years ago

I would like to use this outside of bioinformatics, where each "character" is a vector (np.ndarray) and a distance function computes the "distance" between vectors. All your examples use strings, so I was wondering whether this is possible with pyalign?

poke1024 commented 2 years ago

Yes, this is possible. Using pyalign.problems.general, you can pass in any distance or similarity function. Here is an example snippet that computes an alignment between words, where each word is represented by an embedding vector and word similarity is the cosine similarity between those vectors:

import numpy as np
import spacy
from numpy.linalg import norm

import pyalign

# compute some word embeddings (one vector per token)
nlp = spacy.load("en_core_web_md")
a = np.array([x.vector for x in nlp("old books and newer manuscripts")])
b = np.array([x.vector for x in nlp("recent writings")])

# similarity between two embedding vectors; higher means more alike
def cosine_sim(u, v):
    return np.dot(u, v) / (norm(u) * norm(v))

# solve alignment
pf = pyalign.problems.general(
    cosine_sim,
    direction="maximize")

solver = pyalign.solve.GlobalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(0.2),
    codomain=pyalign.solve.Solution)

problem = pf.new_problem(a, b)

solution = solver.solve(problem)
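
To make concrete what a global alignment over vectors computes, here is a minimal pure-NumPy sketch of a Needleman-Wunsch style recurrence with a custom similarity function and a linear gap cost. This is not pyalign's implementation, and the function name global_align, its return format, and the toy vectors are assumptions made up for illustration; it only mirrors the idea of maximizing summed similarity minus gap penalties.

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def global_align(a, b, sim, gap_cost=0.2):
    """Hypothetical helper: global alignment of two vector sequences.

    Maximizes the sum of sim() over matched pairs minus gap_cost per gap.
    Returns (score, pairs), where pairs lists matched (i, j) index pairs.
    """
    n, m = len(a), len(b)
    # score[i, j] = best score for aligning a[:i] with b[:j]
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = -gap_cost * np.arange(n + 1)
    score[0, :] = -gap_cost * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(
                score[i - 1, j - 1] + sim(a[i - 1], b[j - 1]),  # match pair
                score[i - 1, j] - gap_cost,  # a[i-1] aligned to a gap
                score[i, j - 1] - gap_cost,  # b[j-1] aligned to a gap
            )

    # Trace back from the bottom-right cell to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1, j - 1] + sim(a[i - 1], b[j - 1])
        if np.isclose(score[i, j], diag):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(score[i, j], score[i - 1, j] - gap_cost):
            i -= 1
        else:
            j -= 1
    return float(score[n, m]), pairs[::-1]


# Toy 2-D "embeddings" standing in for real word vectors.
a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([[1.0, 0.0], [1.0, 1.0]])
score, pairs = global_align(a, b, cosine_sim)
```

On this toy input the middle vector of a is left unmatched, because pairing it with anything in b would score lower than paying one gap.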

If you pass in a distance function (instead of an affinity as above), you would use:

pf = pyalign.problems.general(
    some_distance_func,
    direction="minimize")
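
As one possible stand-in for some_distance_func, a Euclidean distance between two vectors could look like the sketch below. The name euclidean_distance is a hypothetical choice for illustration and is not part of pyalign; any function that takes two elements and returns a number where smaller means more alike should fit the same slot.

```python
import numpy as np


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two 1-D vectors; smaller means more alike."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))


# Assumed usage, following the snippet above (not executed here):
# pf = pyalign.problems.general(euclidean_distance, direction="minimize")

d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```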

pseudo-rnd-thoughts commented 2 years ago

Amazing, thanks