nlp-rs / typediff

Provides edit distance, delta vectors between 2 words, and word transformation
Apache License 2.0

feature: Allow alignment between protein or nucleotide sequences #16

Closed notalfredo closed 1 year ago

notalfredo commented 1 year ago

This can be done with the Needleman–Wunsch algorithm. As the title mentions, it's an algorithm that lets you align protein or nucleotide sequences. It will live in its own file, following the project's existing structure.
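For context, here is a minimal sketch of the scoring half of Needleman–Wunsch (global alignment), keeping only two rows of the dynamic-programming table. The function name and the match/mismatch/gap values are assumptions for illustration; nothing here comes from the project's code, and a real implementation would also need a traceback to recover the alignment itself.

```rust
/// Score of the best global alignment of `a` and `b` (Needleman–Wunsch).
/// `match_score`, `mismatch`, and `gap` are illustrative parameters, not
/// values taken from this project.
fn needleman_wunsch_score(
    a: &[u8],
    b: &[u8],
    match_score: i32,
    mismatch: i32,
    gap: i32,
) -> i32 {
    let cols = b.len() + 1;
    // Row 0: aligning the empty prefix of `a` with b[..j] costs j gaps.
    let mut prev: Vec<i32> = (0..cols as i32).map(|j| j * gap).collect();
    let mut curr = vec![0i32; cols];

    for i in 1..=a.len() {
        // Column 0: aligning a[..i] with the empty prefix of `b` costs i gaps.
        curr[0] = i as i32 * gap;
        for j in 1..=b.len() {
            let sub = if a[i - 1] == b[j - 1] { match_score } else { mismatch };
            let diagonal = prev[j - 1] + sub; // align a[i-1] with b[j-1]
            let up = prev[j] + gap; // a[i-1] aligned with a gap
            let left = curr[j - 1] + gap; // b[j-1] aligned with a gap
            curr[j] = diagonal.max(up).max(left);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    // After the final swap, `prev` holds the last computed row.
    prev[b.len()]
}
```

Calling `needleman_wunsch_score(b"ACGT", b"AGT", 1, -1, -1)` scores three matches and one gap. Only the score is returned here; storing the full table would be needed to print the aligned sequences.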

neoncitylights commented 1 year ago

Thanks for submitting! It's an interesting idea, and it's definitely a use case for using the Levenshtein distance algorithm. Is this algorithm purely for biology?

From the perspective of a library user (not a developer), the Hamming and Levenshtein distance algorithms have a wide range of applications. That includes biology, but not solely biology. Ideally, a library should only ship what will be used. Those two algorithms are (at least right now) the main focus, whereas Needleman-Wunsch is biology-focused.

I do like the idea though, and I think it'd make more sense if we turn this repository into a monorepo of related crates. We can do this by using a "Cargo workspace" (https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html). If you look at the GitHub repository for the serde crate, there are multiple crates in there, like the main serde crate, serde_derive, and serde_derive_internals (a crate purely internal for developers).

I think we can do something like this, except with all crates living in a /crates directory. So, we could have something like:

In the future, having a workspace would also make room for a crate like semantic_differ (an example/placeholder name). I think you remember us talking about this: it would do semantic-like diffing, computing the difference between two words in a linguistic manner. E.g., "were" and "was" are technically 1/4 similar, but they're just two differences. Another is "person" and "people", which would give low-ish scores even though the words are semantically similar; one is just the plural of the other.
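One way to arrive at the 1/4 figure, assuming it refers to Levenshtein distance normalized by the longer word's length, is sketched below. Both function names and the normalization are assumptions for illustration and are not the crate's API.

```rust
/// Levenshtein distance over Unicode scalar values, using two DP rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // Row 0: turning the empty string into b[..j] takes j insertions.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];

    for i in 1..=a.len() {
        // Column 0: turning a[..i] into the empty string takes i deletions.
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j - 1] + cost) // substitute (or keep) a character
                .min(prev[j] + 1) // delete from `a`
                .min(curr[j - 1] + 1); // insert into `a`
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical normalization: 1 - distance / length of the longer word.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are treated as identical
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

Under this assumption, "were" and "was" are three edits apart over a longest length of four, so `similarity("were", "was")` is 0.25. "person" and "people" are four edits apart over a length of six, giving roughly 0.33, which is the kind of low score a purely character-based measure produces for semantically close words.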

neoncitylights commented 1 year ago

If this is something you're interested in, then we should first create an issue to set up the repository as a monorepo, and then we can create a crate for the Needleman-Wunsch algorithm.

notalfredo commented 1 year ago

As of right now, there are two algorithms I would like to implement in bio_diff:

Both have to do with aligning protein or nucleotide sequences. Each algorithm will have its own file, similar to how differ is structured. I am thinking of reusing the same enums and structs, except that I plan on implementing the memory optimizations from issue #25 from the start.

neoncitylights commented 1 year ago

I am thinking of reusing the same enums and structs, except that I plan on implementing the memory optimizations from issue https://github.com/nlp-rs/differ.rs/issues/25 from the start.

By the way, I mentioned earlier you can have a crate as a dependency for another crate :) So, you can have bio_diff depend on the differ crate. By doing this, you won't have to re-implement anything.

notalfredo commented 1 year ago

I am thinking of reusing the same enums and structs, except that I plan on implementing the memory optimizations from issue #25 from the start.

By the way, I mentioned earlier you can have a crate as a dependency for another crate :) So, you can have bio_diff depend on the differ crate. By doing this, you won't have to re-implement anything.

If a crate depends on another crate, does this have any performance downsides? Also, if bio_diff depends on differ, does the user just have access to bio_diff, or also to differ?

neoncitylights commented 1 year ago

If a crate depends on another crate, does this have any performance downsides?

No performance downsides here. Think of it this way: if bio_diff didn't depend on differ and a user used both libraries, both would carry duplicate code, and that would be the performance downside. It'd also be a burden on the developers to maintain the duplicated code.

Also, if bio_diff depends on differ, does the user just have access to bio_diff, or also to differ?

They'd just have access to bio_diff, but the user can also specify differ as an explicit dependency. When a project's dependencies share a common dependency, Cargo's dependency resolver picks a single compatible version where it can, so the shared crate isn't built into the binary twice and the binary stays as small as possible. So this is not a worry. :) There's an official page on this, which is a longer read, if you want to learn more about the internal details: https://doc.rust-lang.org/cargo/reference/resolver.html

neoncitylights commented 1 year ago

Declining for now; see #42 and #43. This can be written in a separate repository.