Closed notalfredo closed 1 year ago
Thanks for submitting! It's an interesting idea, and it's definitely a use case for using the Levenshtein distance algorithm. Is this algorithm purely for biology?
From the perspective of a library user (not developer), the Hamming & Levenshtein distance algorithms have a wide range of applications. This includes biology, but it's not solely biology. Ideally, a library should only ship what will be used. Those 2 algorithms (at least as of right now) are the main focus, whereas Needleman-Wunsch is biology-focused.
I do like the idea though, and I think it'd make better sense if we turn this repository into a monorepo of related crates. We can do this by using a "Cargo workspace" (https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html). If you look at the GitHub repository for the serde crate, there are multiple crates in there, like the main serde crate, serde_derive, and serde_derive_internals (a crate purely internal for developers).
I think we can do something like this, except move it to where all crates are in a /crates directory. So, we could have something like:
differ: Library for just the pure distance/similarity algorithms
needleman_wunsch: Library for the Needleman-Wunsch algorithm, which can have differ as a dependency (if it needs it)
And then in the future, having a workspace would also make way for a crate like semantic_differ (example/placeholder name). I think you remember us talking about this: it would be semantic-like diffing, which can compute the difference between two words in a linguistic manner. E.g. "were" and "was" are technically 1/4 similar, but they're just two differences. Another is "person" and "people", which would give low-ish scores even though semantically they're similar; it just became plural.
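To make the "1/4 similar" figure concrete, here is a minimal standalone sketch of a Levenshtein-based similarity score. The function names and the choice to normalize by the longer word's length are assumptions for illustration, not the differ crate's actual API.

```rust
/// Levenshtein (edit) distance between two strings, counted in `char`s.
/// Standalone illustration only; this is not the differ crate's API.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j chars of `b`; only two rows are kept in memory.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j - 1] + cost) // substitution (or match)
                .min(prev[j] + 1) // deletion
                .min(curr[j - 1] + 1); // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Similarity in [0, 1], assuming normalization by the longer input.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are treated as identical
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

Under these assumptions, "were" and "was" are 3 edits apart out of 4 characters, which gives the 1/4 score; a semantic crate would presumably need something other than character edits to rate the pair as close.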
If this is something you're interested in, then we should create an issue first to set up the repository as a monorepo, and then we can create a crate for the Needleman-Wunsch algorithm.
As of right now there are two algorithms I would like to implement on bio_diff that being
Both have to do with aligning protein or nucleotide sequences. Each algorithm will have its own file, similar to how differ is structured. I am thinking of reusing the same enums and structs, EXCEPT I plan on implementing the memory optimizations from issue #25 from the start.
I am thinking of reusing the same enums and structs, EXCEPT I plan on implementing the memory optimizations from issue https://github.com/nlp-rs/differ.rs/issues/25 from the start.
By the way, I mentioned earlier you can have a crate as a dependency for another crate :) So, you can have bio_diff depend on the differ crate. By doing this, you won't have to re-implement anything.
If a crate depends on another crate, does this have any performance downsides? Also, if bio_diff depends on differ, does the user just have access to bio_diff, or also differ?
If a crate depends on another crate, does this have any performance downsides?
No performance downsides here. Think of it this way: if bio_diff didn't depend on differ and a user used both libraries, the duplicated code in the two libraries would be the performance downside. It'd also be a burden on the developer to maintain duplicate code.
Also, if bio_diff depends on differ, does the user just have access to bio_diff, or also differ?
They'd just have access to bio_diff, but the user can specify differ as an explicit dependency. Cargo has a feature called dependency resolution: when a project has common dependencies with compatible versions, they are resolved to a single shared copy, which keeps the binary size as small as possible, so this is not a worry. :) There's an official page on this which is a longer read, if you want to learn more about the internal details: https://doc.rust-lang.org/cargo/reference/resolver.html
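As a rough sketch of that visibility rule, the two modules below stand in for the two crates. Everything here is hypothetical: the `hamming` and `point_mutations` functions are made up for illustration and are not the real API of differ or bio_diff.

```rust
// Stand-in for the differ crate (hypothetical API).
mod differ {
    /// Hamming distance; `None` when the inputs differ in length.
    pub fn hamming(a: &str, b: &str) -> Option<usize> {
        if a.chars().count() != b.chars().count() {
            return None;
        }
        Some(a.chars().zip(b.chars()).filter(|(x, y)| x != y).count())
    }
}

// Stand-in for the bio_diff crate (hypothetical API).
mod bio_diff {
    // Private import: code outside this module cannot write
    // `bio_diff::differ::hamming`. Changing this line to
    // `pub use super::differ;` would re-export it to users.
    use super::differ;

    /// Counts positions where two equal-length sequences differ,
    /// reusing differ's implementation instead of duplicating it.
    pub fn point_mutations(a: &str, b: &str) -> Option<usize> {
        differ::hamming(a, b)
    }
}
```

A caller who only uses `bio_diff` sees `bio_diff::point_mutations`. To call `differ::hamming` directly, they would name `differ` themselves, which for real crates corresponds to adding it as an explicit dependency.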
Declining for now; see #42, #43. This can be written inside a separate repository.
This can be done with the Needleman–Wunsch algorithm. As the title mentions, it's an algorithm that allows you to align protein or nucleotide sequences. This algorithm will be in its own file to follow the standard of the project.
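For reference, a minimal sketch of the scoring half of Needleman–Wunsch is shown below. The function name, the byte-wise comparison (which assumes ASCII sequence symbols), and the linear gap penalty are all assumptions for illustration. It keeps only two rows of the table, which is a generic space saving and not necessarily the optimization described in issue #25.

```rust
/// Optimal global alignment score (Needleman-Wunsch) with a linear gap penalty.
/// Hypothetical signature; compares inputs byte by byte, so it assumes
/// ASCII sequence symbols such as A, C, G, T.
fn needleman_wunsch_score(a: &str, b: &str, match_score: i32, mismatch: i32, gap: i32) -> i32 {
    let a = a.as_bytes();
    let b = b.as_bytes();
    // Row 0: aligning an empty prefix of `a` with j symbols of `b` takes j gaps.
    let mut prev: Vec<i32> = (0..=b.len() as i32).map(|j| j * gap).collect();
    let mut curr = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        // Column 0: aligning i symbols of `a` with nothing takes i gaps.
        curr[0] = i as i32 * gap;
        for j in 1..=b.len() {
            let pair = if a[i - 1] == b[j - 1] { match_score } else { mismatch };
            let diag = prev[j - 1] + pair; // align a[i-1] with b[j-1]
            let up = prev[j] + gap; // gap in `b`
            let left = curr[j - 1] + gap; // gap in `a`
            curr[j] = diag.max(up).max(left);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

This returns only the score. Recovering the alignment itself needs a traceback, which requires either the full table or a divide-and-conquer scheme such as Hirschberg's algorithm.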