General architecture feedback

rth commented 5 years ago

@rth awesome, I'm all about collaboration; I'm going to check out the package after work today!

Great, thank you @jbowles! Feel free to write any general comments you have about this project here (or in any of the related issues).

To give you some background, I have been working on topics related to CountVectorizer / HashingVectorizer in scikit-learn for a few years, and this project originated as an attempt to make those faster. A few things got added along the way. I'm fairly new to Rust, so general feedback about the architecture of this crate would be very welcome. In particular, adding more common traits per module would probably be good (I started some of that work in #48). Some of it was also limited by the fact that I wanted to make a thin PyO3 wrapper to expose the functionality in Python, which adds some constraints (e.g. https://github.com/rth/vtext/pull/48#issuecomment-488223434).
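To make the "common traits per module" idea concrete, here is a minimal sketch of what a shared tokenizer trait could look like. Everything in it is an assumption for illustration: the trait name `Tokenizer`, the `tokenize` signature, the two implementing types and the `count_tokens` helper are hypothetical and are not the actual vtext API. Returning a boxed iterator is only one possible design and may not be the best fit for the PyO3 wrapper.

```rust
/// Hypothetical common trait for a tokenizer module (not the actual vtext API).
pub trait Tokenizer {
    /// Split `text` into tokens that borrow from the input string.
    fn tokenize<'a>(&'a self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

/// Illustrative implementation: splits on Unicode whitespace only.
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize<'a>(&'a self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        Box::new(text.split_whitespace())
    }
}

/// Illustrative implementation: splits on a single caller-chosen character
/// and drops empty tokens.
pub struct CharSplitTokenizer {
    pub separator: char,
}

impl Tokenizer for CharSplitTokenizer {
    fn tokenize<'a>(&'a self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        let sep = self.separator;
        Box::new(text.split(sep).filter(|t| !t.is_empty()))
    }
}

/// Code written against the trait works with any implementation.
pub fn count_tokens(tokenizer: &dyn Tokenizer, text: &str) -> usize {
    tokenizer.tokenize(text).count()
}
```

With something like this, downstream code (a vectorizer, for example) could accept `&dyn Tokenizer` or a generic `T: Tokenizer` instead of depending on one concrete tokenizer type.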

For tokenization, one thing I saw was that the unicode-segmentation crate tokenizes text almost exactly as expected for NLP applications, with a few exceptions. The nice thing about it is that it's language independent and based on the Unicode spec, which removes the need to maintain a large number of regexp / custom rules. To improve the F1 score for tokenization on the UD treebank, a few custom rules are additionally applied.
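The two-stage idea described above (generic segmentation first, then a small number of custom rules) can be sketched with the standard library alone. This is not the vtext implementation and does not implement Unicode word boundaries: `char::is_alphanumeric` is used here as a rough stand-in for the unicode-segmentation crate, and the hyphen rule is a hypothetical example of a custom rule, not one taken from the project.

```rust
/// Stage 1: language-independent segmentation into byte spans.
/// Stand-in for Unicode word boundaries: each alphanumeric run is one token,
/// every other non-whitespace character is its own token.
fn base_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut word_start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            if word_start.is_none() {
                word_start = Some(i);
            }
        } else {
            // A non-alphanumeric character closes any open word.
            if let Some(start) = word_start.take() {
                spans.push((start, i));
            }
            if !c.is_whitespace() {
                spans.push((i, i + c.len_utf8()));
            }
        }
    }
    if let Some(start) = word_start {
        spans.push((start, text.len()));
    }
    spans
}

/// Stage 2: hypothetical custom rule that re-joins `word`, `-`, `word`
/// into a single token when the three spans touch each other in the input.
fn merge_hyphenated(text: &str, spans: &[(usize, usize)]) -> Vec<(usize, usize)> {
    let mut out: Vec<(usize, usize)> = Vec::new();
    let mut i = 0;
    while i < spans.len() {
        let (start, end) = spans[i];
        let is_hyphen = &text[start..end] == "-";
        // The previous emitted token must end right before the hyphen
        // and end with an alphanumeric character.
        let prev_ok = out.last().map_or(false, |&(ps, pe)| {
            pe == start && text[ps..pe].chars().last().map_or(false, |c| c.is_alphanumeric())
        });
        // The next token must start right after the hyphen
        // and start with an alphanumeric character.
        let next_ok = spans.get(i + 1).map_or(false, |&(ns, ne)| {
            ns == end && text[ns..ne].chars().next().map_or(false, |c| c.is_alphanumeric())
        });
        if is_hyphen && prev_ok && next_ok {
            // Extend the previous token over the hyphen and the next word.
            if let Some(last) = out.last_mut() {
                last.1 = spans[i + 1].1;
            }
            i += 2;
        } else {
            out.push((start, end));
            i += 1;
        }
    }
    out
}

/// Run both stages and return tokens borrowed from the input.
pub fn tokenize(text: &str) -> Vec<&str> {
    let spans = merge_hyphenated(text, &base_spans(text));
    spans.into_iter().map(|(s, e)| &text[s..e]).collect()
}
```

Keeping the rules as a separate pass over spans means the base segmenter could be swapped for a Unicode-aware one without touching the rules, and each rule can be tested on its own.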

On the other hand, we can imagine other tokenizers. In particular, the fact that some tasks require custom processing is a valid point. I'm not sure how to make that easier.

I also found an implementation of the Punkt tokenizer: rust-punkt!

Yes, it looks quite good. Related issue: #51.

Generally, if I can do anything to make this collaboration easier, please let me know :)

rth commented 5 years ago

The string similarities are very close adaptations of the NLTK Python code, except for Dice similarity, which is relatively straightforward. There is probably some room for improvement, in particular for Levenshtein. Actually, I just discovered https://github.com/dguo/strsim-rs, which also covers most of these.
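For reference, here is a minimal, std-only sketch of two of the metrics mentioned above. It is not the vtext or NLTK code: Levenshtein uses the textbook two-row dynamic programming formulation, and Dice similarity is computed over sets of character bigrams, which is an assumption about the intended definition. The function names and the handling of strings shorter than two characters are also illustrative choices.

```rust
use std::collections::HashSet;

/// Levenshtein edit distance over Unicode scalar values,
/// using two rolling rows of the dynamic programming table.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Set of adjacent character pairs in `s`.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Dice similarity 2|X ∩ Y| / (|X| + |Y|) over character bigram sets.
pub fn dice_similarity(a: &str, b: &str) -> f64 {
    let (x, y) = (bigrams(a), bigrams(b));
    if x.is_empty() && y.is_empty() {
        // Assumption: with no bigrams on either side, fall back to exact equality.
        return if a == b { 1.0 } else { 0.0 };
    }
    let shared = x.intersection(&y).count();
    2.0 * shared as f64 / (x.len() + y.len()) as f64
}
```

The two-row version keeps memory proportional to the length of the second string instead of the full table, which is one of the simpler improvements available for Levenshtein.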

jbowles commented 5 years ago

@rth thanks for the welcome, I'll be going through the package this weekend and next week. If I'm productive enough I'll have some examples worked through (and maybe even a blog post!).

unicode-segmentation is awesome, it's one of the first things that drew me to Rust (IMO, tokenization needs to be general enough to handle any digital format. My background includes studying endangered indigenous languages that are highly "non-standard", so I've always come from a "disadvantaged" point in NLP... in that many of the simplifying assumptions made to get NLP systems running have always run counter to my intuition about how human languages work haha ;) ... anyway, my point is that I think Unicode parsers/tokenizers are the way to go)
I'm also fairly new to Rust (I got into it circa 2015, left, and came back recently... I also spend quite a bit of time in Go and Julia... so yeah, my sense of good Rust architecture is in a state of evolution proportional to my learning ... :) )
As for string metrics, strsim-rs looks nice.

rth / vtext

General architecture feedback #52