rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

General architecture feedback #52

Open rth opened 5 years ago

rth commented 5 years ago

> @rth awesome, I'm all about collaboration; I'm going to check out the package after work today!

Great, thank you @jbowles! Feel free to write any general comments you have about this project here (or in any of the related issues).

To give you some background, I have been working on topics related to CountVectorizer / HashingVectorizer in scikit-learn for a few years, and this project originated as an attempt to make those faster. A few things got added along the way. I'm fairly new to Rust, so general feedback about the architecture of this crate would be very welcome. In particular, adding more common traits per module would probably be good (I started some of that work in #48). Some of the design was also constrained by the fact that I wanted a thin PyO3 wrapper to expose the functionality in Python (e.g. https://github.com/rth/vtext/pull/48#issuecomment-488223434).

For tokenization, one thing I saw was that if one takes the unicode-segmentation crate, it will tokenize the text almost exactly as expected for NLP applications, with a few exceptions. The nice thing about it is that it's language independent and based on the Unicode spec, which removes the need to maintain a large number of regexp / custom rules. To improve the F1 score for tokenization on the UD treebank, a few custom rules are applied on top.

On the other hand, we can imagine other tokenizers. In particular, the fact that some tasks require custom processing is a valid point. I'm not sure how to make that easier.
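One possible direction, as a rough sketch only: a shared trait that every tokenizer implements, so that task-specific processing can be layered on as a wrapper around any base tokenizer. The names `Tokenizer`, `WhitespaceTokenizer` and `SplitTrailingPunct` below are hypothetical and are not vtext's actual API; the punctuation rule is just a stand-in for whatever custom processing a task needs.

```rust
/// Hypothetical common interface (an assumption, not vtext's real API).
pub trait Tokenizer {
    /// Split `text` into tokens that borrow from the input.
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Baseline tokenizer: split on Unicode whitespace.
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().collect()
    }
}

/// Wraps any tokenizer and applies one custom rule afterwards:
/// a trailing run of ASCII punctuation becomes its own token.
pub struct SplitTrailingPunct<T: Tokenizer> {
    pub inner: T,
}

impl<T: Tokenizer> Tokenizer for SplitTrailingPunct<T> {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        let mut out = Vec::new();
        for tok in self.inner.tokenize(text) {
            // Byte offset where the trailing punctuation run begins.
            let cut = tok
                .trim_end_matches(|c: char| c.is_ascii_punctuation())
                .len();
            if cut > 0 {
                out.push(&tok[..cut]);
            }
            if cut < tok.len() {
                out.push(&tok[cut..]);
            }
        }
        out
    }
}
```

With this shape, `SplitTrailingPunct { inner: WhitespaceTokenizer }.tokenize("Hello, world!")` yields `Hello`, `,`, `world`, `!`, and the same wrapper could sit on top of a Unicode-segmentation based tokenizer without changes. Whether this plays well with the PyO3 wrapper is an open question.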

> I also found an implementation of the Punkt tokenizer, rust-punkt!

Yes, it looks quite good. Related issue #51

Generally, if I can do anything to make this collaboration easier, please let me know :)

rth commented 5 years ago

For string similarities, these are very close adaptations of NLTK Python code, except for Dice similarity which is relatively straightforward. There is probably some room for improvement, in particular for Levenshtein. Actually, I just discovered https://github.com/dguo/strsim-rs which also covers most of these.
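For reference, here is roughly what two of these measures look like in plain Rust. This is a minimal sketch and not the code in vtext, NLTK or strsim-rs. It assumes Dice similarity is computed over sets of character bigrams, uses unit costs for insertion, deletion and substitution in Levenshtein, and picks its own convention for strings too short to have any bigrams.

```rust
use std::collections::HashSet;

/// Set of adjacent character pairs in `s`.
fn char_bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over character-bigram sets.
pub fn dice_similarity(a: &str, b: &str) -> f64 {
    let (x, y) = (char_bigrams(a), char_bigrams(b));
    if x.is_empty() && y.is_empty() {
        // Convention chosen for this sketch: strings with no bigrams
        // are similar only if they are identical.
        return if a == b { 1.0 } else { 0.0 };
    }
    let common = x.intersection(&y).count();
    2.0 * common as f64 / (x.len() + y.len()) as f64
}

/// Levenshtein distance with unit costs, keeping only two rows
/// of the dynamic-programming table.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)      // deletion from `a`
                .min(curr[j] + 1);         // insertion into `a`
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

The two-row version keeps memory proportional to the length of the second string, which is one of the easier improvements over a full-table implementation.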

jbowles commented 5 years ago

@rth thanks for the welcome, I'll be going through the package this weekend and next week. If I'm productive enough, I'll have some examples worked through (and maybe even a blog post!).