adapt CFRClassifier to dataset and `tech.ml`

simongray commented 3 years ago

Thank you for making a PR.

I do have some comments:

Stuff like_ bindings inside a let seem a little bit sloppy, looks like temporary code to me. Perhaps you could use doto where applicable or refactor it a bit when doto doesn't cut it?
I will also need some docstrings for the public fns at the very least. You've already made some comments about various parts of the code in this PR, so perhaps some of that could be refashioned into docstrings? I use the Markdown-like format where name-of-param is enclosed in backticks.
Code formatting should try to follow the rest of the codebase. I try to use the Clojure style guide with minor variations. Basically, I just use Cursive's autoformatting feature, with vertical alignment enabled. If you take a look at my code it shouldn't be too hard to see how I format stuff, since I'm pretty consistent about it.
The Rich Comment block is fine, but since I really have no experience with the tech.ml stack it would be nice if that testing code could be formalised as an actual runnable test, so that it is made clear what the right result is supposed to be. I know that I haven't really made any tests myself, but this project - so far - consists entirely of my own code so I know what it's supposed to be doing.
I don't think the clojurewerkz dependency is strictly necessary. If you need code for converting a map into props, I have made some already that could be moved to a util ns of some sort. But perhaps the clojurewerkz stuff works better than mine, in which case I should probably delete my own code and just use that instead, but I'm not sure it necessarily does. Will have to take a look at it and report back to you.

I can perhaps help you make some of the changes (props conversion and any final touches on formatting), but you will need to work a little bit more on it yourself first, before I am ready to accept it. Hope that is okay with you. If that's too bothersome, just put it in your own repo instead :)

behrica commented 3 years ago

Thanks for review and suggestions.

I addressed all of them, except the Code formatting, as I don't use Cursive, but emacs.

I made the properties function in datalinguist.clj public for the moment and it does work. Maybe you want to move this to a different ns. It works for me.

simongray commented 3 years ago

I took me a while to check out the changes. I just got a kid (he'll be 2 months tomorrow) so a lot of my sparetime is spent on him at the moment.

I made a lot of code style changes to make the code better fit in with the rest of the repo. The only major functional change is an introduction of a define-model! function, since I noticed that your code didn't provide a way to redefine the behaviour of the model using the options map (... without reaching into the source code).

I hope you agree with the changes. If you're ok with it, I'm ready to merge.

behrica commented 3 years ago

It is as well foreseen to add specific metadata, which would point to: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/crf/CRFClassifier.html and to a "user guide", if existing.

I have done this only for one model so far:

:smile.classification/gradient-tree-boost

https://github.com/scicloj/scicloj.ml.smile/blob/49a753c8ce6ae991b591a563f6a723c660ddb776/src/scicloj/ml/smile/classification.clj#L150

I would therefore propose to do the same here. The doc-string of the "train" method can maybe contain "pointers" to this as well

simongray commented 3 years ago

It is as well foreseen to add specific metadata, which would point to: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/crf/CRFClassifier.html and to a "user guide", if existing.

I have done this only for one model so far:

:smile.classification/gradient-tree-boost

https://github.com/scicloj/scicloj.ml.smile/blob/49a753c8ce6ae991b591a563f6a723c660ddb776/src/scicloj/ml/smile/classification.clj#L150

I would therefore propose to do the same here. The doc-string of the "train" method can maybe contain "pointers" to this as well

Sure, this all sounds good to me. Feel free to do it in a follow-up PR once this has been merged.

behrica commented 3 years ago

Seems all good to me. Please revert the "define.model!" changes and leave the todo for adding the options. I will do this soon.

simongray commented 3 years ago

I added a todo at the top of the crf namespace. Not sure what to write so I just referenced this PR.

I have reverted the change as you requested, although I did keep my renaming of the model name itself. Aside from :standford-nlp/crf being misspelled, the name of the library being wrapped is in fact Stanford CoreNLP, while StanfordNLP is the old name for the Python library that is now called Stanza. So I will insist on it being called either :corenlp/crf or :stanford-corenlp/crf since I think that is more proper (and CoreNLP is also how I refer to it elsewhere).

If you're OK with how everything looks, I suggest you press the merge button.

simongray / datalinguist

adapt CFRClassifier to dataset and `tech.ml` #6