Hi, I can help you out with replacing `compress`, but would you mind documenting that file a little bit, e.g. what type and shape `doc_vectors` is, etc.?
I could also help with the class-based implementation of the vectorizers, but for that I'd probably need to talk to you to see what exactly you want to accomplish.
Perhaps for this as well it might be better to wait until Friday. I can definitely push a better-documented script, but first we might want to sit down together over the git pushing and commit-history issues.
Hi Jeroen,
Could you push the last changes you made to the classifier.py file? If you could, try to remove all other code files currently in the repo that we don't need right now. I will have time to work on this during the week, so I could also have a look at the normalization issues that we were encountering last week.
Hi Enrique,
It should be there already! Have a look. :)
I encountered no more normalization issues after replacing all NaNs with zeros (which is indeed what they should have been; you were right that the issue was caused by a division error).
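For the record, the fix boils down to something like the following (the variable name is just illustrative, not the actual one in classifier.py):

```python
import numpy as np

# Illustrative only: doc_vectors stands in for whatever matrix produced the NaNs.
doc_vectors = np.array([[0.2, np.nan, 0.5],
                        [np.nan, 0.1, 0.4]])

# Division by zero during normalization yields NaN; since those cells should be
# zero anyway, replace them wholesale.
doc_vectors = np.nan_to_num(doc_vectors)  # NaN -> 0.0
```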
All vectorizers are now included in the pipeline. I think we are almost ready to roll as far as the classifier is concerned.
All the best,
Jeroen
Oh, I hadn't looked! Cool, thanks! I'll try to pack this into a class and write some code to make it easy to run experiments.
PS: I've done some cleanup of the files, etc.
Sure, go ahead! :) Thanks!
Hi Mike and Enrique,
I've put up some code for the Pipeline-GridSearch classifier (classifier.py). I'd be very grateful for any feedback and suggestions on changes.

I looked up what you mentioned earlier in the Agora, concerning the translation of boolean lists (`[True, False]`) into a list of features in a less expensive way than looping through them. I ended up doing it with the `compress` method in `itertools`. I recall you mentioned a numpy way of doing that as well, perhaps a more efficient one, since itertools still essentially loops in Python, no?
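For comparison, here is a minimal sketch of both approaches (the variable names are just illustrative, not the ones in classifier.py):

```python
from itertools import compress
import numpy as np

features = ['word_a', 'word_b', 'word_c', 'word_d']
mask = [True, False, True, False]

# itertools.compress avoids writing an explicit loop, but still iterates in Python
selected = list(compress(features, mask))              # ['word_a', 'word_c']

# numpy boolean indexing does the selection in compiled code, usually faster on large arrays
selected_np = np.asarray(features)[np.asarray(mask)]   # array(['word_a', 'word_c'], ...)
```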
Perhaps the most challenging problem, which I did not really get to work (at least not as implemented in one and the same pipeline), was to also grid-search over custom-made vectorization methods (e.g. Tfidf vectorization vs. Delta scores). I tried to adapt my own code to methods that follow the `sklearn` conventions (e.g. this link), but it proved really hard. This probably has to do with my somewhat lacking knowledge of classes in general, and especially my poor knowledge of how the sklearn source code is structured (which seems pretty crucial when I try to make my own custom code conform to its rules). Any advice here is more than welcome while I work on this a bit more (a rough sketch of what I mean is at the end of this message).

Should I also grid-search over other hyperparameters, such as other kinds of cross-validation, or accuracy metrics? By the way, in the meantime I have looked up the stratification of the folds. Apparently, stratification is set as a default in the grid search, so that is covered.
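Here is the rough sketch of the kind of class-based wrapper I have been trying to write, following the sklearn transformer conventions (`BaseEstimator` plus `TransformerMixin`); the `DeltaScorer` class and its internals are purely hypothetical placeholders, just to show where a custom vectorizer would slot into the pipeline and the grid search:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV


class DeltaScorer(BaseEstimator, TransformerMixin):
    """Hypothetical Delta-style vectorizer following the sklearn transformer API."""

    def __init__(self, n_features=100):
        # parameters must be stored unchanged in __init__ for get_params/set_params to work
        self.n_features = n_features

    def fit(self, X, y=None):
        # learn whatever statistics the scoring needs; placeholder here
        return self

    def transform(self, X):
        # must return an (n_samples, n_features) matrix; placeholder implementation
        return np.zeros((len(X), self.n_features))


pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),   # any transformer can occupy this slot
    ('classifier', LinearSVC()),
])

# Grid-searching over the vectorizer itself: list alternative objects for the step name.
param_grid = [
    {'vectorizer': [TfidfVectorizer()], 'vectorizer__max_features': [1000, 5000]},
    {'vectorizer': [DeltaScorer()], 'vectorizer__n_features': [100, 300]},
]

# With an integer cv and a classifier, GridSearchCV uses StratifiedKFold by default.
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
```

The trick, as far as I understand the sklearn docs, is that a pipeline step can itself be listed as a grid-search parameter, so the whole vectorizer gets swapped in and out alongside its own hyperparameters.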
Thanks!