emanjavacas opened this issue 7 years ago
@jedgusse could you take care of this? Shouldn't be too much work, just add these options to the pipe_grid_clf pipeline.
Yup!
I'm considering two ways of doing this. Which one do you prefer?
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer, Normalizer, StandardScaler

c_options = [1, 10, 100, 1000]
kernel_options = ['linear', 'poly', 'rbf', 'sigmoid']
analyzer_options = ['char_wb']
ngram_range_options = [(2, 2), (3, 3), (4, 4)]

param_grid = [
    # word n-grams with the vectorizers' default settings
    {
        'vectorizer': [CountVectorizer(), TfidfVectorizer()],
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],  # project's Delta transform
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
    # char n-grams of fixed sizes
    {
        'vectorizer': [CountVectorizer(), TfidfVectorizer()],
        'vectorizer__analyzer': analyzer_options,
        'vectorizer__ngram_range': ngram_range_options,
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
]
```
Or rather:
```python
# same imports as above; only these option lists are needed here
c_options = [1, 10, 100, 1000]
kernel_options = ['linear', 'poly', 'rbf', 'sigmoid']

param_grid = [
    {
        # bake the char_wb settings directly into the candidate vectorizers
        'vectorizer': [CountVectorizer(),
                       TfidfVectorizer(),
                       CountVectorizer(analyzer='char_wb', ngram_range=(2, 2)),
                       TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 2))],
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
]
```
The second one looks a bit cleaner in the code and will run faster, but it offers fewer n-gram options.
I've pushed the first one for now (since it leaves you the most options)! Let me know when you want to see it changed.
What about
```python
param_grid = [
    {
        'vectorizer': [CountVectorizer(), TfidfVectorizer()],
        'vectorizer__analyzer': analyzer_options,
        # a single range covering bigrams, trigrams and four-grams at once
        'vectorizer__ngram_range': [(2, 4)],
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
]
```
Mmmmh, I think we can pool all the ngrams together in the same CountVectorizer, especially if we include a top-k feature selection. What does @mikekestemont think?
An ngram_range of (2, 4) will give you a feature vector which includes bigrams, trigrams and four-grams in one and the same run. Moreover, in this grid you will also make word-grams besides char-grams, since an ngram_range is now automatically set for each run; that is why I think we cannot do without a separate grid within the grid. :)
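A quick toy check of that pooling behaviour, on a made-up one-document corpus (the example data is just for illustration):

```python
# Demonstrates that ngram_range=(2, 4) pools bigrams, trigrams and
# four-grams into a single vocabulary in one run.
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["lorem ipsum dolor sit amet"]  # hypothetical corpus
vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
vec.fit(toy_docs)
print(sorted({len(f) for f in vec.vocabulary_}))  # -> [2, 3, 4]
```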
The CountVectorizer() won't help much, so we can leave it out, I guess. Top feature selection, AFAIK, isn't very relevant for SVMs either.
And what about the standardized CountVectorizer()? In that case we also lose Delta in the experiment. Can standardized raw counts compete with TFIDF?
If they're standardized, it might help; but so far, the CountVectorizer() has never come out as the best option in the gridsearch, right?
Oops, there were a number of issues with my sloppy wording. I meant any vectorizer, not only the CountVectorizer (the ngram selection seems to work with whatever vectorizer, and judging by the current runs tfidf usually gets picked, so we could kick out the CountVectorizer). By top-k feature selection I meant the vocabulary selection mentioned in the issue description. Is it possible to include something like that in the pipeline right after the vectorization?
It seems the vectorizers also have a `max_features` option:

```
max_features : int or None, default=None
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.
    This parameter is ignored if vocabulary is not None.
```
It also seems that it will pick max_features based on term frequency, which means 2-grams, 3-grams and beyond won't get picked... Perhaps we can precompute some kind of χ²-based feature selection and pass it as vocabulary... but I don't know how easy that'd be without giving up the pipeline.
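For what it's worth, a χ² top-k selection can sit inside the pipeline itself via SelectKBest, so the grid-search machinery stays intact; a minimal sketch (step names mirror the param_grid keys above, the k value is arbitrary):

```python
# Sketch: chi2-based top-k feature selection as a pipeline step, so features
# are kept by class association rather than by raw term frequency.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),  # non-negative output, as chi2 requires
    ('feature_selection', SelectKBest(chi2, k=3000)),
    ('classifier', SVC()),
])
# k then becomes just another grid dimension:
# param_grid = {'feature_selection__k': [1000, 3000, 5000]}
```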
Yes the max_features parameter should be added! I'll make sure of that.
Throwing out the CountVectorizer() is only for the sake of speed, right? I think it still makes sense to at least report that we included it in preliminary runs of the GridSearch?
I just realized that it doesn't make sense to speak of corpus-wide max_features according to tfidf because tfidf is only defined at the document level... so I wonder what the best way is to do this (i.e. selecting top n elements in the vocabulary according to their importance for the classification and not just according to frequency).
@jedgusse Yeah, I think we can report that it didn't outperform the tfidf vectorizer.
I just noticed that adding the max_features parameter for the vectorizers takes ages to run on really small texts. Might it not make sense to trust the assumption that SVMs can take in a ton of features without any need for a max range or any form of dimensionality reduction?
Why would Tfidf be defined at the document level? In every vector the frequency of a word is normalized by how many times it has occurred over the entire corpus. What is, however, somewhat relevant here is that we might want to split up the authors' training corpus into smaller samples (in order to have a more granular document frequency, or even to be able to speak of a document frequency that differs per author). There is no sampling so far, right?
I would restrict the gridsearch to:
```
TfIdfVectorizer(use_idf=True|False, max_features=range(1000, 30000, stepsize=2000), norm='l1'|'l2')
```
All the scaling probably won't have an effect if we just use the SVM, so this will limit the possibilities enormously. CV can happen over 5 random splits or so.
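Spelled out as a concrete grid, that restriction might look like this (a sketch; note that `range` takes the step as its third positional argument, there is no `stepsize` keyword):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

c_options = [1, 10, 100, 1000]

param_grid = [
    {
        'vectorizer': [TfidfVectorizer()],
        'vectorizer__use_idf': [True, False],
        'vectorizer__max_features': list(range(1000, 30000, 2000)),
        'vectorizer__norm': ['l1', 'l2'],
        'classifier__C': c_options,
    },
]
```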
Works for me!
Hi, the current setup amounts to a total of 960 items in the grid, run 5 times for each CV and twice for both the normal classification and the data augmentation setting. This amounts to roughly 6h per experiment when run on all 12 cores of calc9. We need to cut down the search space. I propose removing kernels and perhaps making n_features_options smaller.
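For reference, the fit count behind that estimate:

```python
# 960 grid points x 5 CV folds x 2 settings (normal + data augmentation)
n_fits = 960 * 5 * 2
print(n_fits)  # 9600 pipeline fits per experiment
```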
I remember @mikekestemont mentioning that linear SVMs are generally the best-scoring kernels, and I believe that our experiments so far have always confirmed this.
I am leaving in the linear and the rbf kernels (for non-linear cases), but I think the major bottleneck is n_features_options.
I am gonna change it to 1000, 3000, 5000, 10000, 15000, 30000
If you're running it on 12 cores you might also want to change this in the grid, more specifically the parameter n_jobs, but you probably thought of that.
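Something along these lines (reusing `pipe` and `param_grid` from the snippets above):

```python
# n_jobs spreads the grid-search fits over CPU cores (n_jobs=-1 uses all of them)
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=12)
```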
Agreed.