jedgusse / project_lorenzo


Tweak pipeline (vectorizers) #12

Open emanjavacas opened 7 years ago

emanjavacas commented 7 years ago

@jedgusse could you take care of this? Shouldn't be too much work, just add these options to the pipe_grid_clf pipeline.
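
For orientation, a minimal sketch of the kind of pipeline the param_grid keys later in this thread assume; the step names ('vectorizer', 'feature_scaling', 'classifier') are inferred from those keys, not taken from the actual pipe_grid_clf code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step names must match the prefixes used in the param_grid keys below.
pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('feature_scaling', StandardScaler(with_mean=False)),  # with_mean=False keeps sparse input sparse
    ('classifier', SVC()),
])

# Placeholder grid; the fuller grids are discussed below.
param_grid = [{'classifier__C': [1, 10, 100, 1000]}]
grid = GridSearchCV(pipe, param_grid, cv=5)
# grid.fit(train_texts, train_labels)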

jedgusse commented 7 years ago

Yup!

jedgusse commented 7 years ago

I'm considering two ways of doing this. Which one do you prefer?

c_options = [1, 10, 100, 1000]
kernel_options = ['linear', 'poly', 'rbf', 'sigmoid']
analyzer_options = ['char_wb']
ngram_range_options = [(2, 2), (3, 3), (4, 4)]

param_grid = [
    {
        'vectorizer': [CountVectorizer(), TfidfVectorizer()],
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
    {
        'vectorizer': [CountVectorizer(), TfidfVectorizer()],
        'vectorizer__analyzer': analyzer_options,
        'vectorizer__ngram_range': ngram_range_options,
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
]

Or rather:

c_options = [1, 10, 100, 1000]
kernel_options = ['linear', 'poly', 'rbf', 'sigmoid']
analyzer_options = ['char_wb']
ngram_range_options = [(2, 2), (3, 3), (4, 4)]

param_grid = [
    {
        'vectorizer': [CountVectorizer(),
                       TfidfVectorizer(),
                       CountVectorizer(analyzer='char_wb', ngram_range=(2, 2)),
                       TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 2))],
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
]

The second one looks a bit cleaner in the code and will run faster, but it offers fewer options in n-gram range.

jedgusse commented 7 years ago

I've pushed the first one for now (since it leaves you most options)! Let me know when you want to see it changed.

emanjavacas commented 7 years ago

What about

param_grid = [
    {
        'vectorizer': [CountVectorizer(), TfidfVectorizer()],
        'vectorizer__analyzer': analyzer_options,
        'vectorizer__ngram_range': [(2, 4)],
        'feature_scaling': [StandardScaler(),
                            Normalizer(),
                            FunctionTransformer(deltavectorizer)],
        'classifier__C': c_options,
        'classifier__kernel': kernel_options,
    },
]

Mmmmh, I think we can pool all the ngrams together in the same countvectorizer, especially if we include a topk feature selection. What does @mikekestemont think?

jedgusse commented 7 years ago

An ngram_range of (2, 4) will give you a feature vector which includes bigrams, trigrams and four-grams in one and the same run. Moreover, in this grid you will also make word grams besides char grams, since you have now set an ngram_range for every run automatically; that is why I think we cannot do without a separate grid within the grid. :)

mikekestemont commented 7 years ago

The CountVectorizer() won't help much, so we can leave it out, I guess. Top feature selection, AFAIK, isn't very relevant for SVMs either.


jedgusse commented 7 years ago

And what about the standardized CountVectorizer()? In that case we also lose Delta in the experiment. Can standardized raw counts compete with TFIDF?

mikekestemont commented 7 years ago

If they're standardized, it might help; but so far, the CountVectorizer() has never come out as the best option in the gridsearch, right?


emanjavacas commented 7 years ago

Oops, there were a number of issues with my sloppy wording. I meant vectorizers in general, not only the CountVectorizer (the ngram selection seems to work with whatever vectorizer, and judging by the current runs tfidf usually gets picked, so we could kick out the CountVectorizer). By top-k feature selection I meant the vocabulary selection mentioned in the issue description. Is it possible to include something like that in the pipeline right after the vectorization?
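
One standard way to do that (a sketch, not the project's actual code; step names and k are illustrative) is a SelectKBest step between the vectorizer and the classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))),
    # chi2 needs non-negative features, which tfidf counts are
    ('feature_selection', SelectKBest(chi2, k=5000)),
    ('classifier', SVC(kernel='linear')),
])
# k can then be grid-searched via the key 'feature_selection__k'.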

emanjavacas commented 7 years ago

It seems the vectorizers also have an option

max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.

It also seems that it will pick max_features based on term frequency, which means 2-grams, 3-grams, and beyond won't get picked... Perhaps we can precompute some kind of chi2 feature selection and pass it as vocabulary, but I don't know how easy that'd be without giving up the pipeline.
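
A sketch of that precomputed-vocabulary idea; train_texts, train_labels and the cutoff of 5000 are illustrative placeholders, not project code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2

# train_texts: list of str, train_labels: list of author ids (placeholders)
# Score every char ngram once against the labels...
counter = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = counter.fit_transform(train_texts)
scores, _ = chi2(X, train_labels)

# ...and freeze the best-scoring terms as a fixed vocabulary.
terms = np.asarray(counter.get_feature_names_out())  # get_feature_names() on older sklearn
vocabulary = terms[np.argsort(scores)[::-1][:5000]]

# The pipeline's vectorizer can then use this vocabulary; per the docs quoted
# above, max_features is ignored once vocabulary is set.
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4),
                             vocabulary=vocabulary)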

jedgusse commented 7 years ago

Yes the max_features parameter should be added! I'll make sure of that.
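
In the grid that is a single extra key per sub-grid, e.g. (values illustrative):

# Added to each dict in param_grid:
for params in param_grid:
    params['vectorizer__max_features'] = [3000, 5000, 10000, 30000]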

jedgusse commented 7 years ago

Throwing out the CountVectorizer() is only for the sake of speed, right? I think it still makes sense to at least report that we included it in preliminary runs of the GridSearch.

emanjavacas commented 7 years ago

I just realized that it doesn't make sense to speak of corpus-wide max_features according to tfidf because tfidf is only defined at the document level... so I wonder what the best way is to do this (i.e. selecting top n elements in the vocabulary according to their importance for the classification and not just according to frequency).

emanjavacas commented 7 years ago

@jedgusse Ye, I think we can report that it didn't outperform tfidf vectorizer

jedgusse commented 7 years ago

I just noticed that adding the max_features parameter for the vectorizers takes ages to run on really small texts. Might it not make sense to trust the assumption that SVMs can take in a ton of features without any need for a max range or any form of dimensionality reduction? And why would tfidf be defined at the document level? In every vector the frequency of a word is normalized by how many times it has occurred over the entire corpus. What is somewhat relevant here, however, is that we might want to split up the authors' training corpus into smaller samples (in order to have a more granular document frequency, or perhaps even to be able to speak of a document frequency that differs per author at all). There is no sampling so far, right?
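
A sketch of that sampling idea: split each author's text into fixed-size word chunks so that document frequencies become meaningful per sample. The chunk size and the (author, full_text) input format are assumptions:

def chunk_text(text, size=1000):
    """Split a text into consecutive samples of `size` words (drops the final partial chunk)."""
    words = text.split()
    return [' '.join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

samples, labels = [], []
for author, text in corpus:  # corpus: iterable of (author, full_text) pairs (placeholder)
    for sample in chunk_text(text):
        samples.append(sample)
        labels.append(author)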

mikekestemont commented 7 years ago

I would restrict the gridsearch to:

TfidfVectorizer(use_idf=True|False, max_features=range(1000, 30000, 2000), norm='l1'|'l2')

All the scaling probably won't have an effect if we just use the SVM, so this will limit the possibilities enormously. CV can happen over 5 random splits or so.
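
Spelled out as an sklearn grid (a sketch assuming the step names used earlier in this thread):

from sklearn.feature_extraction.text import TfidfVectorizer

param_grid = [{
    'vectorizer': [TfidfVectorizer()],
    'vectorizer__use_idf': [True, False],
    'vectorizer__max_features': list(range(1000, 30000, 2000)),  # 1000, 3000, ..., 29000
    'vectorizer__norm': ['l1', 'l2'],
    # classifier__C etc. can stay as before
}]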


jedgusse commented 7 years ago

Works for me!

emanjavacas commented 7 years ago

Hi, the current setup amounts to a total of 960 items in the grid, each run 5 times for cross-validation and twice for the normal classification and data augmentation settings. This amounts to roughly 6h per experiment when run on all 12 cores of calc9. We need to cut down the search space. I propose removing kernels and perhaps making n_features_options smaller.

jedgusse commented 7 years ago

I remember @mikekestemont mentioning that linear SVMs generally score best among the kernels, and I believe our experiments so far have always confirmed this.

emanjavacas commented 7 years ago

I am leaving in the linear and the rbf (for non-linear cases), but I think the major bottleneck is n_features_options.

emanjavacas commented 7 years ago

I am gonna change it to 1000, 3000, 5000, 10000, 15000, 30000

jedgusse commented 7 years ago

If you're running it on 12 cores you might also want to change this in the grid, more specifically the parameter n_jobs, but you probably thought of that.
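
Concretely, that is one keyword on the GridSearchCV call (pipe and param_grid as in the sketches above):

from sklearn.model_selection import GridSearchCV

# cv=5 random splits as agreed above; n_jobs=12 to use all 12 cores of calc9
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=12)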

mikekestemont commented 7 years ago

Agreed.
