sergioburdisso / pyss3

A Python package implementing a new interpretable machine learning model for text classification (with visualization tools for Explainable AI :octocat:)
https://pyss3.readthedocs.io
MIT License

Partial learn #15

Closed Slavenin closed 3 years ago

Slavenin commented 3 years ago

Hi! I have a dataset of 900k records with 800 categories, but I can't train my model because 16 GB of RAM is not enough. How can I train my model in parts?

angrymeir commented 3 years ago

Hi @Slavenin,

you can split up your training set and train on the parts sequentially. I created a small Gist (which you can run in Colab) that shows it doesn't make a difference if you just use the plain train function.
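
For illustration, here is a rough sketch of that idea (this is not the actual Gist; the chunk size and variable names are just placeholders):

from pyss3 import SS3

clf = SS3()

CHUNK_SIZE = 50_000  # pick a chunk size that fits in memory
for start in range(0, len(x_train), CHUNK_SIZE):
    # each call adds the new documents to the model built so far
    clf.train(x_train[start:start + CHUNK_SIZE],
              y_train[start:start + CHUNK_SIZE])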

However, I'm pretty sure that for more advanced training, such as hyperparameter search, this approach might not be applicable. Maybe @sergioburdisso could elaborate a bit on that 😇?

Slavenin commented 3 years ago

Thanks! It works. But I get an error when printing the categories (screenshot attached).

sergioburdisso commented 3 years ago

Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the print_category_info() function? Did the rest of the code work well?

You can also use the learn function to train the model "incrementally". By using "learn" instead of "train" you can speed up the process by disabling the model update that is automatically performed after each call to train, as illustrated below:

# Create a single "huge" document for each category by concatenating all of its
# documents, then call the learn function for each category using update=False:
clf.learn(huge_doc_cat_1, label_cat_1, update=False)
clf.learn(huge_doc_cat_2, label_cat_2, update=False)
...
clf.learn(huge_doc_cat_n, label_cat_n)  # <-- note that the last category shouldn't use update=False, so that the model is finally updated

Of course, you can use a loop to implement the above code; I wrote it that way just to make the explanation simpler.
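
For instance, assuming a dictionary docs_by_cat that maps each label to its concatenated "huge" document (an illustrative name, not something provided by the library), the loop version could look like this:

labels = list(docs_by_cat)
for i, label in enumerate(labels):
    # update the model only after the last category has been learned
    clf.learn(docs_by_cat[label], label, update=(i == len(labels) - 1))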

As pointed out by @angrymeir, when working with a big dataset it is better to perform hyperparameter optimization on a sub-sample of the dataset, for instance by using sklearn's stratified k-fold function and then working with just a single fold (subset) to optimize the model. (We use "stratified" here to make sure at least one sample of each category is included in each split; in fact, it will try to keep the same proportion of samples for each category in each training subset/split/fold.)
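
As a rough sketch of this sub-sampling idea (variable names are illustrative, not code from the library):

from sklearn.model_selection import StratifiedKFold

# 10 splits -> each subset holds roughly 10% of the data; note that stratification
# requires every category to have at least n_splits samples
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# keep only the first fold's held-out indices as the optimization subset
_, subset_idx = next(skf.split(x_train, y_train))
x_sub = [x_train[i] for i in subset_idx]
y_sub = [y_train[i] for i in subset_idx]
# ...then run the hyperparameter search on (x_sub, y_sub) only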

Nevertheless, it is on the "TODO" list to optimize the current source code so that it is robust with respect to the size of the dataset, especially the number of categories, for instance by using NumPy data structures (I have some work done in this regard, but there's still work left to do).

(Thanks, @angrymeir for your valuable help, you rock buddy! :muscle:).

angrymeir commented 3 years ago

Ah @sergioburdisso that makes sense!

@Slavenin If you can't provide the dataset, it might give some insight to see the output of the Vocab. Size per Category.
For that, you could just replace the following code block with: print(category[NAME], len(category[VOCAB]))

I could imagine that you have a sample s = (x, y) in your train set, where x = "" and s is the only sample in your dataset with label y, thus vocab.size(y) = 0. @sergioburdisso, is that possible?

Slavenin commented 3 years ago

> Hi @Slavenin! That's weird, what type of labels are you working with? It would be nice if we could replicate this error locally so that we can fix it. Is it just a problem with the print_category_info() function? Did the rest of the code work well?

The file names are the category IDs, simply numbers (screenshot attached). There are 798 objects in both the test and train folders.

> You can also use the learn function to train the model "incrementally". By using "learn" instead of "train" you can speed up the process by disabling the model update that is automatically performed after each call to train, as illustrated below:

learn works fine!

Slavenin commented 3 years ago

> Ah @sergioburdisso that makes sense!

> @Slavenin If you can't provide the dataset, it might give some insight to see the output of the Vocab. Size per Category. For that, you could just replace the following code block with: print(category[NAME], len(category[VOCAB]))

> I could imagine that you have a sample s = (x, y) in your train set, where x = "" and s is the only sample in your dataset with label y, thus vocab.size(y) = 0. @sergioburdisso, is that possible?

You're right. The output of the Vocab. Size per Category (screenshot attached):

But I do not understand how to fix that. I have a category with only one record.

angrymeir commented 3 years ago

As far as I understand it, there are two issues.

  1. You have a sample with the label 219 for which no n-grams have been learned. As pointed out above, one reason could be that this sample is empty; if the sample is empty, why even keep it in the dataset? Another option could be that the sample only contains characters that are not learned, e.g. punctuation (. , ! ?).

  2. One of the categories has only one record. This could introduce class-imbalance problems. But here too, without looking at your use case (e.g. hyperparameter optimisation or just using the default parameters) and the Vocab. Size distribution, it's hard to tell whether this will actually be a problem.

Slavenin commented 3 years ago
  1. No, this sample is not empty (screenshot attached).
  2. I want to exclude samples with fewer than n entries.

Does your lib work with any language?

angrymeir commented 3 years ago

I think @sergioburdisso can answer that much more competently :)

sergioburdisso commented 3 years ago

Hi @Slavenin!

> Does your lib work with any language?

Yes, the model works independently of the language being used. However, the default preprocessing function ignores characters outside the "standard" ones (a-zA-Z), so to prevent this behavior you should simply disable the default preprocessing by passing the prep=False argument to the train and predict functions, as follows:

clf.train(x_train, y_train, prep=False)
...
clf.predict(x_test, prep=False)

I've also just made a tiny update to the source code of the preprocessing function so that it considers all valid "word" characters (\w) instead of just the range a-zA-Z, and I've already released a new version (0.6.4) with this patch. So updating the package (pip install -U pyss3) should also solve the problem (basically, by default, the library should now work with any language).

Let us know if this solved your problem, and do not hesitate to re-open this issue in case it is needed.


Regarding the size of the dataset, I would like to point out two things:

  1. When working with big datasets, it is also better to perform hyperparameter optimization without n-grams, because the code is optimized (with NumPy) for the case n-grams = 1; when n-grams > 1 the library runs much slower.

  2. As mentioned by @angrymeir, it is recommended to perform some "data cleaning" before feeding the model, such as removing documents with very few words or categories with very few documents; the more "balanced" your data is across categories, the better. Categories with very few words will probably add noise to your final predictive model (see the sketch after this list).
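
As a minimal sketch of that kind of cleaning (the threshold and variable names are illustrative, not part of pyss3):

from collections import Counter

MIN_DOCS = 5  # illustrative minimum number of documents per category

counts = Counter(y_train)
x_clean, y_clean = [], []
for doc, label in zip(x_train, y_train):
    # drop empty documents and categories with too few documents
    if doc.strip() and counts[label] >= MIN_DOCS:
        x_clean.append(doc)
        y_clean.append(label)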

PS: I'm really sorry for the delay, I'm currently on vacation :sunglasses: in the countryside :chicken:, with very limited Internet access (and more importantly, very limited electrical power xD). Take care guys! :muscle: