online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License
4.99k stars 540 forks

Error when using OutputCodeClassifier for Code-size greater than 20 #1116

Closed Yasmen-Wahba closed 1 year ago

Yasmen-Wahba commented 1 year ago

Hello again,

I was testing the OCC classifier with more than 90 classes and the accuracy is very poor. I assume I need a large code size. I was testing different code sizes (starting with a code size of 10) and recording the accuracy, and when I reached a code size of 40 I received the following error: OverflowError: Python int too large to convert to C ssize_t. Is there a way we can modify the occ classifier to allow for compact codes (as short as possible) while still providing enough discriminating power between the different classes?
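
For reference on the "as short as possible" part of the question: a binary code of length k can distinguish at most 2^k classes, so the shortest code that still gives every class its own codeword has ceil(log2(n_classes)) bits. The sketch below only computes that lower bound. `min_code_size` is a hypothetical helper, not part of River, and longer codes are normally needed in practice so that codewords differ in more than one position.

```python
def min_code_size(n_classes: int) -> int:
    """Smallest number of bits that gives each class a distinct binary codeword."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (n_classes - 1).bit_length()


print(min_code_size(90))  # 7, since 2**6 = 64 < 90 <= 128 = 2**7
print(min_code_size(12))  # 4, since 2**3 = 8 < 12 <= 16 = 2**4
```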

MaxHalford commented 1 year ago

Hey there @Yasmen-Wahba. Would you be able to share some code and a dataset so we can reproduce what you're doing? Also, does an OVR classifier provide good performance or not? If it doesn't then OCC usually won't do better.

Yasmen-Wahba commented 1 year ago

The OVR crashes. Unfortunately, the dataset is confidential, but a similar dataset would be the consumer complaints benchmark published by the Consumer Financial Protection Bureau. https://www.consumerfinance.gov/data-research/consumer-complaints/ Since it's a hierarchical problem, I'm flattening the classes to be able to use River. For the Consumer Complaints, you could use the Product and Sub-product columns.

MaxHalford commented 1 year ago

I will look into this sometime this week :)

MaxHalford commented 1 year ago

So I downloaded the complaints dataset. However, I'm unsure what would be the target. Can you tell me?

{'Date received': '2022-12-12',
  'Product': 'Credit reporting, credit repair services, or other personal consumer reports',
  'Sub-product': 'Credit reporting',
  'Issue': "Problem with a credit reporting company's investigation into an existing problem",
  'Sub-issue': 'Their investigation did not fix an error on your report',
  'Consumer complaint narrative': '',
  'Company public response': '',
  'Company': 'Experian Information Solutions Inc.',
  'State': 'NC',
  'ZIP code': '27356',
  'Tags': '',
  'Consumer consent provided?': '',
  'Submitted via': 'Web',
  'Date sent to company': '2022-12-12',
  'Company response to consumer': 'In progress',
  'Timely response?': 'Yes',
  'Consumer disputed?': 'N/A',
  'Complaint ID': '6311723'}

By the way, to answer your question:

Is there a way we can modify the occ classifier to allow for compact codes (as short as possible) while still providing enough discriminating power between the different classes.

I don't think that's possible. I don't see an obvious way to improve that algorithm. It's probably just not good enough, or your problem is too difficult.

received the following error: OverflowError: Python int too large to convert to C ssize_t.

I would like to understand this better. What exact model were you using? Can you share some code?

Yasmen-Wahba commented 1 year ago

The target would either be 'Product' and 'Sub-Product' or 'Issue' and 'Sub-issue'. Product would be the Parent class and Sub-product would be the child. Or Issue as a parent and the sub-issue as the child (i.e., level-2) of the hierarchy.

Yasmen-Wahba commented 1 year ago

I am using PAC with code-size=30: occPAC = multiclass.OutputCodeClassifier(classifier=linear_model.PAClassifier(), code_size=30, seed=24) to classify the 'Product' and 'Sub-product' columns. I merged both into one 'Product-subproduct' column, as if it were a flat classification problem. For example, at level 1: Product = 'Bank Account' and Sub-product = 'Debts'. My category would be called 'Combined = BankAccount-debt'. So, if I have 12 classes in the first level and another 20 in the second level, that would mean a total of 20 classes. The problem lies in the large number of classes, so we can ignore the fact that the problem is hierarchical. I still can't use that classifier with the large number of classes present in either the Product or the Sub-product column because of the code-size. It seems this classifier is not meant for large multi-class problems ...
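
The flattening described above can be sketched as follows. This is only an illustration under assumptions: `flatten_label` is a hypothetical helper, the column names are taken from the Consumer Complaints record shown earlier in the thread, and the "-" separator mirrors the 'BankAccount-debt' example.

```python
def flatten_label(row: dict, parent: str = "Product", child: str = "Sub-product", sep: str = "-") -> str:
    """Join the parent and child labels of one record into a single flat class label."""
    return f"{row[parent].strip()}{sep}{row[child].strip()}"


# Trimmed copy of the record posted earlier in the thread.
row = {
    "Product": "Credit reporting, credit repair services, or other personal consumer reports",
    "Sub-product": "Credit reporting",
}
print(flatten_label(row))
```

With this scheme every distinct (parent, child) pair becomes one class, so the number of flat classes equals the number of distinct pairs that occur in the data.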

MaxHalford commented 1 year ago

What happens if you use a standard scaler before your PA? It's usually a good idea to scale your data:

model = multiclass.OutputCodeClassifier(
    classifier=preprocessing.StandardScaler() | linear_model.PAClassifier(),
    code_size=30,
    seed=24
)
Yasmen-Wahba commented 1 year ago

I'm working with text, so my matrices are sparse (TF-IDF vectors). If they are scaled they will become dense, and the memory will definitely crash :(

MaxHalford commented 1 year ago

Ok that's another problem. But what about the scaling? Does it improve the accuracy of your model?

Yasmen-Wahba commented 1 year ago

It's hard to tell as my session keeps crashing. I am using Google Colab with GPU support.

MaxHalford commented 1 year ago

But does it at least launch? I don't understand why it's crashing. Does it run for a while? Can you share the script you're running? It's really hard to help you right now, you're not sharing enough information.

Yasmen-Wahba commented 1 year ago
from river import compose, feature_extraction, linear_model, multiclass, preprocessing
from sklearn.model_selection import train_test_split

occPAC = multiclass.OutputCodeClassifier(
    classifier=preprocessing.StandardScaler() | linear_model.PAClassifier(),
    code_size=30,
    seed=24,
)
pipe_pac = compose.Pipeline(
    ('vectorizer', feature_extraction.TFIDF(lowercase=True, ngram_range=(1, 3))),
    ('pac', occPAC),
)

# df is the (confidential) DataFrame holding the text and its label
X = df['FinalText']
y = df['DV_CATEGORY']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on one (text, label) pair at a time; learn_one updates the pipeline in place
for text, label in zip(X_train, y_train):
    pipe_pac.learn_one(text, label)

Yasmen-Wahba commented 1 year ago

The cell runs for some time, though not long (a few seconds), and then it crashes. I could pay to upgrade my runtime (more RAM or a TPU), but that would defeat the whole purpose of my algorithm, which is supposed to be efficient in terms of training/testing times.

Yasmen-Wahba commented 1 year ago

I forgot to mention that DV_CATEGORY has only 12 labels/classes so it is not much ...

MaxHalford commented 1 year ago

So one issue I see is that you're loading all the data in memory. I suggest doing things the River way and reading the data from the disk in a streaming fashion. That way you only have one sample in memory at a time. Indeed, there is little reason why RAM should be an issue.
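
As a minimal sketch of what reading in a streaming fashion means, the generator below yields one (text, label) pair at a time from a CSV source using only the standard library. `iter_text_and_label` is a hypothetical helper, the in-memory sample stands in for a file on disk, and the column names are the ones used in this thread. River also ships `stream.iter_csv` for the same purpose.

```python
import csv
import io
from typing import IO, Iterator, Tuple


def iter_text_and_label(buffer: IO[str], text_col: str, label_col: str) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs row by row, without loading the whole file."""
    for row in csv.DictReader(buffer):
        yield row[text_col], row[label_col]


# Stand-in for an opened CSV file; the rows are made up for illustration.
sample = io.StringIO(
    "FinalText,DV_CATEGORY\n"
    "card was charged twice,Billing\n"
    "report shows wrong address,Credit reporting\n"
)

for text, label in iter_text_and_label(sample, "FinalText", "DV_CATEGORY"):
    # In the setup from this thread, the model's learn_one(text, label) call would go here.
    print(label, "<-", text)
```

Only the current row is held in memory, so memory use stays flat however large the file is.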

Yasmen-Wahba commented 1 year ago

Hmm. Okay will try that ..

Yasmen-Wahba commented 1 year ago

I've been struggling to get the code working for a code-size higher than 25! There is no way to get code-size=30 to succeed, with or without feature scaling. I used stream.iter_array() to iterate over my array before calling learn_one, but the session crashes even before reaching the learning step. I'm getting this runtime error when I'm just initializing the classifier: occPAC = multiclass.OutputCodeClassifier(classifier=preprocessing.StandardScaler() | linear_model.PAClassifier(), code_size=30, seed=24)

have you tried using a code-size this long before? on any dataset?

MaxHalford commented 1 year ago

have you tried using a code-size this long before? on any dataset?

No I haven't. I'll give it a try if I have some time these holidays. But still, I don't see why it would crash your RAM. Have you tried with your own laptop? There's really little reason for this to crash, as you should only have one sample in memory at a time.

MaxHalford commented 1 year ago

A piece of advice: you can put the scaling step before your OCC. That way you only have to scale the features once for all the models, instead of doing it for each model:

# āŒ
(
    feature_extraction.TFIDF() |
    multiclass.OutputCodeClassifier(preprocessing.StandardScaler() | linear_model.PAClassifier())
)

# ✅
(
    feature_extraction.TFIDF() |
    preprocessing.StandardScaler() |
    multiclass.OutputCodeClassifier(linear_model.PAClassifier())
)
Yasmen-Wahba commented 1 year ago

I am using my own laptop, and the memory crashes before I even load my data, just by initializing the classifier with a code size of 30: occPAC = multiclass.OutputCodeClassifier(linear_model.PAClassifier(), code_size=30, seed=24). This line crashes immediately. I thought it might be because of the trigrams, which increase the feature size drastically, so I removed that line from the cell and only ran the one which initializes the OCC. It crashes as soon as I set the code size to 30: the RAM reaches 12 GB and crashes ...
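
The thread does not establish the cause of this crash. The symptoms (memory exhausted at construction time, before any data is seen) would be consistent with something of size 2**code_size being materialized, but that is an assumption and not a confirmed diagnosis of River's implementation. As a rough illustration of the scale involved, and of how distinct random codewords could be drawn lazily instead, here is a sketch. `LazyCodeBook` is a hypothetical class, not River's API.

```python
import random


class LazyCodeBook:
    """Assign each new class a distinct random binary codeword the first time it is seen."""

    def __init__(self, code_size: int, seed: int = None):
        self.code_size = code_size
        self._rng = random.Random(seed)
        self._codes = {}    # class label -> tuple of 0/1 bits
        self._used = set()  # integers already handed out as codewords

    def __getitem__(self, label):
        if label not in self._codes:
            if len(self._used) >= 2 ** self.code_size:
                raise ValueError("more classes than available codewords")
            # Rejection sampling: memory grows with the number of classes seen,
            # not with the 2**code_size possible codewords.
            while True:
                value = self._rng.getrandbits(self.code_size)
                if value not in self._used:
                    break
            self._used.add(value)
            self._codes[label] = tuple((value >> i) & 1 for i in range(self.code_size))
        return self._codes[label]


book = LazyCodeBook(code_size=40, seed=24)
print(len(book["Mortgage"]))  # 40 bits in the codeword
print(2 ** 30)                # 1073741824 possible codewords at code_size=30
```

A list holding all 2**30 candidate integers would need several gigabytes for the list's pointers alone, which is in the range of the 12 GB reported above.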

MaxHalford commented 1 year ago

I'm stumped. Do you mind sharing the Google Colab notebook?

Yasmen-Wahba commented 1 year ago

Here's a cleaned-up version: https://colab.research.google.com/drive/1fRpN9lvePe_MW79LAJviiwielxKDEDCw?usp=sharing

Yasmen-Wahba commented 1 year ago

I'm stumped. Do you mind sharing the Google Colab notebook?

It should crash before you reach the last cell where you train your model.