CUML models not working with textattack library

farwashah6 commented 1 month ago

Hi. I am new to GPU computing. I work on adversarial machine learning, and I previously used the TextAttack library in one of my projects with scikit-learn and Keras models. For that I wrote custom model wrappers for my models, and they worked fine.

Now that my data is different and much larger, I want to run the same (scikit-learn) models on the GPU, so I have to use cuML instead. But when I pass the cuML model to the CustomModelWrapper I created earlier, it fails with the error "len() of unsized object" and stops executing.

Additional info: I vectorise my data with cuML's CountVectorizer, which is what triggers this error. When I use scikit-learn's CountVectorizer instead, the attack runs but (of course) makes little use of the GPU. If anyone has run into the same problem, please help.

I am attaching important chunks of my code here.

vectorizer = cuml.feature_extraction.text.CountVectorizer(max_features=50)
x_train_vectorized = vectorizer.fit_transform(pd.Series(x_train))
x_test_vectorized = vectorizer.transform(pd.Series(x_test))

class CustomModelWrapper(ta.models.wrappers.ModelWrapper):

    def __init__(self, model, vectorizer):
        super().__init__()
        self.model = model
        self.vectorizer = vectorizer

    def __call__(self, text_input_list, batch=None):
        x_transform = self.vectorizer.transform(pd.Series(text_input_list)).astype(float)
        prediction = self.model.predict_proba(x_transform)
        return prediction

custom_model_wrapper = CustomModelWrapper(cuml_model, vectorizer)
attacker.attack_dataset()
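One way to narrow this down is to call the wrapper directly on a couple of strings, outside TextAttack, and look at what kind of object comes back. The helper below is only a sketch: `describe_output` is a hypothetical name, and the nested list stands in for a real prediction so the snippet runs without a GPU.

```python
def describe_output(obj):
    """Summarise an object returned by a model wrapper (hypothetical helper)."""
    cls = type(obj)
    return {
        # Fully qualified type name, e.g. a NumPy, CuPy or cuDF class.
        "type": f"{cls.__module__}.{cls.__qualname__}",
        # Array-like objects expose shape/ndim; plain lists do not.
        "shape": getattr(obj, "shape", None),
        "ndim": getattr(obj, "ndim", None),
    }


# Stand-in for a real prediction so the sketch runs without CUDA.
fake_scores = [[0.2, 0.8], [0.9, 0.1]]
print(describe_output(fake_scores))
```

With the real objects, the call would be `describe_output(custom_model_wrapper(["some text", "another text"]))`. Comparing the result for the cuML and scikit-learn vectorizers should show whether the two paths hand TextAttack different array types.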

beckernick commented 1 month ago

Thanks for surfacing this issue. Could you share a minimal reproducible example that includes the full error?

This could be due to the same underlying issue as https://github.com/rapidsai/cuml/issues/5160 @dantegd @quasiben

farwashah6 commented 1 month ago

Here is a sample script:

import textattack as ta
import cuml
import sklearn as sk
import sklearn.model_selection  # makes sk.model_selection available
import pandas as pd
from textattack.models.wrappers import ModelWrapper

def load_data():
    # Label fake articles 0 and true articles 1, then merge both files.
    df_fake = pd.read_csv('datasets/isot/isot_Fake.csv')
    df_fake['label'] = 0
    df_true = pd.read_csv('datasets/isot/isot_True.csv')
    df_true['label'] = 1
    df = pd.concat([df_true, df_fake], ignore_index=True)
    x = df['text'].copy()
    y = df['label']

    train_samples, test_samples, train_labels, test_labels = sk.model_selection.train_test_split(x, y, test_size=0.5, random_state=42)
    return train_samples, test_samples, train_labels, test_labels, df

def vectorization(x_train, x_test):
    # cuML's CountVectorizer; swapping in sklearn's avoids the error.
    vectorizer = cuml.feature_extraction.text.CountVectorizer()
    train_vect = vectorizer.fit_transform(pd.Series(x_train))
    test_vect = vectorizer.transform(pd.Series(x_test))
    return train_vect, test_vect, vectorizer

def model(x_train_vect, x_test_vect, y_train, y_test):
    classifiers = cuml.neighbors.KNeighborsClassifier()
    classifiers.fit(x_train_vect, y_train)
    accuracy = classifiers.score(x_test_vect, y_test)
    print(f'Accuracy: {accuracy}')
    return classifiers

class CuMLKNNWrapper(ModelWrapper):
    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer

    def __call__(self, text_input, batch=None):
        # Vectorise the raw strings, then return class probabilities.
        x_transform = self.vectorizer.transform(pd.Series(text_input)).astype(float)
        prediction = self.model.predict_proba(x_transform)
        return prediction

def attack(cuml_model, df, cuml_vectorizer):
    custom_model_wrapper = CuMLKNNWrapper(cuml_model, cuml_vectorizer)
    recipe = ta.attack_recipes.TextFoolerJin2019.build(model_wrapper=custom_model_wrapper)

    data = [(row['text'], row['label']) for _, row in df.iterrows()]
    attack_args = ta.attack_args.AttackArgs(num_examples=20, parallel=True, num_workers_per_device=2, disable_stdout=True)
    dataset = ta.datasets.Dataset(data, input_columns=['text'])
    attacker = ta.Attacker(recipe, dataset, attack_args)
    attacker.attack_dataset()

if __name__ == '__main__':
    train_examples, test_examples, y_train, y_test, data_samples = load_data()
    x_train_tokens, x_test_tokens, vectorizer = vectorization(x_train=train_examples, x_test=test_examples)
    classifier = model(x_train_tokens, x_test_tokens, y_train, y_test)
    attack(classifier, data_samples, vectorizer)

Error:

Traceback (most recent call last):
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/attacker.py", line 591, in attack_from_queue
    result = attack.attack(example, ground_truth_output)
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/attack.py", line 444, in attack
    goal_function_result, _ = self.goal_function.init_attack_example(
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/goal_functions/goal_function.py", line 67, in init_attack_example
    result, _ = self.get_result(attacked_text, check_skip=True)
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/goal_functions/goal_function.py", line 78, in get_result
    results, search_over = self.get_results([attacked_text], **kwargs)
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/goal_functions/goal_function.py", line 95, in get_results
    model_outputs = self._call_model(attacked_text_list)
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/goal_functions/goal_function.py", line 218, in _call_model
    outputs = self._call_model_uncached(uncached_list)
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/goal_functions/goal_function.py", line 193, in _call_model_uncached
    return self._process_model_outputs(attacked_text_list, outputs)
  File "/home/farwa/vscode/venv/lib/python3.10/site-packages/textattack/goal_functions/classification/classification_goal_function.py", line 25, in _process_model_outputs
    scores = torch.tensor(scores)
  File "cupy/_core/core.pyx", line 1496, in cupy._core.core._ndarray_base.__len__
TypeError: len() of unsized object
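The last two frames show `torch.tensor(scores)` ending up in CuPy's length check, which suggests the wrapper handed TextAttack a GPU-side array. A possible workaround, not verified against this setup, is to convert the prediction to plain nested Python lists before returning it. The sketch below assumes the output is array-like and exposes `tolist()` (as NumPy and CuPy arrays do) or `to_numpy()` (as cuDF and pandas objects do). `to_host_lists` and `FakeDeviceArray` are hypothetical names, and the fake class only exists so the snippet runs without CUDA.

```python
def to_host_lists(scores):
    """Best-effort conversion of model outputs to nested Python lists."""
    # cuDF / pandas objects expose to_numpy(), which yields a host array.
    if hasattr(scores, "to_numpy"):
        scores = scores.to_numpy()
    # NumPy and CuPy arrays both expose tolist().
    if hasattr(scores, "tolist"):
        return scores.tolist()
    # Fall back to plain iteration for list-like inputs.
    return [list(row) for row in scores]


class FakeDeviceArray:
    """Stand-in for a GPU array (assumption: real arrays offer tolist())."""

    def __init__(self, rows):
        self._rows = rows

    def tolist(self):
        return [list(row) for row in self._rows]


print(to_host_lists(FakeDeviceArray([(0.3, 0.7)])))  # [[0.3, 0.7]]
```

Inside the wrapper this would amount to `return to_host_lists(self.model.predict_proba(x_transform))`. Whether that is enough here depends on what cuML actually returns for this input, which the thread does not confirm.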

dantegd commented 1 month ago

I think @beckernick is correct, this is probably the same as #5160. We are planning to work on improvements and fixes for encoders and vectorizers very soon, including CountVectorizer, so we are aiming to have a solution for this in a nightly version in the next few weeks.

farwashah6 commented 1 month ago

Thank you. Looking forward to the updates.

rapidsai / cuml

CUML models not working with textattack library #5904