w4k2 / stream-learn

stream-learn is an open-source Python library for difficult data stream analysis.
https://stream-learn.readthedocs.io
GNU General Public License v3.0

Problems with ARFF & CSV Parsers #30

Closed ZahirBilal closed 2 years ago

ZahirBilal commented 2 years ago

Hello guys,

first of all, thanks for your efforts and contributions to the streaming analytics community :)

I am currently working on my master's thesis on real-time streaming analysis for an imbalanced dataset. I tried to load the data as .csv with:

stream = CSVParser(r"my_local_dir\datasets\azure_100_norm.csv")

Then I tried to evaluate the data stream using the following lines:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
metrics = [accuracy_score, precision]
evaluator = TestThenTrain(metrics)
evaluator.process(stream, clf)
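
For completeness, the full snippet looks roughly like this; the module paths for CSVParser and TestThenTrain match the tracebacks below, while the metric imports (accuracy_score from sklearn.metrics, precision from strlearn.metrics) are an assumption:

# Minimal sketch of the full setup; import locations marked below are assumptions.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score      # assumed source of accuracy_score
from strlearn.metrics import precision          # assumed source of the streaming precision metric
from strlearn.streams import CSVParser          # module path as shown in the tracebacks below
from strlearn.evaluators import TestThenTrain

stream = CSVParser(r"my_local_dir\datasets\azure_100_norm.csv")
clf = GaussianNB()
metrics = [accuracy_score, precision]
evaluator = TestThenTrain(metrics)
evaluator.process(stream, clf)                  # this call raises the AttributeError below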

Here I received the following error:


AttributeError                            Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 evaluator.process(stream, clf)

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:94, in TestThenTrain.process(self, stream, clfs)
     89 self.scores[clfid, stream.chunk_id - 1] = [
     90     metric(y, y_pred) for metric in self.metrics
     91 ]
     93 # Train
---> 94 [clf.partial_fit(X, y, self.stream.classes_) for clf in self.clfs]
     96 if stream.is_dry():
     97     break

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:94, in <listcomp>(.0)
     89 self.scores[clfid, stream.chunk_id - 1] = [
     90     metric(y, y_pred) for metric in self.metrics
     91 ]
     93 # Train
---> 94 [clf.partial_fit(X, y, self.stream.classes_) for clf in self.clfs]
     96 if stream.is_dry():
     97     break

AttributeError: 'CSVParser' object has no attribute 'classes_'

Then I tried converting the data to ARFF and using the ARFFParser instead; the rest of the code remained the same. I also got an error:

stream = ARFFParser(r"my_local_dir\datasets\azure_100_norm.arff")


IndexError                                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 evaluator.process(stream, clf)

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:80, in TestThenTrain.process(self, stream, clfs)
     78 pbar = tqdm(total=stream.n_chunks)
     79 while True:
---> 80     X, y = stream.get_chunk()
     81     if self.verbose:
     82         pbar.update(1)

File ~\Anaconda3\lib\site-packages\strlearn\streams\ARFFParser.py:130, in ARFFParser.get_chunk(self)
    128 X, y = np.zeros((size, self.n_attributes)), []
    129 for i in range(size):
--> 130     if not self.a_line[-1] == "\n":
    131         self.isdry = True
    132         line = self.a_line

IndexError: string index out of range
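
The frame above fails on self.a_line[-1], i.e. indexing an empty string, which suggests the parser hit an empty line or a missing final newline. A purely diagnostic sketch (not a fix) to check my file, with the path being my local placeholder:

# Diagnostic sketch: look for empty lines or a missing trailing newline in the ARFF file,
# either of which would leave self.a_line as an empty string in the frame above.
path = r"my_local_dir\datasets\azure_100_norm.arff"
with open(path) as f:
    lines = f.readlines()
print("last line:", repr(lines[-1]))
print("number of empty lines:", sum(1 for line in lines if line.strip() == ""))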

Finally, I tried with the NPYParser:

stream = NPYParser(r"my_local_dir\datasets\azure_100_norm.npy")
evaluator.process(stream, clf)

IndexError                                Traceback (most recent call last)
Input In [16], in <cell line: 1>()
----> 1 evaluator.process(stream, clf)

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:80, in TestThenTrain.process(self, stream, clfs)
     78 pbar = tqdm(total=stream.n_chunks)
     79 while True:
---> 80     X, y = stream.get_chunk()
     81     if self.verbose:
     82         pbar.update(1)

File ~\Anaconda3\lib\site-packages\strlearn\streams\NPYParser.py:80, in NPYParser.get_chunk(self)
     78     self.previous_chunk = self.current_chunk
     79 else:
---> 80     self.X, self.y = self._make_classification()
     81     self.reset()
     83 self.chunk_id += 1

File ~\Anaconda3\lib\site-packages\strlearn\streams\NPYParser.py:51, in NPYParser._make_classification(self)
     48 def _make_classification(self):
     49     # Read CSV
     50     ds = np.load(self.path)
---> 51     self.classes_ = np.unique(ds[:,-1]).astype(int)
     52     return ds[:,:-1], ds[:,-1]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
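
Judging from the _make_classification frame above, NPYParser seems to expect a single 2-D array with the features in the leading columns and the class label in the last column, while my .npy file apparently got saved as 1-D. A conversion sketch along these lines (assuming the CSV is purely numeric, has no header row, and already has the label as its last column) would probably be needed:

import numpy as np

# Hypothetical conversion sketch: build the 2-D layout the traceback suggests NPYParser expects,
# i.e. shape (n_samples, n_features + 1) with the class label in the last column.
data = np.genfromtxt(r"my_local_dir\datasets\azure_100_norm.csv", delimiter=",")
print(data.ndim, data.shape)   # should print 2 and (n_samples, n_features + 1)

np.save(r"my_local_dir\datasets\azure_100_norm.npy", data)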

Data used:

https://github.com/ZahirBilal/dataset_test

So I am not sure where the problem lies. Unfortunately, I did not have enough time to go through each module to check whether the problem is with the data files or with the parser modules. However, I tried other data files and encountered the same errors.

Any feedback would be greatly appreciated.

Also, if the problem is within the library code itself, I could help and take a look, as implementing the OOB and UOB algorithms is essential for my thesis.

Many thanks and best regards,
Zahir

jedrzejkozal commented 2 years ago

Hi, thank you for reporting this issue. There were some errors in ARFFParser and CSVParser. The fix was merged to master in the latest commit (238bb27f96aac88d55955cbe97d1711a262c10cd). You should be able to use it by downloading the latest version of the stream-learn repo and running make install (as described in our quick start guide). The next release of stream-learn will contain this fix.

For now, I'm closing this issue. If you have any other problems, please let us know.