w4k2 / stream-learn

stream-learn is an open-source Python library for difficult data stream analysis.
https://stream-learn.readthedocs.io
GNU General Public License v3.0

Problems with ARFF & CSV Parsers #30

Closed ZahirBilal closed 2 years ago

ZahirBilal commented 2 years ago

Hello guys,

first of all, thanks for your efforts and contributions to the streaming analytics community :)

I am currently working on my master's thesis on real-time streaming analysis for an imbalanced dataset. I tried to load the data as .csv with:

stream = CSVParser(r"my_local_dir\datasets\azure_100_norm.csv")

Then I tried to evaluate the data stream using the following lines:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
metrics = [accuracy_score, precision]
evaluator = TestThenTrain(metrics)
evaluator.process(stream, clf)
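
For completeness, the full snippet looks roughly like this; the module paths for CSVParser and TestThenTrain match the tracebacks below, while the metric imports (accuracy_score from sklearn.metrics, precision from strlearn.metrics) are an assumption:

# Minimal sketch of the full setup; import locations marked below are assumptions.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score      # assumed source of accuracy_score
from strlearn.metrics import precision          # assumed source of the streaming precision metric
from strlearn.streams import CSVParser          # module path as shown in the tracebacks below
from strlearn.evaluators import TestThenTrain

stream = CSVParser(r"my_local_dir\datasets\azure_100_norm.csv")
clf = GaussianNB()
metrics = [accuracy_score, precision]
evaluator = TestThenTrain(metrics)
evaluator.process(stream, clf)                  # this call raises the AttributeError below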

Here I received the following error:


AttributeError                            Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 evaluator.process(stream, clf)

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:94, in TestThenTrain.process(self, stream, clfs)
     89 self.scores[clfid, stream.chunk_id - 1] = [
     90     metric(y, y_pred) for metric in self.metrics
     91 ]
     93 # Train
---> 94 [clf.partial_fit(X, y, self.stream.classes_) for clf in self.clfs]
     96 if stream.is_dry():
     97     break

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:94, in <listcomp>(.0)
     89 self.scores[clfid, stream.chunk_id - 1] = [
     90     metric(y, y_pred) for metric in self.metrics
     91 ]
     93 # Train
---> 94 [clf.partial_fit(X, y, self.stream.classes_) for clf in self.clfs]
     96 if stream.is_dry():
     97     break

AttributeError: 'CSVParser' object has no attribute 'classes_'

Then I tried converting the data to ARFF and using the ARFFParser instead; the rest of the code remained the same. I also got an error:

stream = ARFFParser(r"my_local_dir\datasets\azure_100_norm.arff")


IndexError                                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 evaluator.process(stream, clf)

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:80, in TestThenTrain.process(self, stream, clfs)
     78 pbar = tqdm(total=stream.n_chunks)
     79 while True:
---> 80     X, y = stream.get_chunk()
     81     if self.verbose:
     82         pbar.update(1)

File ~\Anaconda3\lib\site-packages\strlearn\streams\ARFFParser.py:130, in ARFFParser.get_chunk(self)
    128 X, y = np.zeros((size, self.n_attributes)), []
    129 for i in range(size):
--> 130     if not self.a_line[-1] == "\n":
    131         self.isdry = True
    132         line = self.a_line

IndexError: string index out of range
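
The frame above fails on self.a_line[-1], i.e. indexing an empty string, which suggests the parser hit an empty line or a missing final newline. A purely diagnostic sketch (not a fix) to check my file, with the path being my local placeholder:

# Diagnostic sketch: look for empty lines or a missing trailing newline in the ARFF file,
# either of which would leave self.a_line as an empty string in the frame above.
path = r"my_local_dir\datasets\azure_100_norm.arff"
with open(path) as f:
    lines = f.readlines()
print("last line:", repr(lines[-1]))
print("number of empty lines:", sum(1 for line in lines if line.strip() == ""))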

Finally, I tried with the NPYParser:

stream = NPYParser(r"my_local_dir\datasets\azure_100_norm.npy")
evaluator.process(stream, clf)

IndexError                                Traceback (most recent call last)
Input In [16], in <cell line: 1>()
----> 1 evaluator.process(stream, clf)

File ~\Anaconda3\lib\site-packages\strlearn\evaluators\TestThenTrain.py:80, in TestThenTrain.process(self, stream, clfs)
     78 pbar = tqdm(total=stream.n_chunks)
     79 while True:
---> 80     X, y = stream.get_chunk()
     81     if self.verbose:
     82         pbar.update(1)

File ~\Anaconda3\lib\site-packages\strlearn\streams\NPYParser.py:80, in NPYParser.get_chunk(self)
     78     self.previous_chunk = self.current_chunk
     79 else:
---> 80     self.X, self.y = self._make_classification()
     81     self.reset()
     83 self.chunk_id += 1

File ~\Anaconda3\lib\site-packages\strlearn\streams\NPYParser.py:51, in NPYParser._make_classification(self)
     48 def _make_classification(self):
     49     # Read CSV
     50     ds = np.load(self.path)
---> 51     self.classes_ = np.unique(ds[:,-1]).astype(int)
     52     return ds[:,:-1], ds[:,-1]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
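
Judging from the _make_classification frame above, NPYParser seems to expect a single 2-D array with the features in the leading columns and the class label in the last column, while my .npy file apparently got saved as 1-D. A conversion sketch along these lines (assuming the CSV is purely numeric, has no header row, and already has the label as its last column) would probably be needed:

import numpy as np

# Hypothetical conversion sketch: build the 2-D layout the traceback suggests NPYParser expects,
# i.e. shape (n_samples, n_features + 1) with the class label in the last column.
data = np.genfromtxt(r"my_local_dir\datasets\azure_100_norm.csv", delimiter=",")
print(data.ndim, data.shape)   # should print 2 and (n_samples, n_features + 1)

np.save(r"my_local_dir\datasets\azure_100_norm.npy", data)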

Data used:

https://github.com/ZahirBilal/dataset_test

So I am not sure where the problem lies. Unfortunately, I did not have enough time to go through each module to check whether the problem is with the data files or with the parser modules. However, I tried other data files and encountered the same errors.

Any feedback would be greatly appreciated.

Also, if the problem is within the library code itself, I could help and take a look, as implementing the OOB and UOB algorithms is essential for my thesis.

Many thanks and best regards,
Zahir

jedrzejkozal commented 2 years ago

Hi, thank you for reporting this issue. There were some errors in ARFFParser and CSVParser. The fix was merged to master in the latest commit (238bb27f96aac88d55955cbe97d1711a262c10cd). You should be able to use it by downloading the latest version of the stream-learn repo and running make install (as described in our quick start guide). The next release of stream-learn will contain this fix.

For now, I'm closing this issue. If you have any other problems, please let us know.