logpai / loglizer

A machine learning toolkit for log-based anomaly detection [ISSRE'16]
MIT License

What's the difference between the data in HDFS.npz and the data produced by load_HDFS from the full HDFS.log_structured.csv? #76

Open Shine21497 opened 4 years ago

Shine21497 commented 4 years ago

Hi, I am trying to reproduce the benchmarking results with the full HDFS log data. I used logparser/Drain to produce the full HDFS.log_structured.csv, which has the same structure as HDFS_100k.log_structured.csv. I then loaded the full HDFS.log_structured.csv and the label file in HDFS_benchmark.py, just as the demo does, but the PCA and IM results are very different from those shown in the README (the LR, SVM, and DT results are similar). It seems that the data in HDFS.npz differ from the data that load_HDFS generates from the full HDFS.log_structured.csv. Even with HDFS.npz in hand, it is hard to use without knowing what this difference is. Many thanks.
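One way to narrow down where the two datasets diverge is to summarize each one and compare the summaries. The sketch below is only an illustration: `summarize` and `compare` are hypothetical helpers, not part of loglizer, and the event IDs and labels are invented toy data. It assumes each dataset can be reduced to a list of per-session event sequences plus a parallel list of 0/1 labels.

```python
from collections import Counter


def summarize(sequences, labels):
    """Return basic statistics for a set of per-session event sequences."""
    sequences = [list(seq) for seq in sequences]
    labels = list(labels)
    if len(sequences) != len(labels):
        raise ValueError("sequences and labels must have the same length")
    event_counts = Counter(event for seq in sequences for event in seq)
    n = len(sequences)
    return {
        "sessions": n,
        "anomaly_ratio": sum(1 for y in labels if y) / n if n else 0.0,
        "mean_length": sum(len(seq) for seq in sequences) / n if n else 0.0,
        "events": set(event_counts),
    }


def compare(a, b):
    """Report where two summaries differ."""
    return {
        "session_diff": a["sessions"] - b["sessions"],
        "anomaly_ratio_diff": a["anomaly_ratio"] - b["anomaly_ratio"],
        "only_in_a": sorted(a["events"] - b["events"]),
        "only_in_b": sorted(b["events"] - a["events"]),
    }


# Invented toy data standing in for the two sources (not real HDFS content).
npz_like = summarize([["E5", "E22", "E11"], ["E5", "E9"]], [0, 1])
csv_like = summarize([["a1b2", "c3d4", "e5f6"], ["a1b2", "0f9e"]], [0, 1])
report = compare(npz_like, csv_like)
print(report)
```

To apply this to the real files, the sequences and labels would need to be pulled out of each source first. For the archive, `numpy.load("HDFS.npz", allow_pickle=True)` returns an object whose `.files` attribute lists the stored array names, so the keys can be checked rather than assumed. A mismatch in session count, anomaly ratio, or event ID vocabulary would each point to a different cause.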