rpytel1 / log-strategy

Project conducted for the Seminar in Machine Learning for Software Engineering. The aim of our research was to explore possible directions for deep learning solutions to log detection in a snippet of code.


Training components #5

Closed rpytel1 closed 5 years ago

rpytel1 commented 5 years ago

Training

50/50 split (logs/no logs)
Assume Apache Hadoop as ground truth.
Parse with AST
2000-4000 samples

Validation

"honest" to ground truth
probably 90/10 split
1000-2000 samples
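
A minimal sketch of how the two sample sets above could be drawn. It assumes the functions have already been separated into a pool with logging and a pool without, and it reads the "90/10 split" as 90% non-logging to 10% logging functions, which is an interpretation rather than something the notes state. The name `sample_with_ratio` and the placeholder pools are hypothetical.

```python
import random

def sample_with_ratio(logged, unlogged, n_samples, log_fraction, seed=0):
    """Draw n_samples (function, label) pairs with a fixed share of logging functions.

    label is 1 for a function that contains logging and 0 otherwise.
    """
    rng = random.Random(seed)
    n_logged = round(n_samples * log_fraction)
    n_unlogged = n_samples - n_logged
    if n_logged > len(logged) or n_unlogged > len(unlogged):
        raise ValueError("not enough functions to reach the requested ratio")
    picked = [(f, 1) for f in rng.sample(logged, n_logged)]
    picked += [(f, 0) for f in rng.sample(unlogged, n_unlogged)]
    rng.shuffle(picked)
    return picked

# Placeholder pools standing in for parsed Hadoop functions (hypothetical data).
logged = [f"logged_{i}" for i in range(3000)]
unlogged = [f"plain_{i}" for i in range(20000)]

# Training: balanced 50/50. Validation: assumed 10% logging, drawn from
# disjoint slices so that no function appears in both sets.
train = sample_with_ratio(logged[:2000], unlogged[:10000], 3000, 0.5)
valid = sample_with_ratio(logged[2000:], unlogged[10000:], 1000, 0.1)
```

Slicing the pools before sampling keeps training and validation disjoint. The actual validation ratio should come from the distribution measured on Hadoop, which is still an open question below.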

Open questions

What is a realistic distribution of logs per "code"?
Training sample size?
How does proper validation work?
How big is the input size for the NN (function size in data sets)?

Components to be implemented

[x] get Hadoop source code
[x] preprocess with AST
[x] parse into functions
[x] prepare features and labels
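
As an illustration of the last component, here is a rough sketch of turning one function's source text into a token sequence plus a binary label. It is not the project's pipeline: the regex for logging calls, the tokenizer, and the name `make_example` are assumptions made for this example, and a real implementation would work on the AST rather than on lines of text.

```python
import re

# Assumed shape of a logging call: a receiver named log/LOG/logger followed
# by a common level method. Real Hadoop code may need more cases.
LOG_CALL = re.compile(
    r"\b(?:log|logger)\s*\.\s*(?:trace|debug|info|warn|error|fatal)\s*\(",
    re.IGNORECASE,
)

# Very coarse tokenizer: identifiers, integers, or any single symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def make_example(function_source):
    """Return (tokens, label) for one function.

    Lines containing a logging call set label to 1 and are dropped from the
    features, so the model cannot simply read the answer from its input.
    Only the first line of a multi-line logging statement is dropped.
    """
    kept = []
    label = 0
    for line in function_source.splitlines():
        if LOG_CALL.search(line):
            label = 1
            continue
        kept.append(line)
    tokens = TOKEN.findall("\n".join(kept))
    return tokens, label

src = '''void close() {
    LOG.info("closing");
    stream.close();
}'''
tokens, label = make_example(src)
```

Dropping the logging lines from the features is a design choice of this sketch; keeping them would leak the label into the input.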

rpytel1 commented 5 years ago

Abstract Syntax Tree Parser and parts of a preprocessing pipeline: https://github.com/jan-gerling/mmsr_repo_sim

rpytel1 commented 5 years ago

Notes for 20.09:

Check the papers on our list of interesting papers; summarize the ones we read and how each relates to our case
Parser: preprocess using AST
Parser: how to extract logging lines and later create features and labels

rpytel1 commented 5 years ago

Our Ideas

Idea 1: Transfer learning for code2vec
Idea 2: SVM as a comparison (perhaps other traditional ML methods)
Idea 3: Reduce the feature space (data ablation study)

rpytel1 commented 5 years ago

18.09 Notes:

Preprocessing TODOs:

remove comments
produce parser for Logs (Logger, log)
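
The two preprocessing steps could look roughly like the sketch below, which works on raw Java text with regular expressions instead of an AST. The function names, the list of log levels, and the sample input are assumptions for illustration. Java text blocks (`"""`) are not handled.

```python
import re

# String and char literals are matched first, so a "//" or "/*" inside
# them is not mistaken for the start of a comment.
JAVA_COMMENT_OR_LITERAL = re.compile(
    r'"(?:\\.|[^"\\])*"'      # string literal
    r"|'(?:\\.|[^'\\])*'"     # char literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)

# Assumed shape of a logging call (receiver log/LOG/logger, common levels).
LOG_CALL = re.compile(
    r"\b(?:log|logger)\s*\.\s*(?:trace|debug|info|warn|error|fatal)\s*\(",
    re.IGNORECASE,
)

def remove_comments(java_source):
    """Delete // and /* */ comments while leaving literals untouched."""
    def keep_literals(match):
        text = match.group(0)
        return "" if text.startswith("/") else text
    return JAVA_COMMENT_OR_LITERAL.sub(keep_literals, java_source)

def find_log_lines(java_source):
    """Return the stripped lines that contain a logging call, ignoring comments."""
    clean = remove_comments(java_source)
    return [ln.strip() for ln in clean.splitlines() if LOG_CALL.search(ln)]

src = '''// LOG.info("old");
String url = "http://example.org"; /* note */
LOG.warn("failed // not a comment");
'''
clean = remove_comments(src)
log_lines = find_log_lines(src)
```

Removing comments before searching matters here: the commented-out `LOG.info` call in the first line would otherwise be counted as a logging statement.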