paser-group / MLForensics

Placeholder for source code and other relevant artifacts for research project titled `Forensic Anti-patterns for Machine learning`

Finalizing Paper Story #5

Open akondrahman opened 3 years ago

akondrahman commented 3 years ago

Selling Point (Option-1)

Creating this issue so that discussion on definitions does not get lost. Here is how I am defining forensic anti-patterns:

Forensic anti-patterns for machine learning are the absence of coding patterns in source code that are necessary to capture unexpected behaviors within a machine learning project.  
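As a minimal Python sketch of the definition above (the function names and dataset format are hypothetical illustrations, not code from the studied projects):

```python
import csv
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ml_pipeline")

# Forensic anti-pattern: the dataset is read silently, so a post-mortem
# analysis cannot determine which file fed the training run.
def load_training_data_silent(path):
    with open(path) as f:
        return list(csv.reader(f))

# With the logging pattern present, the load event is captured and can be
# reconstructed later from the logs.
def load_training_data_logged(path):
    logger.info("Loading training data from %s", path)
    with open(path) as f:
        return list(csv.reader(f))
```

The two functions behave identically at runtime; the anti-pattern is only visible in what is missing from the first one.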

Note to self:

Counter-argument: forensic anti-patterns are hard to detect, e.g., we can never conclusively say something is missing or not logged. If "developers do not log X" is the focus of the paper, then the paper may get rejected.

akondrahman commented 3 years ago

Selling Point (Option-2)

Suggesting another selling point:

Possible Title: Tell Me What: Towards Security-focused Logging for Machine Learning Development

Possible RQs:

RQ1: What security-related events can be logged for machine learning development? 
RQ2: How frequently do security-related events appear in machine learning development? How frequently are security-related events logged in machine learning implementations? 
RQ3: How do practitioners perceive the identified security-related events for machine learning? 
akondrahman commented 3 years ago

Selling Point (Option-3)

Possible RQs

RQ1: What categories of security-relevant code snippets can be logged for machine learning development?   
RQ2: How frequently do security-relevant code snippets appear in machine learning development? How frequently are security-relevant code snippets logged in machine learning development?
RQ3: How do practitioners perceive the identified security-relevant code snippets for machine learning development? 

The problem with security-relevant code snippets is that they can also include insecure coding snippets, which we are not detecting.

akondrahman commented 3 years ago

Selling point 4 (building on option#2)

King has identified mandatory log events ... can we build on top of King to find security log events? King says: "A mandatory log event is an action that must be logged in order to hold the software user accountable for performing the action."
We will say: "A security log event is an action expressed by source code elements that should be logged to perform post-mortem analysis of security attacks in machine learning." We will identify security log events for ML by:

  1. Manually inspecting each Python file
  2. Identifying source code elements that:
     a. perform any of the actions identified as mandatory by King: create, read, update, delete, print failure; and
     b. can be used to conduct a security attack, as reported by prior work
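The inspection in step 2a could be partially assisted by a simple AST scan for call names that suggest King's mandatory actions. The keyword-to-action mapping below is a hypothetical illustration, not the paper's actual ruleset:

```python
import ast

# Hypothetical keywords mapped to King's mandatory action types.
ACTION_KEYWORDS = {
    "create": ["create", "save", "write", "dump"],
    "read": ["read", "load", "open", "fetch"],
    "update": ["update", "fit", "train"],
    "delete": ["delete", "remove", "drop"],
}

def find_candidate_log_events(source):
    """Return (line, call_name, action) triples for calls whose names
    match a mandatory-action keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            for action, words in ACTION_KEYWORDS.items():
                if any(w in name.lower() for w in words):
                    hits.append((node.lineno, name, action))
    return hits
```

A scan like this only surfaces candidates; the manual inspection in step 1 would still decide which hits are genuine security log events.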

Another option is to say adversarial log event instead of security log event. If we want to tone it down, we can say likely adversarial log event or candidate security log event instead of security log event.

akondrahman commented 3 years ago

Useful definitions from Chuvakin's book:

An event is a single occurrence within an environment, usually involving an attempted state change.
An event field describes one characteristic of an event.
An event record is a collection of event fields.
A log is a collection of event records.
Logging is the act of collecting event records into logs.
An alert or alarm is an action taken in response to an event, usually intended to get the attention of someone or something.

akondrahman commented 3 years ago

Page#235 of Chuvakin's book to motivate the paper better

akondrahman commented 3 years ago

Maybe it will not be wise to submit bug reports ... it is possible that a lot of people will say no. Better to do a survey. Use page#2 of Security Engineering for Machine Learning as motivation.

akondrahman commented 3 years ago

In the discussion section we need to say why an automated log assistant was not built and how it can be done in future work ... groundwork, perceptions, etc.

akondrahman commented 3 years ago

Selling point 5

Forensic events: A forensic event in machine learning is an action expressed by source code elements that should be logged to perform post mortem analysis of security attacks in machine learning

akondrahman commented 3 years ago

Selling point 6

Forensic-likely coding patterns is one term that we can use. This will require submitting bug reports, which will not give us a good response rate. Alternatively, we can frame it as categories of forensic-likely coding patterns and see if devs agree with that.

Example forensic-likely coding patterns are load and read methods used to read datasets for training.

Definition: forensic-likely coding patterns are recurring coding patterns that express a mandatory log event needed to perform post-mortem analysis of security attacks.
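As a concrete sketch of the load/read example: pairing the dataset read with a log of the file's checksum would let a post-mortem analysis verify which exact bytes were used for training. The function name and the choice of SHA-256 here are illustrative assumptions:

```python
import csv
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("forensics")

def load_dataset_with_provenance(path):
    """Read a CSV dataset and log its SHA-256 so a later investigation
    can check whether the training data was tampered with."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    logger.info("dataset=%s sha256=%s", path, digest)
    with open(path) as f:
        return list(csv.reader(f))
```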

Category names:

akondrahman commented 3 years ago

Selling point 7 (credit to @effat )

Limit scope by focusing on adversarial machine learning, i.e., what to log to diagnose adversarial attacks on machine learning ... need to define:

Follow the train of thought: initially it was not clear how this differs from King; then give the definition of adversarial ML, then attack in the context of adversarial ML, then example attacks, then how different actions map to attacks, with interesting names like reinforcement learning environment.

akondrahman commented 3 years ago

Selling Point 7 (Contd.)

What categories map to what attacks:
  1. Load training data can facilitate data poisoning attacks [https://ieeexplore.ieee.org/document/8406613 ]
  2. Load pre-trained model can facilitate model poisoning attacks [https://arxiv.org/pdf/1911.12562.pdf (Finding-11) ]
  3. Download data from remote source can facilitate attacks due to malformed input [https://ieeexplore.ieee.org/document/8424643][https://arxiv.org/pdf/2007.10760.pdf ]
  4. Load classification labels from file can facilitate label perturbation attack [https://ieeexplore.ieee.org/document/8406613 ]
  5. Load pipeline configuration can facilitate physical domain attacks [https://ieeexplore.ieee.org/document/8406613 ]
  6. Update in reinforcement learning environment can facilitate strategically timed attacks [https://www.ijcai.org/Proceedings/2017/525] and neural network policy attacks [https://research.google/pubs/pub46154/] and enchanting attacks [https://arxiv.org/pdf/1801.00553.pdf ]
  7. Reading model results can be used to detect model stealing attacks [https://ieeexplore.ieee.org/document/8979377][https://arxiv.org/pdf/1911.12562.pdf ]
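Category 7 above can be made concrete with a small sketch: wrapping a model's prediction call with a query log makes unusually high query volumes, a signature of model stealing, visible in a post-mortem. The wrapper class below is a hypothetical illustration:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("forensics")

class LoggedModel:
    """Wrap any object exposing predict() and log each query, so a
    post-mortem analysis can spot the query floods typical of model
    stealing attacks."""
    def __init__(self, model):
        self._model = model
        self.query_count = 0

    def predict(self, x):
        self.query_count += 1
        logger.info("prediction query #%d input=%r", self.query_count, x)
        return self._model.predict(x)
```

The same wrapping idea applies to the other categories: each maps a source code element (load, download, update, read) to a log statement useful for diagnosing the corresponding attack.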

Policy attacks need policy detection ... a policy is a set of steps and values ... see: https://stackoverflow.com/questions/46260775/what-is-a-policy-in-reinforcement-learning

akondrahman commented 3 years ago

Selling Point 7 (Contd.)

Names
akondrahman commented 3 years ago

Selling Point 7 (Contd.)

Possible Category Names (Version-1):
  1. Poisoned data forensics
  2. Model forensics
  3. Download forensics
  4. Classification label tracing
  5. Configuration forensics
  6. Policy forensics in reinforcement learning
  7. Prediction result tracking
akondrahman commented 3 years ago

@fbhuiyan42 ... hope you are following this thread. This is where you discuss and ask questions.

akondrahman commented 3 years ago

Selling Point 7 (Contd.)

Possible Category Names (Version-2):
  1. Poisonous training data
  2. Model poisoning
  3. Remote downloads
  4. Classification label perturbations
  5. Pipeline forensics
  6. Policy forensics in reinforcement learning
  7. Prediction result tracking
akondrahman commented 3 years ago

Selling Point 7 (Contd.)

Possible Category Names (Version-3, to accommodate supervised learning):
  1. Poisonous training data
  2. Model poisoning
  3. Remote downloads
  4. Classification label perturbations
  5. Pipeline forensics
  6. Prediction result tracking
fbhuiyan42 commented 3 years ago

Are we planning to present the paper only for supervised projects? I thought we were presenting all types of projects, with the category "Policy forensics in reinforcement learning" being applicable only to reinforcement learning.

akondrahman commented 3 years ago

@fbhuiyan42

This will depend on how clear your project classification is: we will do analysis on projects that are clearly labeled as supervised, unsupervised, or reinforcement. As far as I can remember, you were confidently able to classify supervised learning projects. Correct me if I am wrong.

fbhuiyan42 commented 3 years ago

I am confident about the reinforcement projects also. But in that case, yes, I agree: without the unsupervised projects, it's better not to report the RL projects either.

akondrahman commented 3 years ago

Yes. We need to tell a consistent story. That is why we will skip reinforcement-related findings for this project. We will save the reinforcement results for a short paper or something after this one has a home.