
Similarity counterexamples sometimes misidentified #673

Open · BenCoke12 opened this issue 1 year ago

BenCoke12 commented 1 year ago

The similarity specification checks that, for all images that (a) are valid images and (b) have sneaker as their highest-scoring prediction, the following property holds: the score for sandal is greater than the score for pullover.

Running the command below produces a counterexample that obeys these properties. When I add an additional requirement, (c) the score for sneaker is higher than some threshold, the counterexamples produced are not valid: either the score for sneaker is not high enough, or the score for sandal is greater than the score for pullover (i.e. the property actually holds on the counterexample).

The margin of error I have seen is typically quite small. For example, the required score for sneaker may be 1 while the counterexample scores 0.99999887; or the score for pullover may be 0.67682445 while the score for sandal is 0.6768253 (the score for sandal is slightly larger than the score for pullover).
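For concreteness, here is a small Python check of the two gaps quoted above; the `TOLERANCE` value is a placeholder for the verifier's internal precision, not a number taken from Marabou:

```python
# Margins observed in the invalid counterexamples (numbers copied from above).
sneaker_gap = 1.0 - 0.99999887                # ~1.1e-6 below the required threshold
sandal_pullover_gap = 0.6768253 - 0.67682445  # ~8.5e-7; sandal just above pullover

# Placeholder tolerance, NOT a value taken from Marabou's configuration.
TOLERANCE = 1e-9

for name, gap in [("sneaker threshold gap", sneaker_gap),
                  ("sandal vs pullover gap", sandal_pullover_gap)]:
    status = "above" if gap > TOLERANCE else "within"
    print(f"{name}: {gap:.2e} ({status} tolerance)")
```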

Specification: https://github.com/BenCoke12/FashionVerification/blob/main/vclspecs/fashionSimilarity.vcl
Network: https://github.com/BenCoke12/FashionVerification/blob/main/onnxnetworks/fashion1l32n.onnx

Command: vehicle verify --specification vclspecs/fashionSimilarity.vcl --network classifier:onnxnetworks/fashion1l32n.onnx --verifier Marabou --verifierLocation ../Marabou/build/Marabou

Testing script: https://github.com/BenCoke12/FashionVerification/blob/main/tools/testSimilarity.py
The script contains four counterexamples that I generated at different thresholds for the sneaker score.
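In rough outline, a check of this kind looks like the sketch below. It is not the linked script itself: it assumes onnxruntime, the standard Fashion-MNIST class indices (2 = pullover, 5 = sandal, 7 = sneaker), a flattened 28×28 input with "valid image" meaning pixels in [0, 1], and hypothetical stand-ins for the counterexample and the threshold.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("onnxnetworks/fashion1l32n.onnx")
input_name = session.get_inputs()[0].name

# Hypothetical stand-ins: a real check would use a counterexample returned
# by Marabou and the threshold from requirement (c).
counterexample = np.zeros((1, 28 * 28), dtype=np.float32)
threshold = 0.9

scores = session.run(None, {input_name: counterexample})[0][0]

valid_image = bool(np.all((counterexample >= 0.0) & (counterexample <= 1.0)))  # (a)
sneaker_top = int(np.argmax(scores)) == 7                                      # (b)
above_threshold = scores[7] >= threshold                                       # (c)
# A genuine counterexample must *violate* the property sandal > pullover:
property_violated = scores[5] <= scores[2]

print("genuine counterexample:",
      valid_image and sneaker_top and above_threshold and property_violated)
```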

MatthewDaggitt commented 1 year ago

Thanks, will have a look.

MatthewDaggitt commented 1 year ago

Hmm, this does seem to be larger than Marabou's precision: https://github.com/NeuralNetworkVerification/Marabou/blob/48a30dd7ee67e35fd9feba6d4e6d2b480ee8a38f/src/configuration/GlobalConfiguration.cpp#L35

BenCoke12 commented 1 year ago

I think this is also an issue for some of the counterexamples generated for robustness. The counterexample is still in the correct class, but only just:

Counterexample predictions: array([[ -4.3608975, -9.586009, 1.8527187, -2.2896652, 6.23001, -6.723092, 6.2300096, -18.332565, -3.3177466, -15.159739 ]])

Index 4 is the correct label (6.23001); index 6 is only slightly lower (6.2300096), a gap of roughly 4e-7.

MatthewDaggitt commented 1 year ago

Okay, I have a suspicion that it might be the Gaussian-elimination code, where we transform everything to floating point. I originally did this to leverage dense native C vectors for speed, but it's no longer necessary: since switching to sparse Haskell vectors we can continue to use Rational everywhere.
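As a minimal illustration of why this matters (in Python, with fractions.Fraction standing in for Haskell's Rational, and a toy 2×3 augmented matrix rather than anything from vehicle): elimination over floats drifts away from the exact values, while elimination over rationals does not.

```python
from fractions import Fraction

def eliminate(rows):
    # Plain forward elimination (no pivoting), in whatever number
    # type the entries already have.
    for i in range(len(rows)):
        pivot = rows[i][i]
        for j in range(i + 1, len(rows)):
            factor = rows[j][i] / pivot
            rows[j] = [a - factor * b for a, b in zip(rows[j], rows[i])]
    return rows

floats    = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
rationals = [[Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)],
             [Fraction(4, 10), Fraction(5, 10), Fraction(6, 10)]]

print(eliminate(floats)[1])     # [0.0, -0.30000000000000004, -0.6]
print(eliminate(rationals)[1])  # [Fraction(0, 1), Fraction(-3, 10), Fraction(-3, 5)]
```

A residue of ~4e-17 per step is tiny, but once it crosses a strict inequality it flips the verdict, which is exactly the near-boundary behaviour reported above.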

I'm going to try switching to Rational. However, we should also be careful that, when we translate the values to strings for the Marabou backend, we do so soundly (i.e. round away from the direction of the inequality).
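For the string translation, a hypothetical helper along these lines (Python again, with Fraction standing in for Rational; vehicle's actual code will differ) makes the rounding direction explicit: for a constant c in x >= c one would round down, so the printed constraint is never stricter than the intended one, and symmetrically round up for x <= c.

```python
from fractions import Fraction
from math import ceil, floor

def show_rational(q: Fraction, digits: int, round_up: bool) -> str:
    # Render q to a fixed number of decimal places with *directed*
    # rounding: the printed value is >= q when round_up, <= q otherwise.
    scale = 10 ** digits
    units = ceil(q * scale) if round_up else floor(q * scale)
    sign, units = ("-", -units) if units < 0 else ("", units)
    return f"{sign}{units // scale}.{units % scale:0{digits}d}"

print(show_rational(Fraction(1, 3), 6, round_up=False))  # 0.333333
print(show_rational(Fraction(1, 3), 6, round_up=True))   # 0.333334
```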

MatthewDaggitt commented 1 year ago

Unfortunately making that change (https://github.com/vehicle-lang/vehicle/issues/673) didn't fix the problem. I'll think some more on this.