Open BenCoke12 opened 1 year ago
Thanks, will have a look.
Hmm this does seem to be larger than Marabou's precision: https://github.com/NeuralNetworkVerification/Marabou/blob/48a30dd7ee67e35fd9feba6d4e6d2b480ee8a38f/src/configuration/GlobalConfiguration.cpp#L35
I think this is also an issue for some of the counterexamples generated for robustness. The counterexample will be in the correct class but just slightly:
Counterexample predictions: [array([[ -4.3608975, -9.586009 , 1.8527187, -2.2896652, 6.23001 , -6.723092 , 6.2300096, -18.332565 , -3.3177466, -15.159739 ]],
Index 4 is the correct label, index 6 is just slightly lower.
Okay, I have a suspicion that it might be the code in Gaussian elimination where we transform it to floating point. I did this because I was trying to leverage dense native C vectors for speed, but it's no longer necessary as I switched to sparse Haskell vectors so we can continue to use Rational
everywhere.
I'm going to try switching to Rational
, however should also be careful that when we translate to strings for the Marabou backend we do so soundly (i.e. round away from the direction of the inequality)
Unfortunately making that change (https://github.com/vehicle-lang/vehicle/issues/673) didn't fix the problem. Will think some more on this.
The similarity specification checks that for all images that are: a) valid images and b) have sneaker as their highest score prediction the property holds that: the score for sandal is greater than the score for pullover
When running this command a counterexample is produced that obeys these properties. When I add an additional requirement c) The score for sneaker is higher than some threshold the counterexamples produced are not valid, either because the score for sneaker is not high enough or because the score for sandal is greater than the score for pullover (the property holds for the counterexample).
The margin of difference I have seen is typically quite small i.e. the required score for sneaker may be 1 and the counterexample scores 0.99999887. Or the scores for sandal and pullover could be: Score of pullover: 0.67682445 Score of sandal: 0.6768253 (score for sandal is slightly larger that score for pullover).
Specification: https://github.com/BenCoke12/FashionVerification/blob/main/vclspecs/fashionSimilarity.vcl Network: https://github.com/BenCoke12/FashionVerification/blob/main/onnxnetworks/fashion1l32n.onnx
command:
vehicle verify --specification vclspecs/fashionSimilarity.vcl --network classifier:onnxnetworks/fashion1l32n.onnx --verifier Marabou --verifierLocation ../Marabou/build/Marabou
testing script: https://github.com/BenCoke12/FashionVerification/blob/main/tools/testSimilarity.py The testing script contains four counterexamples I generated at different thresholds for sneaker score.