For MultiPL-T, we skipped tests that failed translation instead of discarding the whole problem. That is not appropriate for benchmarks, where each problem should keep its full test suite. When merging dev into main, we forgot to distinguish between the two cases. This PR adds a flag that enables skipping failing tests. By default, if any test fails translation, the whole problem is discarded, as it was before the MultiPL-T changes.
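A rough sketch of the intended behavior, for illustration only. The names `translate_tests`, `translate_one`, and `skip_failing_tests` are hypothetical and are not taken from the actual code; the sketch also assumes a translator signals failure by returning `None`.

```python
from typing import Callable, List, Optional


def translate_tests(
    tests: List[str],
    translate_one: Callable[[str], Optional[str]],
    skip_failing_tests: bool = False,
) -> Optional[List[str]]:
    """Translate every test of one problem.

    Returns the translated tests, or None if the problem should be discarded.
    `translate_one` is a hypothetical per-test translator that returns None
    when it cannot translate a test.
    """
    translated: List[str] = []
    for test in tests:
        result = translate_one(test)
        if result is None:
            if skip_failing_tests:
                # MultiPL-T mode: drop only the failing test.
                continue
            # Default (benchmark) mode: discard the whole problem.
            return None
        translated.append(result)
    return translated
```

With the flag left at its default, a single untranslatable test causes the whole problem to be dropped; with the flag set, only that test is dropped and the rest are kept.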