feat: Expanding API surface area by adding single matcher: toBeValidSarifFor(definition)

scalvert commented 3 years ago

This PR pivots the intent of the library slightly from providing the buildMatchers function publicly and expecting consumers to generate their own definition matchers to using a single matcher: toBeValidSarifFor. This accomplishes a few things:

hides the internal details of the buildMatcher function
makes the library more user friendly by providing a more useful, typed matcher

This PR adds:

an npm run generate command, which generates the following
- a new Definition union type, which is the union of all available definitions in the SARIF schema
a new matcher: toBeValidSarifFor(definition)
- allows consumers to pass in a definition, which will match the sub-type of the SARIF schema

jeffersonking commented 3 years ago

Have we tried moving the call to ajv.compile "higher up"? Also would leveraging ajv.getSchema('#/definitions/___') simplify things?

I'm a bit hesitant about the API "spam". Some of these definitions (and thus APIs) are getting into esoteric parts of the spec.

scalvert commented 3 years ago

Have we tried moving the call to ajv.compile "higher up"? Also would leveraging ajv.getSchema('#/definitions/___') simplify things?

Could you comment on the specific file you're referring to? Also, can you elaborate on why you think this would be beneficial (potential cacheability of the compile operation, for example). Thanks!

Also would leveraging ajv.getSchema('#/definitions/___') simplify things?

Could you explain why you think it'd simplify things? Most of that API surface area is private, so if there's a strong reason to make the private API calls simpler, I'm all for it, but for this specific callout it seems arbitrary (I may be missing something!).

I'm a bit hesitant about the API "spam". Some of these definitions (and thus APIs) are getting into esoteric parts of the spec.

I understand the sentiment, and thought about this myself. Some of the reasons I went this way are

The entire schema is public, therefore all parts of that schema are public (however esoteric!) and we should look to support them.
Supporting only a subset of the schema means that we have to pick and choose what's supported and what isn't. That seems largely arbitrary, specifically if we're targeting tools and libraries that will be producing SARIF logs. Only supporting a portion of those logs in testing seems like a gap.
The API 'spam' is mitigated since you aren't required to import anything specifically.

Happy to discuss further!

jeffersonking commented 3 years ago

Currently it looks to be compiling per toMatchSarif...() call. Between moving ajv.compile and using ajv.getSchema, I think you may be able to reduce that number. At the extreme, I think just once for the module. Also, ajv.getSchema may be able to remove the need for getFragmentSchema. Some of this may require shared/global ajv instance.

scalvert commented 3 years ago

It does, because each compilation is per unique schema. I'm sure we could reduce the number, but it's also a one-time cost as the index is imported/required. Additionally, this is test infrastructure, so it may end up with us over optimizing code targeted for test infra that, while important to ensure the perf of, may not result in any meaningful gain.

I'd like to try to balance the cleanliness and readability of the code with the need to optimize. Does that make sense?

jeffersonking commented 3 years ago

Here's an approximation/sketch. We can skip getSchema and getValidatorAndSchema and leverage Ajv for the caching and schema fragments. Also, the current version seems to be unintentionally locked on a SARIF version when using definitionName so treating that as a separate issue.

import Ajv from 'ajv';
import fetch from 'sync-fetch';

const sarifV2SchemaUrl = 'https://raw.githubusercontent.com/schemastore/schemastore/master/src/schemas/json/sarif-2.1.0-rtm.5.json';
const ajv = new Ajv({
  schemaId: 'auto',
  validateSchema: false,
});
ajv.addSchema(fetch(sarifV2SchemaUrl).json());

function buildSarifMatcher(definitionName?: string) {
  const keyRef = definitionName
    ? `${sarifV2SchemaUrl}#/definitions/${definitionName}`
    : sarifV2SchemaUrl;
  const validate = ajv.getSchema(keyRef);
  return received => {
    // Transform errors to messages here.
    return {
      actual: received,
      message: () => 'foobar',
      name: 'toMatchSarif___',
      pass: validate(received),
    };
  };
}

scalvert commented 3 years ago

This seems fine to move to creating a single instance in module scope, so long as that doesn't cause unforeseen issues WRT ajv's state. Specifically, if there's a conflict between schema validations as the result of the instance's internal state, we may need to keep it as is.

Also, the current version seems to be unintentionally locked on a SARIF version when using definitionName so treating that as a separate issue.

I'm not sure what you mean here - could you elaborate?

scalvert commented 3 years ago

@jeffersonking and I synced up offline. It turns out that referencing specific definitions within the parent schema is natively supported in ajv, so we're moving to that vs creating a specific sub-schema and loading that.

We're also exploring the pros/cons of using autogenerated APIs for all definitions vs. a single API that can be passed a definition string to reference the schema's definition itself.

Example:

Auto-generated API matcher:

expect(result).toBeValidSarifResult();

vs.

Generic definition matcher:

expect(result).toBeValidSarifFor('result');

The former allows the consumer to pick from an available list of definition matchers vs. having to know the string that corresponds to the definition. It does add a larger API surface area, which potentially can seem overwhelming to consumers.

That said, if the goal is to make this very easy for consumers to onboard to using/testing parts of the SARIF spec, using the former won't require them to look up the definitions they're needing to test. The counter argument could be that if consumers are constructing SARIF, and specific parts of the schema based on a definition, they'd likely know that definition name already, and thus could simply pass it in as a param.

I'm open to other's opinions on this, as I'm not tied to any one particular direction.

cc/ @jeffersonking @rwjblue

rwjblue commented 3 years ago

Great feedback! I was just chatting with @scalvert about this. I like both of these APIs (I think they would both satisfy our requirements for testability). If folks don't have a strong preference one way or the other, lets go with the latter suggestion .toMatchSarifFor('result'). This path still allows us to easily add the other matcher functions if we want, but limits the initial API surface.

scalvert commented 3 years ago

Yep, I'm inclined to agree. The way the first one evolved was through my initial work in attempting to improve dynamic typing for the buildMatcher API. It evolved to creating APIs for each definition, but @jeffersonking correctly pointed out that a smaller API surface area may be slightly better/simpler. Moving to using this API is likely an incremental improvement to exposing the buildMatcher function directly, so I think we get the best of both worlds.

I'll make those changes now.

jeffersonking commented 3 years ago

Glad we're having this discussion. I would not object either way. One slight tip towards the "generic" approach is that the docs (assuming they're auto-generated) may be easier to read or at least fit on one page.

scalvert commented 3 years ago

OK, this PR is ready to go.

microsoft / sarif-js-sdk

feat: Expanding API surface area by adding single matcher: toBeValidSarifFor(definition) #13