Create Endpoint To Retrieve Number of Documents With Triples

clarkepeterf commented 2 months ago

Problem Description: Datasets can be invalid based on missing or incorrectly applied triples. There is no current way to do this automatically. We want to validate datasets based on the number of documents containing each triple predicate in an automated way.

Expected Behavior/Solution: Create an API endpoint to return the number of documents containing each triple predicate (call into https://github.com/project-lux/lux-marklogic/blob/release1.24/scripts/checkPredicates.js) Requirements:

[ ] Create endpoint that returns output of https://github.com/project-lux/lux-marklogic/blob/release1.24/scripts/checkPredicates.js
[ ] Make sure endpoint is only accessible by role(s) defined in Needed for promotion below
[ ] Not ML requirement but - for QA - if the current dataset passes the test (hasn't strayed too much from the previous dataset) - should the current dataset become the new baseline?

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

~~- [ ] Wireframe/Mockup - Mike~~ ~~- [ ] Committee discussions - Sarah~~ ~~- [ ] Feasibility/Team discussion - Sarah~~ ~~- [ ] Backend requirements - TBD~~ ~~- [ ] Frontend requirements- TBD~~ ~~- [ ] Are new regression tests required for QA - Amy~~

Questions
- [ ] @azaroth42 - if the current dataset passes the test (hasn't strayed too much from the previous dataset) - should the current dataset become the new baseline?
- [ ] @azaroth42 @brent-hartwig Do we want to restrict the endpoint to the current user (instead of an admin specifying to run as a specific user?) QA can then just call the endpoint with each user they intend to test (they don't even need creds, could just call through the middle tier which has the appropriate credentials for each environment)

UAT/LUX Examples:

Example endpoints from LUX to use for development and testing, if applicable.

~~Dependencies/Blocks:~~

~~- Blocked By: Issues that are blocking the completion of the current issue.~~ ~~- Blocking: Issues being blocked by the completion of the current issue.~~

~~Related Github Issues:~~

~~- Issues that contain similar work but are not blocking or being blocked by the current issue.~~

~~Related links:~~

~~- These links can consist of resources, bugherds, etc.~~

~~Wireframe/Mockup:~~ ~~Place wireframe/mockup for the proposed solution at end of ticket.~~

brent-hartwig commented 2 months ago

if the current dataset passes the test (hasn't strayed too much from the previous dataset) - should the current dataset become the new baseline?

For the comparison part, perhaps the script could:

Be fed the baseline estimates and calculate the percent change for each.
Return in a format that can be easily captured for future reference/use.

brent-hartwig commented 2 months ago

Do we want to restrict the endpoint to the current user (instead of an admin specifying to run as a specific user?) QA can then just call the endpoint with each user they intend to test (they don't even need creds, could just call through the middle tier which has the appropriate credentials for each environment)

If we want a non-admin to be able to run this, yes, it would need to be limited to the endpoint consumer. That check would likely be for the full LUX endpoint consumer, which is to all documents but wouldn't help us know if the dataset was trying to slight a unit. I'm open to either way --we can change if needed.

Actually, the script has a configuration section. If that was turned into endpoint parameters, it could support both. Endpoint consumers that do not have the ability to invoke code as another user better only specify their username :)

clarkepeterf commented 2 months ago

@brent-hartwig My thinking is that the API can be called can as any endpoint consumer - then QA could run the script as lux-endpoint-consumer for the full dataset, and as lux-ypm-endpoint-consumer for a slice -respectively. Just have to keep baselines per-slice

clarkepeterf commented 2 months ago

@brent-hartwig Also, I was thinking ML would just spit out the numbers for each predicate. QA would handle the comparison on their side. But I'm happy to implement the comparison in ML too.

brent-hartwig commented 2 months ago

Yeah, I guess this is a good example of security over convenience. We could also provide a service account to QA that has the ability to invoke code as another user yet is not a full admin.

brent-hartwig commented 2 months ago

There are additional data validation-related scripts available. We could consider introducing the endpoint as a generic dataset reporting endpoint, adding more checks in as we go. Here are some of the others that came to mind. Those with enough value can be incorporated as part of this ticket or another.

All that come to mind:

checkPredicates.js: the script we've already been discussing.
comparePredicates.js (new!): compares all predicates in the dataset to those configured by the backend in order to surface a) configured predicates that do not exist and b) predicates that exist but are not configured.
getRecordTypesByPredicates.js: this one goes another level down by specifying which record types the predicates are found in. It could surface, for example, an agent predicate being in Person but not Group, which would be an issue if expected in both. The current version takes a couple minutes to run. I haven't looked to see if it could be optimized.
indexComparisonChecks.js: the primary data validation role is to check for fields and field range indexes the code is dependent on but not defined by the database. The top of the script lists limitations.
Number of values in fields: it's good to make sure the current dataset loaded all depended upon fields. Given #290 will likely take out our field value constants too, we may want (if don't already have) a script to count the values in fields. We could add this to indexComparisonChecks.js but check all configured in the database (versus the subset the script was able to confirm is referenced by the code).

roamye commented 1 week ago

TF 11/20: Discussed validation - questions about running internally vs. QA. Agenda for 11/22 team mtg.

project-lux / lux-marklogic

Create Endpoint To Retrieve Number of Documents With Triples #306