Create New Test Dataset for Local MarkLogic Environments

clarkepeterf commented 4 months ago

Problem Description: The current test dataset is very old and is missing many of the properties used in current indexes. As a result, many searches and test cases do not work unless extra documents are imported.

Expected Behavior/Solution: Create an updated test dataset with properties that match current indexes.

Requirements:

Needed for promotion:

~~- [ ] Wireframe/Mockup - Mike~~
~~- [ ] Committee discussions - Sarah~~
~~- [ ] Feasibility/Team discussion - Sarah~~
~~- [ ] Backend requirements - TBD~~
~~- [ ] Frontend requirements - TBD~~
~~- [ ] Are new regression tests required for QA - Amy~~
~~- [ ] Questions~~
~~- List of questions for discussions. Answers should be documented within the issue.~~

UAT/LUX Examples:

Dependencies/Blocks:

~~- Blocked By: Issues that are blocking the completion of the current issue.~~
~~- Blocking: Issues being blocked by the completion of the current issue.~~

Related Github Issues:

~~- Issues that contain similar work but are not blocking or being blocked by the current issue.~~

Related links:

~~- These links can consist of resources, bugherds, etc.~~

Wireframe/Mockup: ~~Place wireframe/mockup for the proposed solution at end of ticket.~~

roamye commented 3 months ago

@clarkepeterf - agenda. Unsure where this ticket should go.

clarkepeterf commented 3 months ago

Discussed in the team meeting on 2024/08/07: try crawling GitHub issues and grabbing the records that were mentioned in them.
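A minimal sketch of the extraction step, assuming the issue bodies and comments have already been downloaded as plain strings. The URI pattern below is an assumption made for illustration and would need adjusting to the URI shapes that actually appear in the issues; `extract_record_uris` is a hypothetical helper, not existing project code.

```python
import re

# ASSUMPTION: record URIs look like
#   https://lux.collections.yale.edu/data/<type>/<uuid>  (or /view/ instead of /data/)
# Adjust the pattern to whatever URI shapes the issues really contain.
RECORD_URI = re.compile(
    r"https://lux\.collections\.yale\.edu/(?:data|view)/[a-z]+/[0-9a-f-]{36}"
)


def extract_record_uris(texts):
    """Return the unique record URIs found in the given strings,
    in order of first appearance."""
    seen = {}
    for text in texts:
        for uri in RECORD_URI.findall(text):
            # A dict preserves insertion order and drops duplicates.
            seen.setdefault(uri, None)
    return list(seen)
```

Any URI collected this way would still need to be checked against the current dataset before being used as a test record.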

clarkepeterf commented 3 months ago

May have problems with some records mentioned in issues because their URIs change over time.

brent-hartwig commented 3 months ago

@clarkepeterf, I have a 593K dataset locally that Rob provided in connection with data slices (#73). It only has YCBA and YUAG data, yet it is better than the one from 2022. Undoubtedly, there will be shortcomings, but let me know if you'd like to give it a shot while this ticket is in the backlog.

As for creating a new dataset: if we were given the opportunity, and it were deemed better than getting one from the pipeline, we could come up with a way to get a representative sample from a full dataset. I'm thinking of some of each entity type (a combination of statically defined, well-connected records plus a specified number of random-ish ones), plus the records connected to them, either by specified predicates or by all predicates found in the records, and keep going n hops. If we developed such a script, we could refresh our dataset as new datasets are created or updated. The script would take a while to run, so we'd want to encapsulate it in CoRB, scheduled tasks, or possibly Flux.
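To make the idea concrete, here is a rough sketch of the selection logic only, assuming the full dataset is available as an in-memory mapping. The record shape (`type`, `links`), the function name `sample_subset`, and all parameter names are hypothetical; a real version would run against MarkLogic (for example via CoRB) rather than a Python dict.

```python
import random
from collections import deque


def sample_subset(records, seeds, random_per_type, max_hops,
                  predicates=None, rng=None):
    """Pick seed records, add a few random ones per entity type,
    then follow links breadth-first for up to max_hops hops.

    records: dict of uri -> {"type": str, "links": [(predicate, target_uri), ...]}
             (hypothetical shape, for illustration only)
    predicates: set of predicates to follow, or None to follow all of them
    """
    rng = rng or random.Random(0)  # fixed seed keeps runs repeatable

    # Start with the statically defined, well-connected records.
    selected = {uri for uri in seeds if uri in records}

    # Add a specified number of random-ish records of each entity type.
    by_type = {}
    for uri, rec in records.items():
        by_type.setdefault(rec["type"], []).append(uri)
    for _, uris in sorted(by_type.items()):
        pool = sorted(u for u in uris if u not in selected)
        selected.update(rng.sample(pool, min(random_per_type, len(pool))))

    # Breadth-first expansion: every starting record is at hop 0.
    queue = deque((uri, 0) for uri in sorted(selected))
    while queue:
        uri, depth = queue.popleft()
        if depth >= max_hops:
            continue
        for predicate, target in records[uri]["links"]:
            if predicates is not None and predicate not in predicates:
                continue
            if target in records and target not in selected:
                selected.add(target)
                queue.append((target, depth + 1))
    return selected
```

Limiting the traversal by `predicates` and `max_hops` is what keeps the sample from growing to the whole dataset, since well-connected records can otherwise reach most of the graph within a few hops.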

cc: @prowns, @jffcamp, @azaroth42, @kkdavis14

prowns commented 3 months ago

From the 8/7 team meeting notes, scope of the set: looking for records that have lots of permutations. Every type of document, hit every index, use all MT HAL links.

@azaroth42, @kkdavis14 and @prowns to make a first pass at features/records. @brent-hartwig - is there a clever ML solution for this?

brent-hartwig commented 3 months ago

@prowns, I'm not aware of a feature or "database crawler" intended to export a subset of a dataset in which the records are interconnected. As described above, I believe we could write such code and reuse it as the dataset changes.

project-lux / lux-marklogic

Create New Test Dataset for Local MarkLogic Environments #263