project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Issue With Date Facets (from 1032) #37

Open gigamorph opened 4 months ago

gigamorph commented 4 months ago

Problem Description: In some situations (the exact scenarios are not yet known), applying a date facet yields zero results.

Example: A search that includes this record https://lux.collections.yale.edu/view/activity/9ed8ad50-f405-4cf9-9bb7-180048eeb078 offers a date facet value of 2017-2017. Applying that facet value generates a search that has no results.

Expected Behavior: When applying a facet, there should always be results. In this case it should include the referenced record, rather than no results.

Link: Note that clicking back from this state results in the facets not being updated. This is lux-web#1325 (closed)

Dependency/Blocked

Link: https://lux.collections.yale.edu/view/results/events?q=%7B%22name%22%3A%22small-great%22%7D

Reference: https://git.yale.edu/lux-its/marklogic/issues/471#issuecomment-14626 included diagrams and modelling

Old tix: https://git.yale.edu/lux-its/marklogic/issues/1032

Screenshot: image

ar2674 commented 4 months ago

From old tix: image

Example Queries: Top left red bar, Top right red bar, Top left green bar, Top right green bar, Middle green bar, Bottom green bar

brent-hartwig commented 3 months ago

Expected Behavior: When applying a facet, there should always be results. In this case it should include the referenced record, rather than no results.

@jffcamp and @prowns, the expected behavior is not always achievable at scale. Facets are calculated on unfiltered results, that is, results determined via indexes alone. We can investigate specific instances but need to reset expectations.

Investigating specific instances may surface a bug or means to index the data better.

brent-hartwig commented 3 months ago

The cited record is an event that started in Feb 2017 and ended in Jun 2017:

image

I am able to reproduce the issue by performing a Simple Search of events matching "Small-Great Objects: Anni and Josef Albers in the Americas, Yale University Art Gallery, New Haven" then using the date facet.

When I repeat this as an Advanced Search, I am able to get the search result when start date is >= 2017 and/or end date is <= 2017. I cannot get the search result using the other operators, even ones that should return it. For instance, start date >= 2000 returns the 2017 result but start date > 2017 does not (yet should).

I suspect a bug in the data search pattern.

brent-hartwig commented 3 months ago

Revised on 20 Mar 24.

The following screenshot illustrates a search that may not return everything it should. At least through release 1.12, this search will only return Objects that have a produced_by or created_by timespan where the start of the range is before Jan 1, 2023 and the end of the range is after Dec 31, 2023. I confirmed the first search result's values for the associated fields meet these criteria: 1988 and 2989, respectively. This appears to cover only the "middle green bar" portion of the overlapping date range.

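cts.andQuery([
  // Restrict to object documents.
  cts.jsonPropertyValueQuery(
    'dataType',
    ['DigitalObject', 'HumanMadeObject'],
    ['exact']
  ),
  cts.andQuery([
    // End of the timespan is after 2023-12-31T23:59:59Z (1704067199).
    cts.fieldRangeQuery('itemProductionEndDateFloat', '>', 1704067199, [], 1),
    // Start of the timespan is before 2023-01-01T00:00:00Z (1672531200).
    cts.fieldRangeQuery('itemProductionStartDateFloat', '<', 1672531200, [], 1),
  ]),
]);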
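
To make the two epoch values and the effect of the paired range queries easier to see, here is a minimal plain-JavaScript sketch. It does not call MarkLogic. The helper names `yearBounds` and `spansWholeRange` and the `{ start, end }` record shape are assumptions made for illustration, not part of the LUX code base.

```javascript
// Epoch seconds (UTC) for the first and last second of a calendar year.
function yearBounds(year) {
  const start = Date.UTC(year, 0, 1) / 1000;
  const end = Date.UTC(year + 1, 0, 1) / 1000 - 1;
  return { start, end };
}

// Mirrors the two range queries above: the record's timespan must begin
// before the query range starts and end after the query range ends.
function spansWholeRange(record, range) {
  return record.start < range.start && record.end > range.end;
}

const y2023 = yearBounds(2023);

// A timespan running from 1988 to 2989, like the first search result.
const longRunning = { start: yearBounds(1988).start, end: yearBounds(2989).end };

// A hypothetical timespan that lies entirely inside 2023.
const within2023 = {
  start: Date.UTC(2023, 2, 1) / 1000,
  end: Date.UTC(2023, 5, 1) / 1000,
};

console.log(y2023);                               // bounds used by the query
console.log(spansWholeRange(longRunning, y2023)); // matches
console.log(spansWholeRange(within2023, y2023));  // does not match
```

Under these assumptions, a timespan that falls wholly within 2023 is rejected even though it clearly intersects the requested year, which is consistent with only one portion of the overlapping date range being covered.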

image

brent-hartwig commented 3 months ago

@azaroth42 and @kkdavis14, for the example in the description, we have a disconnect between the query and data:

image

Frontend and backend interpretations:

From record https://lux.collections.yale.edu/data/activity/9ed8ad50-f405-4cf9-9bb7-180048eeb078: image

It's not clear to me what should change. What comes to your mind?

kkdavis14 commented 3 months ago

I think it should be: if there's no eventInitiatedEndDateFloat, does eventInitiatedStartDateFloat/seconds_since_epoch_begin_of_the_begin fall within the query.

I don't know how eventInitiatedEndDateFloat is configured but it needs to change, because in this case there really isn't one; it shouldn't be pulling from end_of_the_end. We don't have an end_of_the_begin here, and it's not required in Linked.art. If there needs to be an eventInitiatedEndDateFloat/seconds_since_epoch_end_of_the_begin, where do we compute that?

In short, the query is about the start, and the second date we're using here is the end. With that said, perhaps we need to add a query for end date to events faceting.
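
One way to read the suggestion above, sketched in plain JavaScript rather than as a cts query: when a record has no end value, test its start alone against the query range. The function name `matchesDateRange` and the `{ start, end }` shape are hypothetical, and the branch for records that do have an end value is an assumed overlap test, not something this comment specifies.

```javascript
// Hypothetical record shape: start and end are epoch seconds; end may be absent.
function matchesDateRange(record, range) {
  if (record.end == null) {
    // No end value: does the start fall within the query range?
    return record.start >= range.start && record.start <= range.end;
  }
  // Assumption: when both values exist, accept any overlap with the range.
  return record.start <= range.end && record.end >= range.start;
}

// Query range covering calendar year 2017 (UTC).
const range2017 = {
  start: Date.UTC(2017, 0, 1) / 1000,
  end: Date.UTC(2018, 0, 1) / 1000 - 1,
};

// An event that started in Feb 2017 with no usable end value.
const startedFeb2017 = { start: Date.UTC(2017, 1, 1) / 1000, end: null };

// A hypothetical event that started in 2016 with no usable end value.
const started2016 = { start: Date.UTC(2016, 5, 1) / 1000, end: null };

console.log(matchesDateRange(startedFeb2017, range2017));
console.log(matchesDateRange(started2016, range2017));
```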

brent-hartwig commented 3 months ago

@kkdavis14, LUX presently supports a single date-related search pattern. It is geared towards utilizing the start and end of a timespan where the operator is the deciding factor. This aligns with @azaroth42's argument/requirement that we should always deal in timespan ranges versus a single point in time. That said, I believe:

  1. It is difficult for users to understand how the backend is applying their date criteria.
  2. The frontend may need to change how it specifies date criteria/facets.
  3. Whether it be via backend code changes and/or data changes, we may need to close some disconnects.

A meeting may be the best way to proceed, after I have time to investigate additional date-related issue reports. My next comment will speak to another instance of excluding an item due to operating against two points in time vs. one.

kkdavis14 commented 3 months ago

Sure thing. I am sure I do not understand the whole scope of the issue, but I know Rob is tied up with LUX for science right now so I took a crack at it. I agree we should deal in timespan ranges, if it's the correct timespan, which it doesn't look like we're using right now.

brent-hartwig commented 3 months ago

@kkdavis14, you shared a search with me whereby the results were missing a result you expected. I modified the search to reduce the number of search results but it would include your search result had it not failed the end date portion of the search criteria.

The search is for objects containing "coffeepot" in the (primary) name that were created between 1730 and 1800. This yields 11 results and excludes https://lux.collections.yale.edu/data/object/000de659-fce8-4b8d-9b4a-c9bb3c97bc61, which was created sometime between 1790 and 1840. Because this object could have been created between 1790 and 1800, I believe this object should have been included in the search results. Unlike the previous example, the data has values for both indexes the query used. However, the query required the end of the range to be 1800 or earlier, which excludes this record as the end of its range is 1840. Something appears amiss with the "overlapping" portion of the overlapping date range search pattern. I will investigate and potentially follow up via a separate ticket given there is enough difference between the two examples.
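
The difference between the behavior described here and an interval-overlap test can be sketched with the years from this example. The function names and the inclusive `{ start, end }` year pairs are assumptions for illustration; `fallsWithin` only approximates the observed behavior and is not taken from the backend code.

```javascript
// Inclusive year ranges from the example above.
const createdBetween = { start: 1730, end: 1800 }; // search criteria
const coffeepot = { start: 1790, end: 1840 };      // excluded record

// Approximates the observed behavior: the record's whole timespan
// must sit inside the query range.
function fallsWithin(span, range) {
  return span.start >= range.start && span.end <= range.end;
}

// Overlap: the two ranges share at least one year.
function overlaps(span, range) {
  return span.start <= range.end && span.end >= range.start;
}

console.log(fallsWithin(coffeepot, createdBetween)); // false: 1840 is after 1800
console.log(overlaps(coffeepot, createdBetween));    // true: 1790-1800 is shared
```

With an overlap test, the record is included because the years 1790 through 1800 are common to both ranges.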

brent-hartwig commented 3 months ago

This is a fun one. Below is my current understanding, some of which may contradict previous statements.

brent-hartwig commented 3 months ago

The following is informed by conversations with and input from Rob, Kelly, and Peter.

High-Level Conclusions

  1. Versions of the date search pattern are yet to correctly implement open- and closed-ended ranges at the same time.
  2. The current implementation fully supports open-ended ranges but only partially supports closed-ended ranges. In the following diagram, only Bar E is supported for closed-ended ranges.

image

  3. There is a conflict between the dataset and search whereby the dataset expects search to apply default values to timelines (or otherwise include documents) yet search isn't geared to do so, resulting in scenarios when anticipated documents are excluded from search results.

Proposed Changes

Data

Backend

Middle Tier

TBD

Frontend

Example of requiring items encountered between 1600 and 1610, inclusive:

{
   "_scope":"item",
   "_comp":"=",
   "encounteredDate":"1600;1610"
}
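
As a rough illustration of how a consumer might interpret such a value, the sketch below parses a "start;end" string into inclusive epoch-second bounds. It assumes both parts are four-digit years, which is only one possible reading of the example; the function name `parseYearRange` is hypothetical and the actual backend parsing is not shown in this thread.

```javascript
// Assumption: the value is "<startYear>;<endYear>", both four-digit years,
// both inclusive. Years below 0100 are not handled by this sketch.
function parseYearRange(value) {
  const m = /^(\d{4});(\d{4})$/.exec(value);
  if (!m) {
    throw new Error(`Unrecognized date range: ${value}`);
  }
  const startYear = Number(m[1]);
  const endYear = Number(m[2]);
  if (startYear > endYear) {
    throw new Error(`Start year is after end year: ${value}`);
  }
  return {
    // First second of the start year (UTC).
    start: Date.UTC(startYear, 0, 1) / 1000,
    // Last second of the end year (UTC).
    end: Date.UTC(endYear + 1, 0, 1) / 1000 - 1,
  };
}

const encountered = parseYearRange('1600;1610');
console.log(new Date(encountered.start * 1000).toISOString());
console.log(new Date(encountered.end * 1000).toISOString());
```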

Next Steps

Before pursuing all of the above changes, we should test the direction we're heading:

  1. Collect/document a complete set of test cases.
  2. Update the search pattern.
  3. Test via query console or backend endpoint.
    • Verify results are accurate.
    • Document filtered and unfiltered counts. May need to explore options if pattern changes result in a greater difference between the two.
    • Skip Events. All other record types should be fair game.

brent-hartwig commented 1 month ago

Teams thread checking in with Rob and Kelly on the above-proposed timespan data changes.

roamye commented 3 weeks ago

The Teams thread linked above lists an issue: https://github.com/project-lux/data-pipeline/issues/22. This issue is now closed. Is there anything else needed to move this forward with the proposed changes listed above?

cc: @prowns

brent-hartwig commented 3 weeks ago

Is there anything else needed to move this forward with the proposed changes listed above?

@roamye, if inclusive of the Next Steps suggested at the bottom of the same comment the Proposed Changes are in, then no. But I would advise against just diving into this ticket's implementation. I think we need a means to comprehensively address issues with date searches and be able to use that means in the future for regression testing. I'm open to what that means is, including bringing back automated unit tests.