Queries not respecting unchecked show fields in results for duplicating records

acbentley commented 2 years ago

Describe the bug When running a query with the show button unchecked for certain fields, the query result still duplicates records based on that field

To Reproduce Steps to reproduce the behavior:

Go to Default Collection Object query in freshfish database
Add record 1234 to query. Note that citation field has show unchecked
Run query
See error - CO 1234 is duplicated based on the record having two citations.

Expected behavior I would expect queries to respect unchecked status of fields for duplication as in Sp 6

Screenshots Capture2 Capture3

Desktop (please complete the following information):

OS: [e.g. Windows 10]
Browser [e.g. Chrome, Safari]
Specify 7 Version [e.g. 7.6.1]

Reported By Name of your institution

Additional context Add any other context about the problem here.

benanhalt commented 2 years ago

Based on the behavior of Sp6, turning off "Show" changes the logic of the QB field from a normal logical predicate to an existential quantifier. Further experimentation reveals that applying "not" operates on the existence quantifier and not the other way around. That is "not exists P" rather than "exists not P". It's not clear to me yet where in the join chain the quantification occurs in fields which traverse more than a single one-to-many relationship.

I believe a more principled way to achieve the desired behavior is to make use of the distinct option on queries. Arguably, all queries should return distinct tuples in a relational model. This has a side effect of disabling the linking results to the base table records. I think the best way to resolve this would be to check if the query includes a unique field from the base table, and include the links in that case.

benanhalt commented 2 years ago

cf. #1464

maxpatiiuk commented 2 years ago

See also the following comment from @mcruz-umich:

I have developed the following process for removing "Tissues" from our publications.

It involves downloading the zip file from DataExporter, then unzipping it, modifying the csv, and then re-zipping it and deploying.

There is a bug in the "Schema Mapper" in which it checks for duplicate records BEFORE running respecting "distinct" checkbox. This bug therefore means you cannot query at the gift or preparation level and then reduce multiple rows down to a single row for the same Cat # even if you have the columns set to "do not show". The "Schema Mapper" logic needs to process the "distinct" checkbox FIRST before checking for duplicates. Since that is not working in 6.8.00.... I have developed the following process for removing gift records and stripping "Tissue - N" from the aggregated preparations field.

OPENOFFICE is used below to leverage a visual view of the csv data such that rows can be sorted and deleted easily.

SP DATAEXPORTER: Export from DataExporter - this will include gifts and empty preparations

OPENOFFICE: Open the csv in OpenOffice, sort by "preparations" and delete rows that have none

SPECIFY: Run a query to get a list of the gifts

SUBLIME TEXT: Open the csv in SublimeText and search for the gift numbers and delete those rows REGEX: ,(gift_num_1|gift_num_2|gift_num_3|...gift_num_n),

SUBLIME TEXT: Search for "tissue" prepTypes and remove the string but keep the row. This will affect the Preparations-aggregation column and may result in a blank cell if it had only contained tissue REGEX: (tissue( - \d)?;? ?)|(; tissue( - \d)?)+

Open in OpenOffice again and sort by Catalog Number

Originally posted in https://github.com/specify/webportal-installer/issues/65#issuecomment-1228745228

realVinayak commented 11 months ago

I played around with Ben's idea (https://github.com/specify/specify7/issues/1450#issuecomment-1155534940) in https://github.com/specify/specify7/commit/95d2de77f0b96bc5e2d829a981484d0f448773ee.

Specifically, https://github.com/specify/specify7/blob/95d2de77f0b96bc5e2d829a981484d0f448773ee/specifyweb/stored_queries/utils.py#L106-L116

It also checks for uniqueness rules defined in business rules. The logic is that if the set of fields from the base table is a superset of any of the uniqueness rules, then we can add recordid in distinct mode. The implementation also considers scoping done by query builder For example, catalognumber and collection form a uniqueness rule, and if you selected catalognumber and text3 in query builder with collection object as base, and distinct checked, you can STILL get recordid (because query builder adds a collection scope). So, the above solution checks if {catalognumber, text3, collection} (from QB) is a superset of {catalognumber, collection} (from Uniqueness Rules).

The backend is mostly done (or at least we have a generic solution). The remaining work on backend is also to treat primary key in base table as global unique.

HOWEVER, frontend does not expect recordid when distinct, so currently it just crashes after backend adds recordid in distinct mode. We could replicate the logic of deciding whether to add recordid in distinct mode on the frontend, so it knows how many columns to expect, or could infer it from the response (via column count, very hacky) or a response field.

acwhite211 commented 9 months ago

I'm working on a PR #4541 to expand on the solution implemented in the issue-1450 branch that is based on the solution suggested by Ben and Vinny.

Based on the behavior of Sp6, turning off "Show" changes the logic of the QB field from a normal logical predicate to an existential quantifier. Further experimentation reveals that applying "not" operates on the existence quantifier and not the other way around. That is "not exists P" rather than "exists not P". It's not clear to me yet where in the join chain the quantification occurs in fields which traverse more than a single one-to-many relationship.

I believe a more principled way to achieve the desired behavior is to make use of the distinct option on queries. Arguably, all queries should return distinct tuples in a relational model. This has a side effect of disabling the linking results to the base table records. I think the best way to resolve this would be to check if the query includes a unique field from the base table, and include the links in that case.

I'm also looking into implementing a simpler fix brought up by Max https://github.com/specify/specify7/issues/1464#issuecomment-1286985113

When running a distinct query, can we mimic a distinct query using "GROUP BY" instead of "SELECT DISTINCT" and include the id in the query as a CONCAT(id, ", ") (which would be parsed on the front-end into separate records)?

I'm working on both solutions, anyone have opinions on which solution they prefer? @maxpatiiuk @realVinayak @melton-jason?

realVinayak commented 9 months ago

It would be more faster to do Max' solution imo. The backend change is easy and you could display the resources (after parsing) using RecordSelectorFromIds (see usage in deletion blockers if you want to see isolated usage).

CarolineDenis commented 9 months ago

Will be fixed in https://github.com/specify/specify7/issues/4607

acwhite211 commented 9 months ago

This issue is now moved into #4607

specify / specify7

Queries not respecting unchecked show fields in results for duplicating records #1450