pepkit / pipestat

Pipeline results reporting package
https://pep.databio.org/pipestat/
BSD 2-Clause "Simplified" License
4 stars 2 forks source link

pipestat retrieval breaks with common prefix #159

Closed nsheff closed 5 months ago

nsheff commented 5 months ago

If I have two record identifiers that share a common prefix, pipestat will return the value for the shorter one even when you request the longer one.

This may be the root of this problem I found earlier in looper: https://github.com/pepkit/looper/issues/470

Watch this:

Create pipestatmanager object

psm3 = pipestat.PipestatManager(
    record_identifier="sample1",
    results_file_path="analysis/bug_test.yaml",
    schema_path="analysis/seqcol_pipestat_schema.yaml",
)

Here's my seqcol_pipestat_schema.yaml

title: Refget henge back-end schema
description: Allows pipestat to be used as a simple key-value store.
type: object
properties:
  pipeline_name: "refget"
  samples:
    type: object
    properties:
      value:
        type: string
        description: "Value of the object referred to by the key"

Report/retrieve results is broken if you use similar sample names:

psm3.report({"value": "abcdefg"}, record_identifier="sample1")
psm3.report({"value": "12345"}, record_identifier="sample")
psm3.retrieve_one("sample", "value")
# '12345'
 psm3.retrieve_one("sample1", "value")  # should return 'abcdefg'
# '12345'

And continuing:

psm3.report({"value": "This is a new value"}, record_identifier="sample1", force_overwrite=True)
psm3.retrieve_one("sample1", "value")
# '12345'

It seems that if anything matches the first prefix of the sample, it will return that somehow. This is a critical bug.

The problem is with retrieval, not with reporting, because the file itself is actually correct, it shows:

refget:
  project: {}
  sample:
    sample1:
      value: This is a new value
      pipestat_created_time: '2024-02-21 07:49:04'
      pipestat_modified_time: '2024-02-21 07:50:52'
    sample:
      value: '12345'
      pipestat_created_time: '2024-02-21 07:49:13'
      pipestat_modified_time: '2024-02-21 07:49:13'
donaldcampbelljr commented 5 months ago

Additional info: Confirmed it is just for the filebackend.

donaldcampbelljr commented 5 months ago

Found it: if record_identifier in filter_condition["value"]:

should be

if record_identifier == filter_condition["value"]:

I'm failing a couple of tests with this change so I need to investigate that.

donaldcampbelljr commented 5 months ago

This is fixed

nsheff commented 5 months ago

Can you add in a test that mimics the above? Like, have two samples named sample1 and sample, so one is a prefix of the other, and show you can retrieve them both correctly?

nsheff commented 5 months ago

Might be good to see if this solves https://github.com/pepkit/looper/issues/470 (which should probably move to pipestat)

donaldcampbelljr commented 5 months ago

Yes, I already added that test

nsheff commented 5 months ago

I can confirm the fix is working correctly in my hands, thanks.