pepkit / pipestat

Pipeline results reporting package
https://pep.databio.org/pipestat/
BSD 2-Clause "Simplified" License
4 stars 2 forks source link

Pipestat interface suggestions #158

Closed nsheff closed 3 months ago

nsheff commented 5 months ago

clutter in return value

It's distracting that every time I retrieve a value, I get all this other stuff (created time, modified time, record identifier):

psm["sample2"] = {"value": "hijklmn"}
>>> psm["sample2"]
{'value': 'hijklmn', 'pipestat_created_time': '2024-02-20 21:44:13', 'pipestat_modified_time': '2024-02-20 21:44:13', 'record_identifier': 'sample2'}

I would expect it to just say:

psm["sample2"] = {"value": "hijklmn"}
>>> psm["sample2"]
{'value': 'hijklmn''}

why can't retrieve use the class record identifier?

I can do this:

psm = pipestat.PipestatManager(
    record_identifier="sample1",
    results_file_path="analysis/temp.yaml",
    schema_path="analysis/seqcol_pipestat_schema.yaml",
)
psm.report({"value": "abcdefg"})

It knows to put it on the record id I started with. But then why doesn't this work?

psm.retrieve_one("value")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 101, in inner
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 713, in retrieve_one
    raise RecordNotFoundError(f"Record '{record_identifier}' not found")
pipestat.exceptions.RecordNotFoundError: Record 'value' not found

That's confusing to me

docs are wrong

following the tutorial, docs say this: image

But I don't get None -- I get an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 101, in inner
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 713, in retrieve_one
    raise RecordNotFoundError(f"Record '{record_identifier}' not found")
pipestat.exceptions.RecordNotFoundError: Record 'sample1' not found

There are many other problems with the tutorial. Someone needs to actually go through it and run it. https://pep.databio.org/pipestat/code/python-tutorial/

Reporting result is returning a list

psm.report({"value": "arg"}, record_identifier="sample2")
["Reported records for 'sample2' in 'refget' :\n - value: arg", "Reported records for 'sample2' in 'refget' :\n - pipestat_created_time: 2024-02-20 21:53:22", "Reported records for 'sample2' in 'refget' :\n - pipestat_modified_time: 2024-02-20 21:53:22"]
donaldcampbelljr commented 3 months ago

Regarding clutter in return value: The above example (>>> psm["sample2"]) is a simple wrapper around retrieve_one: https://github.com/pepkit/pipestat/blob/1fe44eb33b353f9b2ea927cd58f604c39e53db54/pipestat/pipestat.py#L285-L288

retrieve_one currently returns everything for that record unless you specify result identifiers.

donaldcampbelljr commented 3 months ago

Regarding

why can't retrieve use the class record identifier?

With the above example, psm.retrieve_one("value") , pipestat thinks that value is the record_identifier.

image

However, if I attempt to specify the arguments, result = psm.retrieve_one(result_identifier="value"), I see that there is a problem and pipestat is not pulling the class record_identifier appropriately.

psm1 = PipestatManager(record_identifier="RECORD1", schema_path="sample_output_schema.yaml", results_file_path="results.yaml", pipeline_type="sample")
psm1.report(values={"number_of_things": 100})
result = psm1.retrieve_one(result_identifier="number_of_things")

Error

  File "/home/drc/PythonProjects/pipestat/opt_dependencies/venv6/lib/python3.10/site-packages/pipestat/pipestat.py", line 708, in retrieve_one
    raise RecordNotFoundError(
pipestat.exceptions.RecordNotFoundError: Results '['number_of_things']' for 'None' not found
donaldcampbelljr commented 3 months ago

Regarding

Reporting result is returning a list

It appears that this was intentional: https://github.com/pepkit/pipestat/commit/acf3a545c1fccd012a5ac12e50b79224129417e7

We wanted to return formatted results. For example, I know that PyPiper takes this returned formatted value and logs it. Is there an issue with it being a list? Should we just concatenate the formatted results into one big string?

nsheff commented 3 months ago

We wanted to return formatted results.

I think the return value of the report function should not be a list of strings. It should be some recognition about what was reported that is machine-understandable. There could be a separate function to take that programmatic output and produce a formatted string if that's what someone wants.

generally you want a function to return something that can be used more universally, unless the whole point of the function is just to format a string.

donaldcampbelljr commented 3 months ago

I've made associated issues for initial comment, so I will mark this specific issue as solved (close it) and work on the child issues.