pepkit / pipestat

Pipeline results reporting package
https://pep.databio.org/pipestat/
BSD 2-Clause "Simplified" License

pipestat should offer the option to include a history of results #177

Closed donaldcampbelljr closed 3 months ago

donaldcampbelljr commented 3 months ago
I flipped force_overwrite to default to True, and I will do the same in PyPiper. However, pipestat still needs to offer a history of results:

In the longer term, pipestat should offer the option to include a history of results, and these should be stored somehow in the file (and database). This may not actually be too hard to implement; just add a 'history' function, and when something is overwritten, just move the old values into the history in a way that is an array, rather than a single value. Then, pipestat could offer a clear history function to remove old stuff, if desired, but otherwise, repeated reports of the same result will simply add to the history.

Originally posted by @donaldcampbelljr in https://github.com/pepkit/pipestat/issues/161#issuecomment-2037737804
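
To make the quoted proposal concrete, here is a minimal sketch of the "move the old value into the history on overwrite" idea using a plain dict as the record. The names `report`, `clear_history`, and `HISTORY_KEY`, and the shape of each history entry, are assumptions for illustration only, not pipestat's actual API.

```python
from datetime import datetime
from typing import Any, Dict, Optional

HISTORY_KEY = "history"  # hypothetical key name for the per-record history


def report(record: Dict[str, Any], result_id: str, value: Any,
           now: Optional[str] = None) -> None:
    """Store a value; if one already exists, append the old value to an array."""
    timestamp = now or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    if result_id in record:
        history = record.setdefault(HISTORY_KEY, {})
        # The history for each result is a list, so repeated reports accumulate.
        history.setdefault(result_id, []).append(
            {"time": timestamp, "value": record[result_id]}
        )
    record[result_id] = value


def clear_history(record: Dict[str, Any], result_id: Optional[str] = None) -> None:
    """Drop the history for one result, or for the whole record if none is given."""
    if result_id is None:
        record.pop(HISTORY_KEY, None)
    else:
        record.get(HISTORY_KEY, {}).pop(result_id, None)


record: Dict[str, Any] = {}
report(record, "number_of_things", 100, now="2024-04-04 18:28:43")
report(record, "number_of_things", 50000, now="2024-04-04 18:28:58")
```

After the second call, the record holds the new value and the history holds the old one, stamped with the time of the overwrite. The sketch does not guard against a result that is itself named `history`.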

donaldcampbelljr commented 3 months ago

I'm playing around with different options in PR #178. A proof of concept is working for the file backend, but the results file grows quickly. I am considering moving the history to a separate .history.yaml file that sits alongside results.yaml so that it is less messy.

test_pipe:
  project: {}
  sample:
    pypiperRecordIdentifier1:
      number_of_things: 300
      pipestat_created_time: '2024-04-04 14:16:54'
      pipestat_modified_time: '2024-04-04 14:16:54'
    RECORD1:
      number_of_things: 50000
      pipestat_created_time: '2024-04-04 17:23:56'
      pipestat_modified_time: '2024-04-04 18:28:59'
      name_of_something: Another_Name
      history:
        number_of_things:
          '2024-04-04 18:28:43':
            reported_result: 100
          '2024-04-04 18:28:58':
            reported_result: 50000
        pipestat_modified_time:
          '2024-04-04 18:28:43':
            reported_result: '2024-04-04 18:28:43'
          '2024-04-04 18:28:58':
            reported_result: '2024-04-04 18:28:58'
          '2024-04-04 18:28:59':
            reported_result: '2024-04-04 18:28:59'
        name_of_something:
          '2024-04-04 18:28:43':
            reported_result: Test_Name
          '2024-04-04 18:28:59':
            reported_result: Another_Name
    RECORD2:
      number_of_things: 300
      pipestat_created_time: '2024-04-04 17:23:56'
      pipestat_modified_time: '2024-04-04 18:28:56'
      name_of_something: Test_Name_Changed...Again
      history:
        number_of_things:
          '2024-04-04 18:28:45':
            reported_result: 100
          '2024-04-04 18:28:50':
            reported_result: 200
          '2024-04-04 18:28:54':
            reported_result: 300
        pipestat_modified_time:
          '2024-04-04 18:28:45':
            reported_result: '2024-04-04 18:28:45'
          '2024-04-04 18:28:48':
            reported_result: '2024-04-04 18:28:48'
          '2024-04-04 18:28:50':
            reported_result: '2024-04-04 18:28:50'
          '2024-04-04 18:28:52':
            reported_result: '2024-04-04 18:28:52'
          '2024-04-04 18:28:54':
            reported_result: '2024-04-04 18:28:54'
          '2024-04-04 18:28:56':
            reported_result: '2024-04-04 18:28:56'
        name_of_something:
          '2024-04-04 18:28:48':
            reported_result: Test_Name
          '2024-04-04 18:28:52':
            reported_result: Test_Name_Changed
          '2024-04-04 18:28:56':
            reported_result: Test_Name_Changed...Again

donaldcampbelljr commented 3 months ago

For now, I'm continuing with the above approach for the file backend and have added a retrieve_history function, which uses retrieve_one.
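
As a rough illustration of how a history lookup could sit on top of a single-record lookup, here is a sketch over an in-memory dict laid out like the YAML above. The function signatures are assumptions made for this example; pipestat's real `retrieve_one` and `retrieve_history` may take different arguments.

```python
from typing import Any, Dict, Optional

# In-memory stand-in for the parsed results file (trimmed from the YAML above).
results: Dict[str, Any] = {
    "test_pipe": {
        "project": {},
        "sample": {
            "RECORD1": {
                "number_of_things": 50000,
                "name_of_something": "Another_Name",
                "history": {
                    "number_of_things": {
                        "2024-04-04 18:28:43": {"reported_result": 100},
                        "2024-04-04 18:28:58": {"reported_result": 50000},
                    },
                    "name_of_something": {
                        "2024-04-04 18:28:43": {"reported_result": "Test_Name"},
                        "2024-04-04 18:28:59": {"reported_result": "Another_Name"},
                    },
                },
            }
        },
    }
}


def retrieve_one(data: Dict[str, Any], pipeline: str, level: str,
                 record_id: str) -> Dict[str, Any]:
    """Hypothetical single-record lookup (not pipestat's real signature)."""
    return data[pipeline][level][record_id]


def retrieve_history(data: Dict[str, Any], pipeline: str, level: str,
                     record_id: str,
                     result_id: Optional[str] = None) -> Dict[str, Any]:
    """Return the whole history of a record, or the history of one result."""
    record = retrieve_one(data, pipeline, level, record_id)
    history = record.get("history", {})
    if result_id is None:
        return history
    return history.get(result_id, {})


things = retrieve_history(results, "test_pipe", "sample", "RECORD1",
                          "number_of_things")
```

Here `things` maps each timestamp to the value reported at that time, and a record without a `history` key simply yields an empty dict.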

donaldcampbelljr commented 3 months ago

Currently, a deletion will look something like this:

        name_of_something:
          '2024-04-04 18:28:43':
            reported_result: Test_Name
          '2024-04-04 18:28:59':
            reported_result: Another_Name
          '2024-04-04 18:59:29':
            reported_result: Another_Name
          '2024-04-04 20:02:40':
            reported_result: Another_Name
          '2024-04-04 20:03:28':
            reported_result: Another_Name
          '2024-04-04 20:05:54':
            deletion: ''

However, if the record itself is removed (this happens when the history, pipestat_created_time, and pipestat_modified_time are all that is left), the history is removed along with the record.
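
The deletion behavior described above can be sketched as follows. This is an assumption-laden illustration over plain dicts: `remove_result` and `BOOKKEEPING_KEYS` are hypothetical names, and the real file backend may order or guard these steps differently.

```python
from typing import Any, Dict

# Keys that do not count as reported results (taken from the YAML above).
BOOKKEEPING_KEYS = {"history", "pipestat_created_time", "pipestat_modified_time"}


def remove_result(records: Dict[str, Dict[str, Any]], record_id: str,
                  result_id: str, now: str) -> bool:
    """Delete one result, log the deletion, and drop the record if it is empty."""
    record = records[record_id]
    if result_id not in record:
        return False
    del record[result_id]
    # Log the deletion under the result's history, keyed by timestamp.
    history = record.setdefault("history", {})
    history.setdefault(result_id, {})[now] = {"deletion": ""}
    # If only bookkeeping keys remain, remove the record and its history with it.
    if set(record) <= BOOKKEEPING_KEYS:
        del records[record_id]
    return True


records = {
    "RECORD1": {
        "number_of_things": 50000,
        "name_of_something": "Another_Name",
        "pipestat_created_time": "2024-04-04 17:23:56",
        "pipestat_modified_time": "2024-04-04 18:28:59",
    }
}
remove_result(records, "RECORD1", "name_of_something", "2024-04-04 20:05:54")
```

After this call, RECORD1 still exists because `number_of_things` remains, and its history carries a `deletion` entry for `name_of_something`. Removing `number_of_things` as well would leave only bookkeeping keys, so the whole record, history included, would be dropped.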

donaldcampbelljr commented 3 months ago

I'm currently working on the db_backend. It appears that we will also need to delete the record's history when the primary record is removed (similar to the file backend) because of a foreign key constraint:

Could not remove the result from the database. Exception: (psycopg.errors.ForeignKeyViolation) update or delete on table "default_pipeline_name__sample" violates foreign key constraint "default_pipeline_name__sample_history_source_record_id_fkey" on table "default_pipeline_name__sample_history"

However, I'm operating under the assumption that this is desirable anyway.
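
For anyone who wants to reproduce the constraint behavior locally, here is a small sketch that uses Python's built-in sqlite3 in place of PostgreSQL/psycopg. The table and column names are simplified assumptions based on the error message above, not the actual schema pipestat generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE sample (id INTEGER PRIMARY KEY, record_identifier TEXT)"
)
conn.execute(
    "CREATE TABLE sample_history ("
    " id INTEGER PRIMARY KEY,"
    " source_record_id INTEGER NOT NULL REFERENCES sample(id),"
    " number_of_things INTEGER)"
)
conn.execute("INSERT INTO sample (id, record_identifier) VALUES (1, 'RECORD1')")
conn.execute(
    "INSERT INTO sample_history (source_record_id, number_of_things)"
    " VALUES (1, 100)"
)

# Deleting the parent row first is rejected while history rows still point at it.
try:
    conn.execute("DELETE FROM sample WHERE id = 1")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True

# Deleting the history rows first lets the parent row be removed.
conn.execute("DELETE FROM sample_history WHERE source_record_id = 1")
conn.execute("DELETE FROM sample WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
```

An alternative would be to declare the foreign key with `ON DELETE CASCADE` so the database removes the history rows itself. Which option fits pipestat better is a design decision this sketch does not settle.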