simonw / sqlite-comprehend

Tools for running data in a SQLite database through AWS Comprehend
Apache License 2.0

Ability to run the command again against only fresh data #7

Closed simonw closed 2 years ago

simonw commented 2 years ago

Currently every time you run sqlite-comprehend entities it runs against every row, costing you money for rows that may already have been processed.

simonw commented 2 years ago

I can teach it to only run against rows that don't have corresponding records in the pages_comprehend_entities table (unless you pass --reset or similar) and that will work... but it will still try again for any records that didn't return any entities.

I need a way of recording "this document has been processed but returned no entities".
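A minimal sketch of the problem, using Python's `sqlite3` module. The schema here is a simplified stand-in: the `page_id` and `name` columns are assumptions for illustration, not the tool's real table layout. Filtering on "has no entity rows" keeps selecting a page that was processed but produced nothing:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
    -- Hypothetical, simplified entities table (real schema may differ)
    CREATE TABLE pages_comprehend_entities (
        page_id INTEGER REFERENCES pages(id),
        name TEXT
    );
    INSERT INTO pages VALUES
        (1, 'Simon lives in California'),
        (2, 'nothing to see here');
    -- Page 1 produced an entity; page 2 was processed but produced none
    INSERT INTO pages_comprehend_entities VALUES (1, 'California');
""")

# "Pages with no entity rows" cannot tell page 2 apart from a page
# that has never been sent to Comprehend at all.
unprocessed = [
    row[0]
    for row in db.execute("""
        SELECT pages.id FROM pages
        WHERE NOT EXISTS (
            SELECT 1 FROM pages_comprehend_entities AS e
            WHERE e.page_id = pages.id
        )
        ORDER BY pages.id
    """)
]
print(unprocessed)  # [2]
```

Page 2 would be submitted again on every run, which is the cost this issue is trying to avoid.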

Two options:

  1. Record a row with nulls in it in the pages_comprehend_entities table. This feels messy and is likely to confuse people querying the data later on.
  2. Create a pages_comprehend_entities_none table to record that case. I think I like this option best.

simonw commented 2 years ago

The other option would be a table that records a row for every input page that has been processed. This feels a bit more elegant than a _none table that may or may not exist, but will take up very slightly more space.

Could call that pages_comprehend_entities_done and just have the primary keys from the pages table - the existence of a row means that it has been processed.

I could record a timestamp too but that will take up extra space and doesn't feel necessary.
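A rough sketch of how such a `_done` table could work, assuming the `pages` table has a single integer primary key called `id`. The helper names `unprocessed_ids` and `mark_done` are hypothetical and are not part of sqlite-comprehend:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
    -- Only the primary key is stored: a row existing means "processed"
    CREATE TABLE pages_comprehend_entities_done (
        id INTEGER PRIMARY KEY REFERENCES pages(id)
    );
    INSERT INTO pages VALUES (1, 'first'), (2, 'second'), (3, 'third');
""")


def unprocessed_ids(db):
    """Return ids of pages with no matching row in the _done table."""
    sql = """
        SELECT pages.id FROM pages
        WHERE NOT EXISTS (
            SELECT 1 FROM pages_comprehend_entities_done AS done
            WHERE done.id = pages.id
        )
        ORDER BY pages.id
    """
    return [row[0] for row in db.execute(sql)]


def mark_done(db, page_id):
    """Record that a page was processed, whether or not it had entities."""
    db.execute(
        "INSERT OR IGNORE INTO pages_comprehend_entities_done (id) VALUES (?)",
        (page_id,),
    )


before = unprocessed_ids(db)  # all three pages
mark_done(db, 1)
mark_done(db, 2)              # e.g. a page that returned no entities
after = unprocessed_ids(db)   # only page 3 remains
```

Because the marker is independent of the entities table, a page with zero entities is skipped on later runs just like any other processed page.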

simonw commented 2 years ago

I played with GPT-3 to see what it thought the best SQL query would be for "return just records that haven't been processed yet":

(Screenshot 669BF523-4B9D-4154-A91A-6C82C7A470A2 of the GPT-3 response; image not preserved.)

simonw commented 2 years ago

I'm going to have the command default to only running against new records.

The --reset option can be used to discard that data and run from scratch.
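One way a reset like this could be sketched, assuming it simply drops both derived tables so the next run starts from nothing. The `reset` function below is hypothetical and only illustrates the idea; it is not the tool's actual implementation:

```python
import sqlite3


def reset(db, table="pages"):
    """Drop the derived tables for `table` so every row is treated as new."""
    for suffix in ("_comprehend_entities", "_comprehend_entities_done"):
        db.execute(f'DROP TABLE IF EXISTS "{table}{suffix}"')


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE pages_comprehend_entities (page_id INTEGER, name TEXT);
    CREATE TABLE pages_comprehend_entities_done (id INTEGER PRIMARY KEY);
""")

reset(db)

# Only the source table is left after the reset
remaining = [
    row[0]
    for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```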

It might be useful to have a mechanism in the future that lets you re-run against the documents specified with the --where clause WITHOUT completely dropping the existing tables. Not sure how to design that option though.

simonw commented 2 years ago

Added a note to the README:

You can delete records from that _done table to run them again.
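For example, re-queuing a single page might look like the sketch below. It assumes the `_done` table stores the page's primary key in a column named `id`; how any existing entity rows for that page are handled on the re-run is not covered here:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE pages_comprehend_entities_done (id INTEGER PRIMARY KEY);
    INSERT INTO pages VALUES (1, 'first'), (2, 'second');
    INSERT INTO pages_comprehend_entities_done VALUES (1), (2);
""")

# Forget that page 2 was processed so the next run picks it up again
db.execute("DELETE FROM pages_comprehend_entities_done WHERE id = ?", (2,))

still_done = [
    row[0]
    for row in db.execute(
        "SELECT id FROM pages_comprehend_entities_done ORDER BY id"
    )
]
```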