Closed simonw closed 2 years ago
I can teach it to only run against rows that don't have corresponding records in the `pages_comprehend_entities` table (unless you pass `--reset` or similar) and that will work... but it will still try again for any records that didn't return any entities.
I need a way of recording "this document has been processed but returned no entities".
Two options:

- Record the "no entities" case in the `pages_comprehend_entities` table itself. This feels messy, and it's likely to confuse people querying the data later on.
- Add a `pages_comprehend_entities_none` table to record that case. I think I like this option best.

The other option would be a table that records a row for every input page that has been processed. This feels a bit more elegant than a `_none` table that may or may not exist, but will take up very slightly more space.
Could call that `pages_comprehend_entities_done` and just have the primary keys from the `pages` table - the existence of a row means that it has been processed.
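A minimal sketch of that `_done` design, using Python's built-in `sqlite3` module. The `id` primary key on `pages`, the simplified `page_id`/`name` columns on `pages_comprehend_entities`, and the `mark_done` helper are all assumptions for illustration, not the tool's actual schema or code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: the real entities table has more columns.
conn.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE pages_comprehend_entities (
    page_id INTEGER REFERENCES pages(id),
    name TEXT
);
-- One row per processed page; the row's existence is the only signal.
CREATE TABLE pages_comprehend_entities_done (
    id INTEGER PRIMARY KEY REFERENCES pages(id)
);
""")
conn.executemany(
    "INSERT INTO pages (id, text) VALUES (?, ?)",
    [(1, "Paris is in France"), (2, "...")],
)


def mark_done(conn, page_id, entities):
    """Hypothetical helper: store any entities, then record the page as processed."""
    conn.executemany(
        "INSERT INTO pages_comprehend_entities (page_id, name) VALUES (?, ?)",
        [(page_id, name) for name in entities],
    )
    # Recorded even when `entities` is empty, so the page is not retried.
    conn.execute(
        "INSERT OR IGNORE INTO pages_comprehend_entities_done (id) VALUES (?)",
        (page_id,),
    )


mark_done(conn, 1, ["Paris", "France"])
mark_done(conn, 2, [])  # no entities returned, still marked as processed

done_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM pages_comprehend_entities_done ORDER BY id"
    )
]
entity_count = conn.execute(
    "SELECT count(*) FROM pages_comprehend_entities"
).fetchone()[0]
```

Page 2 ends up in the `_done` table even though it contributed no rows to `pages_comprehend_entities`, which is the case the entities table alone cannot express.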
I could record a timestamp too but that will take up extra space and doesn't feel necessary.
I played with GPT-3 to see what it thought the best SQL query would be for "return just records that haven't been processed yet":
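For reference, one common way to express "records that haven't been processed yet" against a `_done` table is a `NOT IN` subquery. This is an independent sketch, not necessarily what GPT-3 suggested, and it assumes `pages` has an `id` primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema and sample data for illustration only.
conn.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE pages_comprehend_entities_done (id INTEGER PRIMARY KEY);
INSERT INTO pages (id, text) VALUES (1, 'one'), (2, 'two'), (3, 'three');
INSERT INTO pages_comprehend_entities_done (id) VALUES (1), (3);
""")

# Select only the pages with no matching row in the _done table.
UNPROCESSED_SQL = """
SELECT pages.id, pages.text
FROM pages
WHERE pages.id NOT IN (SELECT id FROM pages_comprehend_entities_done)
ORDER BY pages.id
"""

unprocessed = conn.execute(UNPROCESSED_SQL).fetchall()
```

A `LEFT JOIN ... WHERE done.id IS NULL` or a `NOT EXISTS` subquery would return the same rows here.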
I'm going to have the command default to only running against new records.
The `--reset` option can be used to discard that data and run from scratch.
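The two behaviours could be sketched like this. `pages_to_process` and `ensure_tables` are hypothetical helpers, the schema is simplified, and treating "discard" as dropping and recreating both tables is an assumption about what `--reset` does:

```python
import sqlite3


def ensure_tables(conn):
    """Create the (simplified, assumed) output tables if they are missing."""
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS pages_comprehend_entities (
        page_id INTEGER, name TEXT
    );
    CREATE TABLE IF NOT EXISTS pages_comprehend_entities_done (
        id INTEGER PRIMARY KEY
    );
    """)


def pages_to_process(conn, reset=False):
    """Return ids of pages that still need processing.

    With reset=True, earlier results are discarded first, so every
    page is returned again.
    """
    if reset:
        conn.executescript("""
        DROP TABLE IF EXISTS pages_comprehend_entities;
        DROP TABLE IF EXISTS pages_comprehend_entities_done;
        """)
    ensure_tables(conn)
    rows = conn.execute(
        "SELECT id FROM pages WHERE id NOT IN "
        "(SELECT id FROM pages_comprehend_entities_done) ORDER BY id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO pages (id, text) VALUES (?, ?)", [(1, "one"), (2, "two")]
)
ensure_tables(conn)
conn.execute("INSERT INTO pages_comprehend_entities_done (id) VALUES (1)")

default_run = pages_to_process(conn)            # only the new page
reset_run = pages_to_process(conn, reset=True)  # everything, from scratch
```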
It might be useful to have a mechanism in the future that lets you re-run against the documents specified with the `--where` clause WITHOUT completely dropping the existing tables. Not sure how to design that option though.
Added a note to the README:
You can delete records from that `_done` table to run them again.
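As an illustration of that note, deleting a row from the `_done` table makes the page show up as unprocessed again. The `id` column and the sample data are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: three pages, all already marked as processed.
conn.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE pages_comprehend_entities_done (id INTEGER PRIMARY KEY);
INSERT INTO pages (id, text) VALUES (1, 'one'), (2, 'two'), (3, 'three');
INSERT INTO pages_comprehend_entities_done (id) VALUES (1), (2), (3);
""")

# Re-queue page 2 by removing its "processed" marker.
conn.execute("DELETE FROM pages_comprehend_entities_done WHERE id = ?", (2,))

pending = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM pages WHERE id NOT IN "
        "(SELECT id FROM pages_comprehend_entities_done) ORDER BY id"
    )
]
```

Whether entities already stored for that page should be deleted before the re-run is a separate question that this sketch does not address.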
Currently every time you run `sqlite-comprehend entities` it runs against every row, costing you money for rows that may already have been processed.