open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io

Incremental updates with database store #1025

Closed: jbothma closed this issue 1 year ago

jbothma commented 1 year ago

I think there's a bug with the incremental update behaviour of the DatabaseStore.

If I understand correctly, crawl_time has to be set to the same value each time the spider is run to get an incremental crawl.

The first time:

  1. it crawls
  2. it saves data to the crawl_time directory in file names like start.json, PageNumber-2.json, ...
  3. it creates one CSV file from all the crawl files
  4. it creates the OCDS data table and inserts the data from the CSV file (roughly as sketched below)
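
For concreteness, here is a minimal sketch of what I understand steps 3 and 4 to amount to. The table name, CSV path, and connection details are placeholders; this is my reading of the extension, not its actual code:

```python
import psycopg2

# Placeholder connection details (the extension is configured via its own settings).
connection = psycopg2.connect('postgresql://user:pass@localhost/kingfisher')
cursor = connection.cursor()

# A single JSONB column holds each release read from the crawl files.
cursor.execute('CREATE TABLE IF NOT EXISTS example_spider (data jsonb)')

# Load the CSV built from every file in the crawl_time directory.
with open('example_spider.csv') as f:
    cursor.copy_expert('COPY example_spider (data) FROM STDIN WITH CSV', f)

connection.commit()
```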

Subsequent times:

  1. it gets the latest publish date from the data in the table (sketched after this list)
  2. it crawls from that publish date
    1. it saves data to the crawl_time directory in file names like start.json, PageNumber-2.json, ... (I think it's overwriting files here)
  3. it creates one CSV file from all the crawl files
  4. deletes the existing data and inserts the data from the CSV file
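
A sketch of that first step as I understand it, continuing from the snippet above. The JSON path expression and table name are my guesses, not the extension's actual query:

```python
# Read the most recent release date already loaded, to use as from_date.
cursor.execute("SELECT max(data->>'date') FROM example_spider")
from_date = cursor.fetchone()[0]
```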

Expected: all the data crawled previously, plus the new data, should be in the database.
Actual: data in the overwritten files is missing from the database.

Am I doing something wrong, or is the overwriting an issue here? If I change crawl_time for each crawl, none of the first crawl's data is included.

Some options I see:

yolile commented 1 year ago

Thank you, @jbothma, for reporting. This is indeed a bug. It happens for spiders that use "generic" names as file names. One approach could be to ensure each file name is always unique, for example by including a timestamp in the filename. The only issue with this approach is that the compile release option would then be required to avoid duplicates in some cases.
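
Something like this sketch of the idea (the helper name and timestamp format are illustrative, not an implemented API):

```python
from datetime import datetime, timezone

def unique_file_name(base_name):
    # Suffix each file name with the time it is written, so a later crawl
    # never overwrites an earlier crawl's files.
    stem, _, extension = base_name.rpartition('.')
    stamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%S')
    return f'{stem}-{stamp}.{extension}'

# unique_file_name('PageNumber-2.json') -> e.g. 'PageNumber-2-20240101T120000.json'
```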

jpmckinney commented 1 year ago

If a crawl is performed twice with the same parameters, the filenames should be the same.

I think the simplest solution might be to prepend from_date to start.json and to set formatter in start_requests to something like join(pretty(self.from_date), parameters('page')) (where pretty is a new function that formats datetimes).
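
A sketch of how I read this, with pretty returning a formatter callable so that it composes with join. The date format, and the assumption that the spider builds requests via build_request, are mine:

```python
from kingfisher_scrapy.util import join, parameters


def pretty(datetime_):
    # Proposed new helper: a formatter that renders the datetime in a
    # filesystem-safe way, ignoring the URL it is given.
    def formatter(url):
        return datetime_.strftime('%Y-%m-%dT%H-%M-%S')
    return formatter


# Inside the spider class:
def start_requests(self):
    # Yields file names like 2024-01-01T00-00-00-page-1.json (assuming the
    # usual .json suffix handling), so crawls with different from_date
    # values no longer overwrite each other's files.
    yield self.build_request(
        'https://example.com/api/releases?page=1',
        formatter=join(pretty(self.from_date), parameters('page')),
    )
```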

jpmckinney commented 1 year ago

The path and qs:* spider arguments are the only other parameters that change the response, but I don't think they are changed between incremental updates, so they don't need to be included in the filename.

jbothma commented 1 year ago

Amazing. Thanks both!