Closed jbothma closed 1 year ago
Thank you, @jbothma, for reporting. This is indeed a bug. It happens for spiders that use "generic" names as file names. One approach could be to ensure each file name is always unique (for example, by including a timestamp as part of the filename). The only issue with this approach is that the compile releases option would be required to avoid duplicates in some cases.
If a crawl is performed twice with the same parameters, the filenames should be the same.
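As a rough illustration of the timestamp approach discussed above (the helper name and the timestamp format here are my own invention, not part of the codebase):

```python
from datetime import datetime, timezone


def unique_filename(base, when=None):
    """Prefix a file name with a UTC timestamp so repeated crawls
    never overwrite earlier output (illustrative only)."""
    when = when or datetime.now(timezone.utc)
    return f"{when.strftime('%Y%m%dT%H%M%S')}-{base}"


print(unique_filename('start.json', datetime(2023, 5, 1, 12, 0, 0)))
# 20230501T120000-start.json
```

The drawback mentioned above follows directly: because every crawl now produces distinct file names, repeated responses are kept as separate files, and the duplicates have to be resolved at a later stage.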
I think the simplest solution might be to prepend `from_date` to `start.json`, and to set `formatter` in `start_requests` to something like `join(pretty(self.from_date), parameters('page'))` (where `pretty` is a new function that formats datetimes). The `path` and `qs:*` spider arguments are the only other parameters that change the response, but I don't think they are changed between incremental updates, so they don't need to be included in the filename.
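A minimal sketch of the idea. `join` and `parameters` are the project's existing formatter helpers; the versions below are simplified stand-ins, and `pretty` is the proposed new function, whose exact output format is my guess:

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit


def pretty(dt):
    """Proposed new helper: format a datetime into a filename-safe string
    (this exact format is an assumption)."""
    return dt.strftime('%Y-%m-%dT%H-%M-%S')


def parameters(*keys):
    """Simplified stand-in for the project's parameters formatter:
    pull named query-string values out of the request URL."""
    def formatter(url):
        query = parse_qs(urlsplit(url).query)
        return '-'.join(f'{key}-{query[key][0]}' for key in keys if key in query)
    return formatter


def join(*parts):
    """Simplified stand-in for the project's join combinator: glue
    literal strings and formatter results together."""
    def formatter(url):
        rendered = [part(url) if callable(part) else part for part in parts]
        return '-'.join(piece for piece in rendered if piece)
    return formatter


# As in the suggestion above, assuming self.from_date is a datetime:
from_date = datetime(2023, 5, 1)
file_name = join(pretty(from_date), parameters('page'))
print(file_name('https://example.com/api?page=2'))
# 2023-05-01T00-00-00-page-2
```

With `from_date` baked into the name, two incremental crawls with different start dates write to different files instead of overwriting each other.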
Amazing. Thanks both!
I think there's a bug with the incremental update behaviour of the DatabaseStore. If I understand correctly, `crawl_time` has to be set to the same value each time the spider is run to get an incremental crawl.

The first time, the crawl writes `start.json`, `PageNumber-2.json`, ... Subsequent times, it writes `start.json`, `PageNumber-2.json`, ... (I think it's overwriting files here.)

expected: All the data crawled previously, plus the new data, should be in the database
actual: Data in the overwritten files is missing from the database

Am I doing something wrong, or is the overwriting an issue here? If I change `crawl_time` for each crawl, none of the first crawl's data is included.
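A toy simulation of the overwrite described above (the file names come from the report; the dict stands in for the crawl directory, and the record values are made up):

```python
# A crawl directory modeled as a dict: writing a file under an existing
# name overwrites it, just as on disk.
directory = {}


def crawl(files):
    directory.update(files)


# First crawl (illustrative records):
crawl({'start.json': ['a', 'b'], 'PageNumber-2.json': ['c']})
# Second incremental crawl reuses the same generic file names:
crawl({'start.json': ['d'], 'PageNumber-2.json': ['e']})

loaded = sorted(r for records in directory.values() for r in records)
print(loaded)  # ['d', 'e'] -- the first crawl's 'a', 'b', 'c' never reach the database
```

Only the second crawl's records survive to be loaded, which matches the observed behaviour that data from the overwritten files is missing.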
Some options I see: