Closed by anjackson 5 years ago
Looking at the implementation, having a sheet for each Target (that needs one) seems like it should work fine. Implemented a basic version that always creates a custom sheet for each URL.
This is overkill for URLs that are just being submitted via nomination/patching, so it needs to happen only when required, i.e. if it's a real Target launch - I'm not sure how best to indicate that. Perhaps the simplest approach is to have the user pass in a map of key-value pairs, but I'm not sure how to ensure we get all the types right.
Relying on JSON serialisation seems to be fine. New implementation (b2f3862de1d04f13fb508da7fe8fce4b7e91921e) allows a custom targetSheet
to be set when submitting a launch request, allowing arbitrary properties to be set at that level.
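For illustration, a launch request carrying a custom targetSheet might look something like the following. This is a sketch only: the field names and value formats here are hypothetical, not the exact wire format used by the launcher.

```json
{
  "url": "http://www.bbc.co.uk/news",
  "targetSheet": {
    "launchTimestamp": "20190101120000",
    "recrawlInterval": 86400
  }
}
```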
We now have:

- launchTimestamp via sheet
- launch_ts via property (which could be inherited)
- recrawlInterval via sheet
- recrawlInterval as a property (which could be inherited)

We don't need all of these, but we do need a way of saying "get an updated version of this specific URL only" rather than only re-crawling a whole SURT. So cleaning up the property version, and allowing that to take precedence, seems appropriate. Not sure we need the recrawlInterval at all now.
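The precedence rule proposed above (the per-URL property wins over the sheet value) can be sketched as follows. The class and method names here are mine for illustration, not actual Heritrix API:

```java
// Minimal sketch, assuming launch timestamps are held as nullable Longs:
// a launch_ts carried as a (possibly inherited) URI property, when present,
// overrides any launchTimestamp supplied via a sheet.
class LaunchTsResolver {

    /**
     * @param propertyTs launch_ts from the URI's properties, or null if unset
     * @param sheetTs    launchTimestamp from the applicable sheet, or null if unset
     * @return the effective launch timestamp, or null if neither is set
     */
    static Long effectiveLaunchTs(Long propertyTs, Long sheetTs) {
        // The per-URL property takes precedence over the sheet value:
        return (propertyTs != null) ? propertyTs : sheetTs;
    }
}
```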
Minor crawl problems observed due to skipping the disposition chain too early, which meant we were failing to clear the uriUniqFilter memory of a URL if it got rejected by an OUT_OF_SCOPE decision. Fixed in d0a82ba, to be rolled into 2.4.6.
Much better, though still some oddities. The less-than-24-hours re-crawl frequency now backfires a bit, permitting over-crawling. The launch-timestamp mechanism works better, so the recrawl-frequency mechanism should be removed/phased out.
Sites manually checked appear fine, except BBC News oddly. Currently investigating.
Hm, it's acting like it's queued but low priority. This again can easily happen with the short re-crawl frequency - it can get rediscovered via long hop-paths, which get low priority, and then struggle to get to the front of the queue as new/unqueued seed launches come in.
This morning I modified the launcher to avoid using the recrawl interval. Hopefully daily targets will be more reliable now.
Working better, although some seeds did not start working because they had been rediscovered and queued with low priority.
This is perhaps the remaining problem. If target URLs get discovered before being officially re-launched, and get enqueued, then the re-launch doesn't get through (because the Frontier's unique URI filter cache already has it). e.g. I just re-launched, and it re-consumed the crawl queue, but re-used the scope, so the BBC News homepage may get discovered and enqueued before the launch request comes in.
For frequent crawling, it probably makes the most sense NOT to re-use the scope between crawls, as we're not re-using anything else. That should avoid the issue.
Alternatively, we could force seed requests into the frontier even if the URL is already present. The existing filter does support that, if we set forceFetch when submitting URLs. For our seeds I think this makes sense, even though it may allow a small amount of over-crawling.
Actually, as of 2.4.8, using forceFetch is a bad idea, as it bypasses the RecentlySeen filter (i.e. it really does force a fetch). I'm modifying the RecentlySeen filter to make this optional (for our crawls, it doesn't make sense to force the fetch if the URL has been seen recently enough). So 2.4.9 should be good for using forceFetch to ensure we get enqueued into the frontier, even if the crawl turns out to be unnecessary later.
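The intended 2.4.9 behaviour can be sketched as two separate decisions: forceFetch gets the URL past the frontier's already-seen filter, but the RecentlySeen check can still veto the actual fetch unless it is explicitly configured to honour forceFetch. This is an illustrative model only, not the real Heritrix code:

```java
// Sketch, assuming boolean inputs summarise the filter states.
class EnqueueDecision {

    /** Should the URL be admitted to the frontier queue? */
    static boolean shouldEnqueue(boolean alreadyInUniqFilter, boolean forceFetch) {
        // forceFetch lets a URL past the uriUniqFilter even if already present.
        return !alreadyInUniqFilter || forceFetch;
    }

    /** Once queued, should we actually fetch it? */
    static boolean shouldFetch(boolean seenRecentlyEnough, boolean forceFetch,
                               boolean forceOverridesRecentlySeen) {
        if (seenRecentlyEnough) {
            // Pre-2.4.9, forceFetch always bypassed RecentlySeen;
            // after the change, that bypass is optional.
            return forceFetch && forceOverridesRecentlySeen;
        }
        return true;
    }
}
```

With forceOverridesRecentlySeen set to false, a forced seed is re-enqueued but still skipped if it was crawled recently enough, which is the behaviour wanted here.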
Verified as now working fine.
The launch_ts approach works well, but when we have a large Target with multiple individual page-level Targets (e.g. the BBC News homepage versus individual articles), the current implementation tends to over-crawl. e.g. if a particular article is re-crawled, it sets a new launch_ts, and as this is inherited by discovered URLs, it causes the whole site to get re-crawled.

In practice, we probably only want to inherit the launch_ts for the homepage of each site. Options include:

- Drop the launch_ts and rely on the recrawl sheet frequency.
- Only inherit the launch_ts for URLs with no path (this will not work quite as expected if the whole host does not have a suitable record).

In truth, we want SURT-prefix-scoped launch_ts values. i.e. rather than inheriting the launch_ts value directly, add a new mechanism that gets configured when the URL comes in. This is like the sheets mechanism, but sheets are quite difficult to use for this, as you'd need a sheet for every SURT prefix, with the launch_ts set. It seems likely that this large number of sheets would not work reliably.

An alternative would be to create a new Processor that gets configured with the SURT-to-launch_ts mapping and applies the right one prior to scoping of candidates.
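The core of such a Processor would be a longest-prefix-match lookup from SURT prefix to launch_ts. A minimal standalone sketch (class names and structure are mine; a real implementation would hang off a Heritrix Processor before scoping):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a SURT-prefix-scoped launch_ts lookup: given a map from SURT
// prefix to launch timestamp, return the value for the longest prefix that
// matches a candidate URL's SURT form. Longest-prefix-wins means an entry
// for "(uk,co,bbc,www,)/news" overrides one for "(uk,co,bbc,www,)".
class SurtLaunchTsMap {
    private final TreeMap<String, Long> byPrefix = new TreeMap<>();

    void put(String surtPrefix, long launchTs) {
        byPrefix.put(surtPrefix, launchTs);
    }

    /** Returns the launch_ts for the longest matching SURT prefix, or null. */
    Long lookup(String surt) {
        // Any prefix of surt sorts <= surt, and longer prefixes sort later,
        // so walk keys downwards from floorEntry(surt); the first key that
        // is actually a prefix of surt is the longest match.
        Map.Entry<String, Long> e = byPrefix.floorEntry(surt);
        while (e != null) {
            if (surt.startsWith(e.getKey())) {
                return e.getValue();
            }
            e = byPrefix.lowerEntry(e.getKey());
        }
        return null;
    }
}
```

The linear walk is a simplification; it is fine for modest numbers of prefixes, and avoids the one-sheet-per-SURT-prefix problem described above because the whole mapping lives in a single structure.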