ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

URL launch with launch timestamp not forcing recrawl #69

Open anjackson opened 3 years ago

anjackson commented 3 years ago

Launching a URL with a launch timestamp like this doesn't seem to force a recrawl:

{
  "headers": {},
  "method": "GET",
  "parentUrl": "https://charmed-chemical-sky.glitch.me/",
  "parentUrlMetadata": {
    "pathFromSeed": "",
    "heritableData": {
      "refreshDepth": 1,
      "source": "longtime",
      "heritable": [
        "source",
        "heritable",
        "refreshDepth"
      ],
      "annotations": [
        "launchTimestamp:20300301120000"
      ],
      "launchTimestamp": "20300301120000",
      "launch_ts": "20300301120000"
    }
  },
  "isSeed": false,
  "forceFetch": false,
  "url": "https://charmed-chemical-sky.glitch.me/",
  "hop": "",
  "timestamp": "2021-06-15T15:30:57.426402"
}

But with forceFetch set to true it seems to work okay:

{
  "headers": {},
  "method": "GET",
  "parentUrl": "https://charmed-chemical-sky.glitch.me/",
  "parentUrlMetadata": {
    "pathFromSeed": "",
    "heritableData": {
      "refreshDepth": 1,
      "source": "longtime",
      "heritable": [
        "source",
        "heritable",
        "refreshDepth"
      ],
      "annotations": [
        "launchTimestamp:20300301120000"
      ],
      "launchTimestamp": "20300301120000",
      "launch_ts": "20300301120000"
    }
  },
  "isSeed": false,
  "forceFetch": true,
  "url": "https://charmed-chemical-sky.glitch.me/",
  "hop": "",
  "timestamp": "2021-06-15T15:31:42.264611"
}
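
To make the comparison easier to reproduce, here is a minimal sketch that builds both messages from the same fields shown above, so that forceFetch (and the timestamp) is the only thing that differs. The helper name build_launch_message is hypothetical, and how the message is then submitted to the crawler is not covered in this issue, so that part is left out.

```python
import json
from datetime import datetime


def build_launch_message(url, launch_ts, force_fetch=False):
    """Build a launch message shaped like the examples in this issue.

    Hypothetical helper: the field names are copied from the messages
    above, but this is not an official client API.
    """
    heritable_data = {
        "refreshDepth": 1,
        "source": "longtime",
        "heritable": ["source", "heritable", "refreshDepth"],
        "annotations": ["launchTimestamp:" + launch_ts],
        "launchTimestamp": launch_ts,
        "launch_ts": launch_ts,
    }
    return {
        "headers": {},
        "method": "GET",
        "parentUrl": url,
        "parentUrlMetadata": {
            "pathFromSeed": "",
            "heritableData": heritable_data,
        },
        "isSeed": False,
        "forceFetch": force_fetch,
        "url": url,
        "hop": "",
        # Local time in ISO 8601 form, as in the examples above.
        "timestamp": datetime.now().isoformat(),
    }


url = "https://charmed-chemical-sky.glitch.me/"
plain = build_launch_message(url, "20300301120000")
forced = build_launch_message(url, "20300301120000", force_fetch=True)
print(json.dumps(forced, indent=2))
```

Here plain corresponds to the first message (no recrawl observed) and forced to the second.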

anjackson commented 3 years ago

Ah, this will be the issue where the BdbUriUniqFilter can get in the way: it drops URLs it has already seen. When we ask the crawler to re-prioritise a URL that is already somewhere in the frontier, we need to force it to be accepted.
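
As a rough illustration of that behaviour, here is a toy model of an already-seen filter. It is a simplified sketch, not the Heritrix implementation: the class SeenUrlFilter, its methods, and the launch function are all hypothetical names, and the assumption is that a forced launch bypasses the seen check while a normal launch does not.

```python
class SeenUrlFilter:
    """Toy stand-in for an already-seen URL filter (not the Heritrix API)."""

    def __init__(self):
        self._seen = set()
        self.scheduled = []

    def add(self, url):
        """Schedule url only if it has not been seen before."""
        if url in self._seen:
            return False  # silently dropped: this is the failure mode above
        self._seen.add(url)
        self.scheduled.append(url)
        return True

    def add_force(self, url):
        """Schedule url even if it has been seen before."""
        self._seen.add(url)
        self.scheduled.append(url)
        return True


def launch(uniq, message):
    """Route a launch message through the filter, honouring forceFetch."""
    url = message["url"]
    if message.get("forceFetch", False):
        return uniq.add_force(url)
    return uniq.add(url)


uniq = SeenUrlFilter()
url = "https://charmed-chemical-sky.glitch.me/"
first = launch(uniq, {"url": url, "forceFetch": False})   # new URL: accepted
repeat = launch(uniq, {"url": url, "forceFetch": False})  # already seen: dropped
forced = launch(uniq, {"url": url, "forceFetch": True})   # forced: accepted
print(first, repeat, forced)
```

In this model the launch timestamp never gets a chance to matter for the repeat launch, because the URL is rejected before any recrawl logic could look at it.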

anjackson commented 3 years ago

Actually, I'll leave this open until I've documented it somewhere.