muchdogesec / history4feed

Creates a complete full text historical archive for an RSS or ATOM feed.
https://www.dogesec.com/
Apache License 2.0

Start logging successful URLs in jobs #18

Closed: himynamesdave closed this issue 1 month ago

himynamesdave commented 1 month ago

The app should log, against each job, the URLs that were downloaded and the URLs that failed.

I've noticed some sites throw timeouts quite regularly.

To help ensure consistency, we should start logging:

urls_successful = 200
urls_failed = 40x error/timeouts/50x error
urls_skipped = remote feeds that are ignored (if applicable), see #17
urls_pending = URLs yet to be processed
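
A rough sketch of how the four buckets above could be tracked per job. This is not history4feed's actual code: `JobUrlLog` and its methods are hypothetical names used only for illustration, and treating every non-200 status as failed is an assumption based on the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JobUrlLog:
    """Hypothetical per-job URL log (illustrative, not part of history4feed)."""

    successful: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)

    def queue(self, url: str) -> None:
        """Register a URL that has not been processed yet."""
        self.pending.append(url)

    def _take(self, url: str) -> None:
        # A URL leaves "pending" as soon as it gets a final outcome.
        if url in self.pending:
            self.pending.remove(url)

    def skip(self, url: str) -> None:
        """Mark a URL as ignored, e.g. a link to a remote feed."""
        self._take(url)
        self.skipped.append(url)

    def record(self, url: str, status: Optional[int]) -> None:
        """Store the outcome of a download attempt.

        `status` is the HTTP status code, or None for a timeout or
        connection error. Assumption: only 200 counts as successful.
        """
        self._take(url)
        if status == 200:
            self.successful.append(url)
        else:
            self.failed.append(url)

    def as_dict(self) -> Dict[str, List[str]]:
        """Return the buckets in a shape suitable for the job response."""
        return {
            "successful": self.successful,
            "failed": self.failed,
            "skipped": self.skipped,
            "pending": self.pending,
        }
```

With this shape, a job that is still running would simply show its unprocessed URLs under `pending`, and they would move to one of the other buckets as each request completes.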

The current response looks like:

  "jobs": [
    {
      "id": "6e536c66-2546-492a-aa86-70211ef2072d",
      "feed_id": "f3b2b84d-7a6e-5998-963c-60289c310d28",
      "state": "running",
      "run_datetime": "2024-07-30T14:53:34.575593Z",
      "earliest_item_requested": "2020-01-01T00:00:00Z",
      "latest_item_requested": "2024-07-30T14:53:34.575178Z",
      :
      "info": ""
    }
  ]

The proposed response would look like:

  "jobs": [
    {
      "id": "6e536c66-2546-492a-aa86-70211ef2072d",
      "feed_id": "f3b2b84d-7a6e-5998-963c-60289c310d28",
      "state": "running",
      "run_datetime": "2024-07-30T14:53:34.575593Z",
      "earliest_item_requested": "2020-01-01T00:00:00Z",
      "latest_item_requested": "2024-07-30T14:53:34.575178Z",
      "urls": {
        "successful": [
          "https://google.com",
          "https://xyz.com"
        ],
        "failed": [
          "https://123.com"
        ],
        "skipped": [],
        "pending": []
      },
      "info": ""
    }
  ]

himynamesdave commented 1 month ago

@fqrious blocked by #20

himynamesdave commented 1 month ago

@fqrious the response for skipped URLs seems to be hardcoded:

curl -X 'GET' \
  'http://127.0.0.1:8000/api/v1/jobs/' \
  -H 'accept: application/json'

{
  "page_size": "50",
  "page_number": 1,
  "page_results_count": 1,
  "total_results_count": 1,
  "jobs": [
    {
      "id": "465a8166-c2c1-4c61-9afc-8f1a1f753260",
      "count_of_items": 20,
      "feed_id": "f3b2b84d-7a6e-5998-963c-60289c310d28",
      "urls": {
        "retrieved": [
          "https://grahamcluley.com/smashing-security-podcast-377/",
          "https://grahamcluley.com/smashing-security-podcast-371/"
        ],
        "skipped": [
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link",
          "https://example.com/blog-link"
        ],
        "failed": [],
        "retrieving": []
      },
      "state": "success",
      "run_datetime": "2024-07-31T16:51:19.647871Z",
      "earliest_item_requested": "2020-01-01T00:00:00Z",
      "latest_item_requested": "2024-07-31T16:51:19.647382Z",
      "info": ""
    }
  ]
}

It should show the actual URL of each skipped page.
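
To spot this kind of placeholder output quickly, a small helper could count the URLs in each bucket of a job and flag any URL that is repeated within a bucket. This is only a sketch: `summarise_job_urls` is a hypothetical helper, and it assumes the job has already been parsed from the JSON response above into a Python dict.

```python
from collections import Counter
from typing import Dict


def summarise_job_urls(job: dict) -> Dict[str, dict]:
    """Count URLs per bucket and list URLs repeated within a bucket."""
    summary = {}
    for bucket, urls in job.get("urls", {}).items():
        counts = Counter(urls)
        summary[bucket] = {
            "total": len(urls),
            # A URL that appears more than once hints at placeholder data.
            "repeated": sorted(u for u, n in counts.items() if n > 1),
        }
    return summary


# Sample shaped like the job in the response above.
job = {
    "urls": {
        "retrieved": [
            "https://grahamcluley.com/smashing-security-podcast-377/",
            "https://grahamcluley.com/smashing-security-podcast-371/",
        ],
        "skipped": ["https://example.com/blog-link"] * 18,
        "failed": [],
        "retrieving": [],
    }
}
summary = summarise_job_urls(job)
```

For the sample job, `summary["skipped"]` reports 18 URLs with `https://example.com/blog-link` listed as repeated, while `retrieved` has no repeats.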

fqrious commented 1 month ago

I can't. We don't have the URLs because these jobs ran before the update, and the links for those jobs were not saved. The actual URLs will only show on new jobs.

himynamesdave commented 1 month ago

@fqrious ah, ok. that's fine then

himynamesdave commented 1 month ago

@fqrious it's working well, except we need to account for slight variations on the domain.

To demo, for https://www.grahamcluley.com/feed/

See how https://grahamcluley.com is retrieved but https://www.grahamcluley.com is not.

We should consider a domain i

e.g. https://www.grahamcluley.com/feed/ entered as a value should consider all urls that match the pattern

*grahamcluley.com

so posts from

http://grahamcluley.com
https://grahamcluley.com
http://www.grahamcluley.com
http://sub.grahamcluley.com

would all match.
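
One way the matching described above could be sketched, using only the standard library. The function names are hypothetical, and stripping a leading `www.` to derive the base domain is an assumption; a real implementation might use the Public Suffix List instead. Note that a literal `*grahamcluley.com` glob would also match a host such as `notgrahamcluley.com`, so this sketch matches the base domain and its subdomains rather than a raw suffix.

```python
from urllib.parse import urlsplit


def base_domain(feed_url: str) -> str:
    """Return the feed's host without a leading "www." (naive assumption)."""
    host = (urlsplit(feed_url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def belongs_to_feed(post_url: str, feed_url: str) -> bool:
    """True if the post's host is the feed's base domain or a subdomain of it.

    The scheme (http/https) is ignored on purpose.
    """
    base = base_domain(feed_url)
    host = (urlsplit(post_url).hostname or "").lower()
    return bool(base) and (host == base or host.endswith("." + base))


FEED = "https://www.grahamcluley.com/feed/"
candidates = [
    "http://grahamcluley.com",
    "https://grahamcluley.com",
    "http://www.grahamcluley.com",
    "http://sub.grahamcluley.com",
    "https://www.tripwire.com/state-of-security/",
]
matches = [url for url in candidates if belongs_to_feed(url, FEED)]
```

Here `matches` contains the four grahamcluley.com URLs and excludes the tripwire.com link, which would remain skipped.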

Example of current output...

{
  "id": "3ecdce7a-94c9-4c9e-b46e-7e307443e334",
  "count_of_items": 561,
  "feed_id": "846d036f-1472-5708-bb75-f6e3b95e350f",
  "urls": {
    "retrieved": [
      "https://grahamcluley.com/ebrd-hacker-twitter/",
      "https://grahamcluley.com/feed-sponsor-av-comparatives/",
      "https://grahamcluley.com/porn-wielding-zoom-bombers-disrupt-twitter-hack-court-hearing/",
      "https://grahamcluley.com/google-porn-titles-train-station-search-results/",
      "https://grahamcluley.com/smashing-security-podcast-190-twitter-hack-arrests-email-bad-behaviour-and-fawkes-vs-facial-recognition/",
      "https://grahamcluley.com/garmin-staggers-back-online-after-ransomware-attack/",
      "https://grahamcluley.com/twitter-phone-spear-phishing/",
      "https://grahamcluley.com/a-scam-letter-warn-your-vulnerable-loved-ones-to-be-on-their-guard/",
      "https://grahamcluley.com/garmin-ransomware-attack/",
      "https://grahamcluley.com/smashing-security-podcast-189/",
      "https://grahamcluley.com/free-iphone-apple-bug-hunters/",
      "https://grahamcluley.com/smashing-security-podcast-188/",
      "https://grahamcluley.com/feed-sponsor-recorded-future-4/",
      "https://grahamcluley.com/uk-government-russia-report/",
      "https://grahamcluley.com/mitre-the-creepy-company-checking-your-fingerprints-on-facebook-for-the-us-government/"
    ],
    "skipped": [
      "https://www.grahamcluley.com/graham-cluley-on-totally-unprepared-politics-podcast/",
      "https://www.grahamcluley.com/amazon-ring-staff-spied-videos/",
      "https://www.grahamcluley.com/currys-pc-world-dixons-data-breach/",
      "https://www.bitdefender.com/box/blog/iot-news/cryptojacked-routers-reduce-78-se-asia-following-operation-goldfish-alpha#new_tab",
      "https://www.grahamcluley.com/stop-everything-update-firefox-now/",
      "https://www.tripwire.com/state-of-security/featured/man-jailed-using-webcam-rat-women-bedrooms/#new_tab",
      "https://www.grahamcluley.com/smashing-security-160-snafus-ms-word-amazon-ring-and-tiktok/",
      "https://www.grahamcluley.com/city-of-las-vegas-wakes-up-to-a-cyber-attack/",
      "https://www.grahamcluley.com/travelex-ransomware/",
      "https://www.grahamcluley.com/feed-sponsor-av-comparatives-1/",
      "https://www.grahamcluley.com/ransomware-shuts-company/",
      "https://www.grahamcluley.com/travelex-still-offline-after-discovering-malware-on-new-years-eve-and-other-banks-currency-services-are-also-affected/",
      "https://www.grahamcluley.com/smashing-security-159-rap-robbery-and-iot-holiday-hell/",
      "https://www.grahamcluley.com/feed-sponsor-recorded-future/",
      "https://www.tripwire.com/state-of-security/security-data-protection/waco-water-bill-attack-click2gov-breaches/#new_tab",
      "https://www.grahamcluley.com/smashing-security-158-the-man-behind-the-missing-cryptoqueen/",
      "https://hotforsecurity.bitdefender.com/blog/web-h