qri-io / walk

Webcrawler/sitemapper
GNU General Public License v3.0

WIP: Sitemap #13

Closed b5 closed 5 years ago

b5 commented 5 years ago

Should get us ready to test our first milestone 😄.

Easiest way to play with this one is via the restored walk sitemap command, which accepts a configuration and generates a sitemap.json file whenever you interrupt the process (control-C, which sends SIGINT).

This PR introduces badger as a key-value store that sitemap can rely on: requests are built up as Resources come in, and the final sitemap is output by iterating keys with a matching prefix. This approach also makes versioning different sitemap crawls a near-term possibility, simply by varying the prefix.

Based on previous work with sitemap building & massive amounts of RAM consumption, I feel like Badger will provide a lot of mileage for this project moving forward. This PR will eventually include a Badger implementation of the RequestStore interface.

Mr0grog commented 5 years ago

Is this ready for (or desiring) review?

b5 commented 5 years ago

I'd love a mid-point checkin 😄. Generally I like to formally "request review" when ready for such a thing, but I'm a huge fan of mid-way comments to keep PRs on the right track.

In my head at least two things are left to do in this PR:

I'm thinking redirect handling will be better handled as a separate PR.

Mr0grog commented 5 years ago

(Should have said: mostly looks awesome! Hence my comments being pretty confined to naming and not much functionality or architecture :P)

Mr0grog commented 5 years ago

I'm thinking redirect handling will be better handled as a separate PR.

Sounds good to me.

Mr0grog commented 5 years ago

the restored walk sitemap command, which accepts a configuration and generates a sitemap.json file whenever you interrupt the process (control-C, which sends SIGINT).

Is the interrupt requirement just because we don’t have a protocol for the coordinator telling all the handlers that it’s done? (Maybe we should add that?)

Oh, hmmmm, so also, it seems like if you don’t have a config that trips over https://github.com/qri-io/walk/pull/2#discussion_r209515313, then this will exit automatically without generating a sitemap. That seems problematic :\
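One possible shape for the missing coordinator-to-handlers "done" protocol raised above: the coordinator closes a channel, and each handler treats the close as its cue to finalize (e.g. flush sitemap.json). This is a hedged sketch; `FinalizeHandler` and `runHandlers` are hypothetical names, not from the walk codebase.

```go
package main

// Sketch of a done-signal protocol: closing a channel broadcasts
// completion to every handler goroutine at once, and each handler
// runs its finalize step before the coordinator proceeds.

import (
	"fmt"
	"sync"
)

// FinalizeHandler is a stand-in for a resource handler with a flush step.
type FinalizeHandler struct {
	Name      string
	Finalized bool
}

// runHandlers blocks each handler on done, then marks it finalized.
func runHandlers(handlers []*FinalizeHandler, done <-chan struct{}) {
	var wg sync.WaitGroup
	for _, h := range handlers {
		wg.Add(1)
		go func(h *FinalizeHandler) {
			defer wg.Done()
			<-done // coordinator closed the channel: crawl is complete
			h.Finalized = true
		}(h)
	}
	wg.Wait()
}

func main() {
	handlers := []*FinalizeHandler{{Name: "sitemap"}, {Name: "cbor"}}
	done := make(chan struct{})
	close(done) // coordinator signals completion (normally after the queue drains)
	runHandlers(handlers, done)
	for _, h := range handlers {
		fmt.Printf("%s finalized: %v\n", h.Name, h.Finalized)
	}
}
```

With something like this in place, a normal exit (queue drained, StopAfterEntries reached) could flush the sitemap the same way an interrupt does.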

codecov-io commented 5 years ago

Codecov Report

Merging #13 into master will increase coverage by 2.43%. The diff coverage is 66.27%.


@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
+ Coverage   60.67%   63.11%   +2.43%     
==========================================
  Files           8       10       +2     
  Lines         417      553     +136     
==========================================
+ Hits          253      349      +96     
- Misses        128      155      +27     
- Partials       36       49      +13
| Impacted Files | Coverage Δ | |
|---|---|---|
| lib/walk.go | 73.33% <100%> (-3.14%) | ↓ |
| lib/resource.go | 62.31% <100%> (+2.94%) | ↑ |
| lib/resource_handler.go | 36.36% <16.66%> (+28.95%) | ↑ |
| lib/coordinator.go | 56.86% <22.72%> (-5.74%) | ↓ |
| lib/config.go | 80% <60%> (-6.67%) | ↓ |
| lib/badger.go | 75% <75%> (ø) | |
| lib/sitemap.go | 76.19% <76.19%> (ø) | |
| lib/worker.go | 55.83% <77.27%> (+0.08%) | ↑ |
| lib/queue.go | 80.95% <80.95%> (+5.95%) | ↑ |
| ... and 2 more | | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5e8f3fe...ee74ade.

b5 commented 5 years ago

Bunch of updates for ya' @Mr0grog. To test I've been running walk start --config=config.json with this config.json:

{
    "Coordinator": {
        "Domains": [
            "https://datatogether.org"
        ],
        "IgnorePatterns": [],
        "Seeds": [
            "https://datatogether.org"
        ],
        "StopAfterEntries": 0,
        "UnfetchedScanFreqMilliseconds": 30000,
        "BackupWriteInterval": 500,
        "BackoffResponseCodes": [
            403
        ],
        "DoneScanMilli": 1000
    },
    "Queue": {
        "Type": "local"
    },
    "RequestStore": {
        "Type": "local"
    },
    "Workers": [
        {
            "Type": "local",
            "Polite": false,
            "UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
            "Parallelism": 5,
            "RecordRedirects": true,
            "RecordResponseHeaders": true,
            "DelayMilli": 100
        }
    ],
    "ResourceHandlers": [{
        "type" : "sitemap",
        "prefix" : "sm_01",
        "destPath" : "sitemap.json"
    },{
        "type" : "cbor",
        "destPath" : "requests"
    }]
}
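For reference, a config like the one above can be read with a plain `encoding/json` unmarshal; the sketch below mirrors only a subset of the fields, and the struct names are guesses at shape rather than walk's actual types.

```go
package main

// Minimal sketch of parsing a walk-style config file. Only a few of
// the fields shown in the config above are modeled here.

import (
	"encoding/json"
	"fmt"
)

// CoordinatorConfig models a subset of the "Coordinator" section.
type CoordinatorConfig struct {
	Domains          []string
	Seeds            []string
	StopAfterEntries int
	DoneScanMilli    int
}

// Config models the top-level shape of the config file.
type Config struct {
	Coordinator  CoordinatorConfig
	Queue        map[string]string
	RequestStore map[string]string
}

// parseConfig unmarshals raw JSON bytes into a Config.
func parseConfig(data []byte) (*Config, error) {
	cfg := &Config{}
	if err := json.Unmarshal(data, cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"Coordinator":{"Seeds":["https://datatogether.org"],"DoneScanMilli":1000},"Queue":{"Type":"local"}}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Coordinator.Seeds[0], cfg.Queue["Type"])
}
```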
Mr0grog commented 5 years ago

Awesome! Dunno if I will have time to look tonight; will try and get it done sometime tomorrow.

b5 commented 5 years ago

a'ight @Mr0grog, you up

b5 commented 5 years ago

This isn't a direct fix for your config worries, but I think it's a more sustainable first step toward demystifying config: I've added a new command walk config that prints out the current config walk is using, which will be the default if no config is present.

I think we should at least add documentation about how to use this command to understand configuration, but is this sufficient to merge this branch @Mr0grog?

Mr0grog commented 5 years ago

Hmmm, I don’t think this addresses my concerns about configuring Badger, though, which is the complicated part (because those config options have to be looked up from outside this codebase).

Mr0grog commented 5 years ago

(If you do ./walk config it reads in the config.json you added, which doesn’t configure Badger, and so it doesn’t output any Badger config. That’s all fine if you aren’t using the sitemap resource handler, but as soon as you do, this command won’t really help you figure out what to add unless you know you have to first totally remove all config files.)

b5 commented 5 years ago

Ah yes, ok, mind if we merge this & open up an issue to discuss? I totally agree with your points, and think we should slow down & write a proper fix instead of tacking stuff onto a month-old PR 😉

Mr0grog commented 5 years ago

👍 works for me.

b5 commented 5 years ago

lovely