Closed b5 closed 5 years ago
Is this ready for (or desiring) review?
I'd love a mid-point check-in. Generally I like to formally "request review" when ready for such a thing, but I'm a huge fan of mid-way comments to keep PRs on the right track.
In my head at least two things are left to do in this PR:
I'm thinking redirect handling will be better handled as a separate PR.
(Should have said: mostly looks awesome! Hence my comments being pretty confined to naming and not much functionality or architecture :P)
> I'm thinking redirect handling will be better handled as a separate PR.
Sounds good to me.
> the restored `walk sitemap` command, which accepts a configuration and generates a `sitemap.json` file whenever you `SIGKILL` the process (control-C).

Is the `SIGKILL` requirement just because we don't have a protocol for the coordinator telling all the handlers that it's done? (Maybe we should add that?)
Oh, hmmmm, so also, it seems like if you don't have a config that trips over https://github.com/qri-io/walk/pull/2#discussion_r209515313, then this will exit automatically without generating a sitemap. That seems problematic :\
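One way the "coordinator tells handlers it's done" idea could look is sketched below. This is only an illustration, not the code in this PR: the `Resource` type, the `Handler` interface, its `FinalizeResources` method, and `runCoordinator` are all hypothetical names. The point is that when the crawl finishes on its own, every handler still gets one call in which it can write its output.

```go
package main

import (
	"fmt"
	"sync"
)

// Resource is a hypothetical stand-in for a crawled resource.
type Resource struct {
	URL string
}

// Handler is a hypothetical interface: HandleResource is called once per
// resource, FinalizeResources is called once after the crawl has finished.
type Handler interface {
	HandleResource(r Resource)
	FinalizeResources() error
}

// countingHandler records URLs and notes when it has been finalized.
type countingHandler struct {
	mu        sync.Mutex
	urls      []string
	finalized bool
}

func (h *countingHandler) HandleResource(r Resource) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.urls = append(h.urls, r.URL)
}

func (h *countingHandler) FinalizeResources() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.finalized = true // a sitemap handler would write its file here
	return nil
}

// runCoordinator fans resources out to every handler, then finalizes each
// handler once the channel is closed (that is, once the crawl is done).
func runCoordinator(resources <-chan Resource, handlers []Handler) error {
	for r := range resources {
		for _, h := range handlers {
			h.HandleResource(r)
		}
	}
	for _, h := range handlers {
		if err := h.FinalizeResources(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ch := make(chan Resource, 2)
	ch <- Resource{URL: "https://datatogether.org"}
	ch <- Resource{URL: "https://datatogether.org/about"}
	close(ch)

	h := &countingHandler{}
	if err := runCoordinator(ch, []Handler{h}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(h.urls), h.finalized)
}
```

With something along these lines, a crawl that runs out of URLs and a crawl that is interrupted could share the same finalize path, so the sitemap would not depend on how the process stops.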
Merging #13 into master will increase coverage by 2.43%. The diff coverage is 66.27%.
@@ Coverage Diff @@
## master #13 +/- ##
==========================================
+ Coverage 60.67% 63.11% +2.43%
==========================================
Files 8 10 +2
Lines 417 553 +136
==========================================
+ Hits 253 349 +96
- Misses 128 155 +27
- Partials 36 49 +13
Impacted Files | Coverage Δ | |
---|---|---|
lib/walk.go | 73.33% <100%> (-3.14%) | :arrow_down: |
lib/resource.go | 62.31% <100%> (+2.94%) | :arrow_up: |
lib/resource_handler.go | 36.36% <16.66%> (+28.95%) | :arrow_up: |
lib/coordinator.go | 56.86% <22.72%> (-5.74%) | :arrow_down: |
lib/config.go | 80% <60%> (-6.67%) | :arrow_down: |
lib/badger.go | 75% <75%> (ø) | |
lib/sitemap.go | 76.19% <76.19%> (ø) | |
lib/worker.go | 55.83% <77.27%> (+0.08%) | :arrow_up: |
lib/queue.go | 80.95% <80.95%> (+5.95%) | :arrow_up: |
... and 2 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e8f3fe...ee74ade. Read the comment docs.
Bunch of updates for ya' @Mr0grog. To test I've been running `walk start --config=config.json` with this config.json:
{
  "Coordinator": {
    "Domains": [
      "https://datatogether.org"
    ],
    "IgnorePatterns": [],
    "Seeds": [
      "https://datatogether.org"
    ],
    "StopAfterEntries": 0,
    "UnfetchedScanFreqMilliseconds": 30000,
    "BackupWriteInterval": 500,
    "BackoffResponseCodes": [
      403
    ],
    "DoneScanMilli": 1000
  },
  "Queue": {
    "Type": "local"
  },
  "RequestStore": {
    "Type": "local"
  },
  "Workers": [
    {
      "Type": "local",
      "Polite": false,
      "UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
      "Parallelism": 5,
      "RecordRedirects": true,
      "RecordResponseHeaders": true,
      "DelayMilli": 100
    }
  ],
  "ResourceHandlers": [
    {
      "type": "sitemap",
      "prefix": "sm_01",
      "destPath": "sitemap.json"
    },
    {
      "type": "cbor",
      "destPath": "requests"
    }
  ]
}
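For readers who want to see how the `ResourceHandlers` section of a config like this might be consumed, here is a minimal sketch using only `encoding/json`. The struct names and the `parseHandlers` helper are assumptions made for illustration and are not taken from walk's `lib/config.go`; the field names mirror the JSON keys shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourceHandlerConfig is a hypothetical struct mirroring one entry of the
// "ResourceHandlers" array in the config above.
type ResourceHandlerConfig struct {
	Type     string `json:"type"`
	Prefix   string `json:"prefix,omitempty"`
	DestPath string `json:"destPath"`
}

// Config is a hypothetical, partial view of the config file. Keys it does
// not declare (Coordinator, Queue, ...) are ignored by encoding/json.
type Config struct {
	ResourceHandlers []ResourceHandlerConfig `json:"ResourceHandlers"`
}

// parseHandlers decodes the handler entries and rejects any without a type.
func parseHandlers(data []byte) ([]ResourceHandlerConfig, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	for i, h := range cfg.ResourceHandlers {
		if h.Type == "" {
			return nil, fmt.Errorf("resource handler %d has no type", i)
		}
	}
	return cfg.ResourceHandlers, nil
}

func main() {
	data := []byte(`{
	  "Queue": {"Type": "local"},
	  "ResourceHandlers": [
	    {"type": "sitemap", "prefix": "sm_01", "destPath": "sitemap.json"},
	    {"type": "cbor", "destPath": "requests"}
	  ]
	}`)

	handlers, err := parseHandlers(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, h := range handlers {
		fmt.Printf("type=%s prefix=%q dest=%s\n", h.Type, h.Prefix, h.DestPath)
	}
}
```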
Awesome! Dunno if I will have time to look tonight; will try and get it done sometime tomorrow.
a'ight @Mr0grog, you up
This isn't a direct fix for your config worries, but I think it's a more sustainable first step toward demystifying config: I've added a new command, `walk config`, that prints out the config walk is currently using, which will be the default if no config is present.
I think we should at least add documentation about how to use this command to understand configuration, but is this sufficient to merge this branch @Mr0grog?
Hmmm, I don't think this addresses my concerns about configuring Badger, though, which is the complicated part (because those config options have to be looked up from outside this codebase).
(If you do `./walk config` it reads in the `config.json` you added, which doesn't configure Badger, and so it doesn't output any Badger config. That's all fine if you aren't using the `sitemap` resource handler, but as soon as you do, this command won't really help you figure out what to add unless you know you have to first totally remove all config files.)
Ah yes, ok, mind if we merge this & open up an issue to discuss? I totally agree with your points, and think we should slow down & write a proper fix instead of tacking stuff onto a month-old PR.
Works for me.
lovely
Should get us ready to test our first milestone.
Easiest way to play with this one is via the restored `walk sitemap` command, which accepts a configuration and generates a sitemap.json file whenever you `SIGKILL` the process (control-C).

This PR introduces badger as a key-value store that sitemap can rely on to store requests that can be built up as Resources come in, and output the final sitemap by iterating keys with a matching prefix to generate the sitemap itself. This approach also makes versioning different sitemap crawls by varying up the prefix a near-term possibility.
Based on previous work with sitemap building & massive amounts of RAM consumption, I feel like Badger will provide a lot of mileage for this project moving forward. This PR will eventually include a Badger implementation of the `RequestStore` interface.
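To make the prefix idea concrete, here is a rough sketch of building a sitemap by scanning keys that share a crawl prefix. It deliberately uses an in-memory map as a stand-in and does not call Badger's API; `memStore`, `crawlKey`, and `buildSitemap` are hypothetical names, and storing an HTTP status per URL is an assumption made purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// memStore is an in-memory stand-in for a key-value store such as Badger.
// It is NOT Badger's API; it only mimics "put" and "scan keys by prefix".
type memStore struct {
	data map[string]int // key -> HTTP status (illustrative value)
}

func newMemStore() *memStore { return &memStore{data: map[string]int{}} }

// crawlKey namespaces a URL under a crawl prefix, e.g. "sm_01:https://...".
// The ":" separator keeps "sm_01" from also matching "sm_010".
func crawlKey(prefix, url string) string { return prefix + ":" + url }

func (s *memStore) Put(prefix, url string, status int) {
	s.data[crawlKey(prefix, url)] = status
}

// scanPrefix returns, in sorted order, every key belonging to one crawl.
func (s *memStore) scanPrefix(prefix string) []string {
	var keys []string
	for k := range s.data {
		if strings.HasPrefix(k, prefix+":") {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// buildSitemap collects URL -> status for a single crawl prefix.
func buildSitemap(s *memStore, prefix string) map[string]int {
	out := map[string]int{}
	for _, k := range s.scanPrefix(prefix) {
		out[strings.TrimPrefix(k, prefix+":")] = s.data[k]
	}
	return out
}

func main() {
	s := newMemStore()
	s.Put("sm_01", "https://datatogether.org", 200)
	s.Put("sm_01", "https://datatogether.org/about", 200)
	s.Put("sm_02", "https://datatogether.org", 301) // a later crawl

	fmt.Println(len(buildSitemap(s, "sm_01")), len(buildSitemap(s, "sm_02")))
}
```

Because each crawl writes under its own prefix, an older sitemap stays readable while a newer crawl is in progress, which is the versioning possibility mentioned above.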