qri-io / walk

Webcrawler/sitemapper
GNU General Public License v3.0

feat: initial API, round out flow for local proof-of-concept #20

Closed b5 closed 5 years ago

b5 commented 5 years ago

Opening this PR now so others can track progress & add comments.

Once finished, this PR should be an initial implementation of #16, and allow the following flow:

  1. Run a crawl, recording details to a local directory of CBOR records (walk start)
  2. Do another crawl, recording to a different local directory of CBOR records (walk start again, with a different target dir)
  3. Spin up an API server that reads from both local directories and allows querying by time and URL across both crawls (walk server)

Once that's possible, we merge, party, and move on to filing lots of bugs.

I'd like to break this PR into two parts: the first to land just a proof-of-concept workflow so others can play with it, and then a follow-up PR to incorporate feedback. This second PR should be where a lot of bug closing happens, and where we get out of proof-of-concept land and into usable-code land. By doing it this way I'm hoping a brave soul or two will be willing to use this buggy code to help write some initial documentation.

After that, only three things stand in our way of an initial staging server:

Frijol commented 5 years ago

I'd like to break this PR into two parts: the first to land just a proof-of-concept workflow so others can play with it, and then a follow-up PR to incorporate feedback. This second PR should be where a lot of bug closing happens, and where we get out of proof-of-concept land and into usable-code land. By doing it this way I'm hoping a brave soul or two will be willing to use this buggy code to help write some initial documentation.

Love this.

Mr0grog commented 5 years ago

From review on a call:

b5 commented 5 years ago

Oooook I think this is ready for review. Lots to do, but from our latest round of feedback:

I've also added a feature I think is important for this phase: the capacity to fetch seeds from a file or URL. A new string configuration property, config.Coordinator.SeedsPath, lets you supply either a URL or a filepath (relative to pwd, or absolute) pointing to a newline-delimited list of URLs to seed. I've been using this to point at the raw text from @Mr0grog's gist: https://gist.github.com/Mr0grog/40cdcd56b048d7f00b0d47d3aca70be0/raw/c6ad8f6f55c93ab46b033d4486033b249c8b65db/webmonitoring_active_urls.txt

The biggest thing I'd like to tackle next:

But all in all I think we should merge this & start into documentation & testing for the web monitoring use-case. Would love to hear if others agree.

b5 commented 5 years ago

Alright, after today's team call I'm re-fired up to take a run at saving this project. I'm going to merge this b/c we're already waaaaaay past master, and I can't write a proper "getting started" readme without doing some of the work we agreed needs doing first in today's call.