qri-io / walk

Webcrawler/sitemapper
GNU General Public License v3.0

Add HTML Test Cases with resources, links & subresources #3

Open b5 opened 5 years ago

b5 commented 5 years ago

Pssst, this would be a great PR to tackle if you don't code in Go: it's almost all HTML!

One of the aims of this project is to have workers produce a list of links and subresources. We need an initial test case that has 1 resource with at least one subresource and one link.
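To make the shape of that first test case concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the fixture markup, the file names (`/about.html`, `/logo.png`), and the regexp-based `collect` helper are made up and are not part of this project's code. Regular expressions are a naive stand-in here; a real worker would most likely use a proper HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// fixture is a hypothetical test page with exactly one link (<a href>)
// and one subresource (<img src>). The paths are made up.
const fixture = `<!DOCTYPE html>
<html>
  <head><title>Test Page</title></head>
  <body>
    <a href="/about.html">About</a>
    <img src="/logo.png" alt="logo">
  </body>
</html>`

var (
	// linkRe captures the href of anchor tags (naive, double-quoted attributes only).
	linkRe = regexp.MustCompile(`<a\s[^>]*href="([^"]*)"`)
	// subresRe captures the src of img and script tags (naive, same caveat).
	subresRe = regexp.MustCompile(`<(?:img|script)\s[^>]*src="([^"]*)"`)
)

// collect returns the first capture group of every match of re in html.
func collect(re *regexp.Regexp, html string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	fmt.Println("links:", collect(linkRe, fixture))
	fmt.Println("subresources:", collect(subresRe, fixture))
}
```

A fixture this small keeps the expected counts obvious: one link, one subresource, nothing else to argue about.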

Writing test cases for a web crawler

It's important that tests aim to be deterministic, meaning they should run the same every time. That's hard to guarantee for software that crawls live sites, because those sites can change between runs. So for that reason our tests of this software should kinda work like this:

Why this matters

A short-term objective of this project is to replace some existing infrastructure. We won't know if we can rely on this stuff until it's tested. The sooner we get a good set of tests, the more we can iterate on those test cases to bring them closer.

Current Problems & Antipatterns

Running tests:

Before we can incorporate these tests, we'll need to land #2. Once that's in, I'll help you add the Go code that actually runs the test. It's going to look something like this:

import (
    "net/http"
    "net/http/httptest"
    "testing"
)

func TestSubResourceCount(t *testing.T) {
    // serve the static HTML fixtures from a local test server
    s := httptest.NewServer(http.FileServer(http.Dir("testdata/qri_io")))
    defer s.Close()

    crawl, stop, err := NewWalk(
        JSONConfigFromFilepath("testdata/qri_io.config.json"),
        ServerJSONConfig(s),
    )
    if err != nil {
        t.Fatal(err)
    }
    crawl.Start(stop)

    // checks for returned resources ...
}

Steps to Complete: