pssst, this would be a great PR to tackle if you don't code in go, it's almost all HTML!
One of the aims of this project is to have workers produce a list of links and subresources. We need an initial test case that has 1 resource with at least one subresource and one link.
Writing test cases for a web crawler
It's important that tests aim to be deterministic, meaning they run the same way every time. That's hard to guarantee for software that crawls live sites, so our tests of this software should work roughly like this:
We make small websites that are (generally) just static HTML. These should be really concise, only highlighting things that crawlers care about like `title` tags, `<a href="...">` link elements, subresources like `<script src="...">` elements, and eventually weird edge cases like `<svg>` and `<canvas>` elements that load external assets in strange ways.
While these tests should be short, they should approximate real-world HTML. Real HTML includes mistakes (e.g. unclosed tags, capitalization errors, weird text encodings). These weird conditions should be split out into small, separate test cases.
We also need to be able to human-verify aspects of these web pages, like "this page has 4 outbound links, one of those links is relative, the other three are to different domains."
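As a sketch, a human-verifiable test page like the one just described might look like this (filenames and domains here are placeholders, not part of the repo):

```html
<!DOCTYPE html>
<html>
<head>
  <title>four-links test case</title>
</head>
<body>
  <!-- one relative link -->
  <a href="/about.html">about</a>
  <!-- three links to different domains -->
  <a href="https://a.example.com/">domain a</a>
  <a href="https://b.example.com/">domain b</a>
  <a href="https://c.example.com/">domain c</a>
</body>
</html>
```

Anyone reviewing the test case can count the links by eye and confirm they match the crawler's expected output.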
For each test of our crawler we:
spin up these test cases as a local server
create the crawler with a specific configuration
send the crawler at that local server
check that what the crawler found matches our expectations
An existing short-term objective of this project is to replace some existing infrastructure. We won't know if we can rely on this stuff until it's tested. The sooner we get a good set of tests, the sooner we can iterate on those test cases to bring them closer to real-world conditions.
Current Problems & Antipatterns
The base test case I've added is too messy; the examples we create should be more concise, and named after the thing each case is intended to test.
We don't have a consistent test harness definition. More on that soon.
Running tests:
Before we can incorporate these tests, we'll need to land #2. Once that's in, I'll help you add the go code that actually runs the test. It's going to look something like this:
func TestSubResourceCount(t *testing.T) {
	// serve the static test case directory over a local HTTP server
	s := httptest.NewServer(http.FileServer(http.Dir("testdata/qri_io")))
	defer s.Close()

	// create the crawler with a test-case-specific configuration,
	// pointed at the local server
	crawl, stop, err := NewWalk(
		JSONConfigFromFilepath("testdata/qri_io.config.json"),
		ServerJSONConfig(s),
	)
	if err != nil {
		t.Fatal(err.Error())
	}
	crawl.Start(stop)
	// checks for returned resources ...
}
Steps to Complete:
[ ] Clone / Fork this Repo
[ ] Compose a concise, static HTML page with at least 1 subresource and 1 link
[ ] Call that file index.html and put it in github.com/qri-io/walk/lib/testdata/[test_case_name]
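As a starting point, a minimal sketch of that `index.html` (the script filename and link target are placeholders you'd choose for your test case) could be:

```html
<!DOCTYPE html>
<html>
<head>
  <title>minimal subresource-and-link test case</title>
  <!-- one subresource -->
  <script src="script.js"></script>
</head>
<body>
  <!-- one outbound link -->
  <a href="https://example.com/">an outbound link</a>
</body>
</html>
```

Keeping the page this small makes it easy to verify by hand that the crawler should find exactly one subresource and one link.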