Closed HarshNarayanJha closed 2 weeks ago
This PR replaces `github.com/clbanning/mxj/v2` with the standard library's `encoding/xml`, using `xml.Decoder` to parse XML and extract the URLs within.
EDIT: All tests pass now. The PR is complete.
Co-Author: @yzqzss
Only two tests fail; I am trying to fix those.
Closes #84
Hey, thank you, how is it going?
Not going well. I am not sure what I am failing to capture in that big XML file. Any ideas, @CorentinB?
Thanks guys!
You're welcome
Tests
```fish
go: downloading git.archive.org/wb/gocrawlhq v1.2.13
internal/pkg/crawl/config.go:11:2: unrecognized import path "git.archive.org/wb/gocrawlhq": https fetch: Get "https://git.archive.org/wb/gocrawlhq?go-get=1": dial tcp 207.241.235.124:443: i/o timeout
=== RUN   TestJSON
=== RUN   TestJSON/Valid_JSON_with_URLs
=== RUN   TestJSON/Invalid_JSON
=== RUN   TestJSON/JSON_with_no_URLs
=== RUN   TestJSON/JSON_with_URLs_in_various_fields
=== RUN   TestJSON/JSON_with_array_of_URLs
--- PASS: TestJSON (0.00s)
    --- PASS: TestJSON/Valid_JSON_with_URLs (0.00s)
    --- PASS: TestJSON/Invalid_JSON (0.00s)
    --- PASS: TestJSON/JSON_with_no_URLs (0.00s)
    --- PASS: TestJSON/JSON_with_URLs_in_various_fields (0.00s)
    --- PASS: TestJSON/JSON_with_array_of_URLs (0.00s)
=== RUN   TestXML
=== RUN   TestXML/Valid_XML_with_URLs
=== RUN   TestXML/Empty_XML
=== RUN   TestXML/Invalid_XML
=== RUN   TestXML/XML_with_invalid_URL
=== RUN   TestXML/Huge_sitemap
    xml_test.go:88: XML() gotURLs count = 10000, want 100002
--- FAIL: TestXML (0.76s)
    --- PASS: TestXML/Valid_XML_with_URLs (0.00s)
    --- PASS: TestXML/Empty_XML (0.00s)
    --- PASS: TestXML/Invalid_XML (0.00s)
    --- PASS: TestXML/XML_with_invalid_URL (0.00s)
    --- FAIL: TestXML/Huge_sitemap (0.66s)
=== RUN   TestXMLBodyReadError
    xml_test.go:127: XML() expected error, got nil
--- FAIL: TestXMLBodyReadError (0.00s)
FAIL
FAIL	github.com/internetarchive/Zeno/internal/pkg/crawl/extractor	0.780s
FAIL
```