Open machawk1 opened 4 years ago
Promoting this issue via pinning to give it priority.
Received a report from Wyeth Lynch trying to capture https://www.sdstate.edu/covid-19 with WAIL 2019.05.21. I replicated this in the latest master and only saw a DNS captured.
Need to recheck the generated Heritrix configuration to see what this is occurring.
Also, this UI/UX needs to be refined to give users the impression that the crawl does not immediately complete, e.g., give direct access via a link or a button to the crawl status.
This might be attributed to the startup script including the correct Heritrix libraries per http://web.archive.org/web/20110928012834/http://tech.groups.yahoo.com/group/archive-crawler/message/772 .
The newer releases of Heritrix, when installed in WAIL, do not seem to exhibit the problem. A next-step might be to diff the startup scripts.
Tested in both the basic and advanced interface, tried crawling https://matkelly.com and the default https://matkelly.com/wail, both resulting WARCs only contain the DNS record.
Other URIs seem to produce the correct results.