machawk1 / wail

:whale2: Web Archiving Integration Layer: One-Click User Instigated Preservation
https://matkelly.com/wail
MIT License
345 stars, 32 forks

Crawls of mkdc only return DNS record in WARC #458

Open machawk1 opened 4 years ago

machawk1 commented 4 years ago

Tested in both the basic and advanced interfaces. I tried crawling https://matkelly.com and the default https://matkelly.com/wail; both resulting WARCs contain only the DNS record.

Other URIs seem to produce the correct results.
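One way to confirm the symptom without opening the WARC by hand is to tally its records by type and target-URI scheme. The following is a minimal sketch, not part of WAIL: the function names `summarize_warc`, `summarize_warc_file`, and `captured_only_dns` are hypothetical, and it assumes a standard WARC layout (a `WARC/` version line, headers, a blank line, then `Content-Length` bytes of payload) with DNS lookups stored as `response` records whose `WARC-Target-URI` starts with `dns:`.

```python
import gzip
from collections import Counter


def summarize_warc(stream):
    """Count records in a binary WARC stream by (WARC-Type, URI scheme)."""
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the payload so its bytes are never mistaken for headers.
        stream.read(int(headers.get("content-length", "0")))
        uri = headers.get("warc-target-uri", "")
        scheme = uri.split(":", 1)[0] if ":" in uri else "(none)"
        counts[(headers.get("warc-type", "?"), scheme)] += 1
    return counts


def summarize_warc_file(path):
    """Open a .warc or .warc.gz file (path is supplied by the caller)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return summarize_warc(stream)


def captured_only_dns(counts):
    """True when every response/resource record targets a dns: URI."""
    schemes = {scheme for (rtype, scheme) in counts
               if rtype in ("response", "resource")}
    return schemes == {"dns"}
```

Calling `captured_only_dns(summarize_warc_file(path))` on a WARC from an affected crawl would be expected to return `True`, while a healthy crawl should also show `http` or `https` response records.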

machawk1 commented 4 years ago

Promoting this issue via pinning to give it priority.

Received a report from Wyeth Lynch, who was trying to capture https://www.sdstate.edu/covid-19 with WAIL 2019.05.21. I replicated this in the latest master and saw only a DNS record captured.


Need to recheck the generated Heritrix configuration to see why this is occurring.

Also, the UI/UX needs to be refined so that users do not get the impression that the crawl completes immediately, e.g., by giving direct access to the crawl status via a link or a button.

machawk1 commented 2 years ago

This might be attributable to whether the startup script includes the correct Heritrix libraries, per http://web.archive.org/web/20110928012834/http://tech.groups.yahoo.com/group/archive-crawler/message/772 .

The newer releases of Heritrix, when installed in WAIL, do not seem to exhibit the problem. A next step might be to diff the startup scripts.
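As a rough sketch of that next step, the two startup scripts could be compared with Python's standard `difflib`. The helper names `diff_scripts` and `diff_script_files` are hypothetical, as are the default labels. The actual script locations in the bundled and newer Heritrix installs would need to be supplied by whoever runs it.

```python
import difflib
from pathlib import Path


def diff_scripts(old_text, new_text, old_name="bundled", new_name="newer"):
    """Return a unified diff of two startup scripts given as strings."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    ))


def diff_script_files(old_path, new_path):
    """Read both scripts from disk (paths supplied by the caller) and diff them."""
    old_text = Path(old_path).read_text(encoding="utf-8", errors="replace")
    new_text = Path(new_path).read_text(encoding="utf-8", errors="replace")
    return diff_scripts(old_text, new_text, str(old_path), str(new_path))
```

Lines in the output that touch the classpath or library directories would be the first place to look, given the hypothesis above.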