internetarchive / dweb-mirror

Offline Internet Archive project
https://www-dweb-mirror.dev.archive.org/
GNU Affero General Public License v3.0
273 stars, 31 forks

Unable to Crawl Collections or Websites #385

Closed: Perplexitus closed this issue 1 month ago

Perplexitus commented 1 month ago

Hi @mitra42,

Thanks for the information in Issue #383. That helped me confirm that dweb.me/info isn't necessary.

I'm attempting to deploy an Internet in a Box (IIAB) for home and family use as an emergency preparation item. I'm trying to crawl a few websites (not necessarily whole collections).

However, every crawl times out at the following step and gets no further (from journalctl): "dweb-transports:httptools p_httpfetch: https://dweb.me/info '' +0ms".
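For anyone who wants to check that timeout outside of the crawler, here is a minimal Python sketch. It only uses the standard library; the function name `probe` and the 5-second default are my own choices for illustration and are not part of dweb-mirror or dweb-transports.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return a short status string saying whether `url` answers in time.

    `probe` is a hypothetical helper for this thread, not a dweb-mirror API.
    `opener` can be swapped out so the function is testable without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return f"ok: HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The host answered, but with an error status code.
        return f"reachable, but HTTP {exc.code}"
    except OSError as exc:
        # Covers timeouts, DNS failures, and refused connections.
        return f"unreachable: {exc}"
```

Calling `probe("https://dweb.me/info")` from the IIAB box should show whether the host itself is the problem, separately from anything the crawler does.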

I've been reviewing a few of the documentation files:

I've also been looking at TransportHTTP.js, which sets urlbase: 'https://dweb.me'.

Questions:

I'm okay if there is no (current) way to get collections. My main objective is to crawl many/all of the child pages of a specific URL, such as churchofjesuschrist.org/study/

I appreciate your time :)

mitra42 commented 1 month ago
Perplexitus commented 1 month ago

Thanks for the pointers, @mitra42.

I fixed the issue by going to "/opt/iiab/internetarchive/node_modules/@internetarchive" and replacing every mention of "dweb.me" and "www-dweb-cors.dev." (including that period at the end) with "archive.org" in all of the child directories and files.

After I did that, the UI started working properly, and I can now search collections myself on my IIAB box. I now see what it means when it says that it acts as a proxy: a search bar appears after it connects successfully, and by using that search bar I'm able to browse archive.org.
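For community members who want to repeat the replacement step, here is a rough Python sketch of a bulk search-and-replace over a directory tree. The names `rewrite_hosts` and `REPLACEMENTS` are made up for this example. The mapping is also an assumption: it reads the "www-dweb-cors.dev." string mentioned above as the prefix of the host "www-dweb-cors.dev.archive.org", so that the full host ends up as "archive.org". Adjust the mapping if your files look different.

```python
from pathlib import Path

# Assumed mapping (one reading of the comment above, not a confirmed list).
REPLACEMENTS = {
    "www-dweb-cors.dev.archive.org": "archive.org",
    "dweb.me": "archive.org",
}


def rewrite_hosts(root, replacements=REPLACEMENTS, dry_run=True):
    """Replace old host names in every UTF-8 text file under `root`.

    Returns the files that contain a match. Nothing is written unless
    dry_run is False. Hypothetical helper, not part of dweb-mirror.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Work on bytes so existing line endings are left untouched.
            text = path.read_bytes().decode("utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        updated = text
        for old, new in replacements.items():
            updated = updated.replace(old, new)
        if updated != text:
            changed.append(path)
            if not dry_run:
                path.write_bytes(updated.encode("utf-8"))
    return changed
```

Running `rewrite_hosts("/opt/iiab/internetarchive/node_modules/@internetarchive")` first as a dry run lists the files that would change; passing `dry_run=False` applies the edits. Back up the directory beforehand, since a package reinstall or update would presumably overwrite the changes anyway.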

mitra42 commented 1 month ago

Great - glad it's working - and somewhat surprised, though maybe it's only some features that have to go through the "cors" gateway.

Perplexitus commented 1 month ago

I realize now that after clicking "Go" twice to search, it redirected me to archive.org.

So, I reviewed the journalctl output (for community members, that's "journalctl -u internetarchive -f") and found the culprits: some advanced search requests that didn't like an empty "and" array. https://archive.org/advancedsearch.php?output=json&q=churchofJesusChrist.org&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status

So, I had to remove "&and[]=" from the advanced search query in a few files, and delete a line containing "and[]" that manually built an advanced query.

After doing that, my searches started being successful.
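To show the effect of that change on the query string itself, here is a small Python sketch that drops an empty "and[]" parameter from a URL. It is only an illustration of the idea; `drop_empty_params` is a name I made up, and the real fix was the manual edit to the files described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_empty_params(url, names=("and[]",)):
    """Remove query parameters listed in `names` whose value is empty.

    Hypothetical helper for illustration; not part of dweb-mirror.
    """
    parts = urlsplit(url)
    # keep_blank_values=True so the empty "and[]=" is seen at all.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs
            if not (key in names and value == "")]
    # safe="[]" keeps "sort[]" readable instead of "sort%5B%5D".
    return urlunsplit(parts._replace(query=urlencode(kept, safe="[]")))
```

Passing the URL from the journalctl output through `drop_empty_params` leaves every other parameter in place, including "sort[]=-downloads", and only removes the empty "and[]=".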