Unable to Crawl Collections or Websites

Perplexitus commented 1 month ago

Hi @mitra42,

Thanks for the information in "Issue #383". That helped me confirm that dweb.me/info isn't necessary.

I'm attempting to deploy an Internet in a Box (IIAB) for home and family use as an emergency preparation item. I'm trying to crawl a few websites (not necessarily whole collections).

However, each crawl doesn't get past "dweb-transports:httptools p_httpfetch: https://dweb.me/info '' +0ms" (from journalctl), since it times out.

I've been reviewing a few of the documentation files:

API.md
URL_MAPPING.md
INSTALLATION-iiab-rpi.md

I've also been looking into TransportHTTP.js, which sets urlbase: 'https://dweb.me'.

Questions:

dweb.me seems to be down due to the October 9th incident. Is there a way to get collections from archive.org instead, such as by changing the urlbase in TransportHTTP.js?
internetarchive is also supposed to be a "proxy" and collect web pages that users navigate to (given that the iiab box has internet access, and the user is only connected to the iiab box (no mobile data)). However, I noticed that none of the pages are collected despite navigating to websites using the iiab as the access point. Any suggestions on troubleshooting this issue? EDIT: I see I haven't successfully gotten internet access when connected to iiab as a client. Getting "DNS_PROBE_FINISHED_BAD_CONFIG". Currently troubleshooting.
Can you give examples of an "identifier"? I see "prelinger" is the only example mentioned in the help pages of internetarchive. Would this refer to the prelinger collection at archive.org? EDIT: I now see in USING.md, it mentions using "foo" as an identifier; so https://archive.org/details/foobar would be correct using "foobar" as the identifier.

I'm okay if there is no (current) way to get collections. My main objective is to crawl many/all of the child pages of a specific URL, such as churchofjesuschrist.org/study/

I appreciate your time :)

mitra42 commented 1 month ago

dweb-mirror doesn't crawl websites, it only crawls Internet Archive items.
There was the intent to add websites from the Wayback Machine, but the technology used is very different from the static collections, and we didn't get to it when the project was put on hold.
Identifiers are the Archive identifiers, e.g. "prelinger" is the collection that is viewable on the main website as https://archive.org/details/prelinger
It's only a proxy for Internet Archive items, collections etc., NOT for web pages. It was never intended to do that (there might be something else on IIAB that does this, but I'm not familiar with what else is there).
To be honest, I can't remember what was at dweb.me and what was at other URLs, but if it went to something other than archive.org there was a reason, probably because the actual API isn't offered on archive.org. I'm not going to pester the techies at IA, as until the IA is fully back, they have other priorities.

Perplexitus commented 1 month ago

Thanks for the pointers, @mitra42.

I fixed the issue by going to "/opt/iiab/internetarchive/node_modules/@internetarchive" and replacing all mentions of "dweb.me" and "www-dweb-cors.dev." (including that period at the end) in all of the child directories and files with "archive.org".
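
For anyone who would rather script that search-and-replace than edit files by hand, here is a minimal Python sketch. The function name `replace_in_tree` and its `dry_run` flag are my own invention for illustration and are not part of dweb-mirror. It works on raw bytes so that line endings are left alone, and it skips files containing NUL bytes on the assumption that they are binary.

```python
from pathlib import Path


def replace_in_tree(root, replacements, dry_run=True):
    """Apply each old -> new string replacement to every file under root.

    Returns the list of paths whose content matches at least one key.
    With dry_run=True nothing is written; the matches are only reported.
    """
    matched = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            data = path.read_bytes()
        except OSError:
            continue  # unreadable file, leave it alone
        if b"\0" in data:
            continue  # assumption: a NUL byte means a binary file
        new_data = data
        for old, new in replacements.items():
            new_data = new_data.replace(old.encode("utf-8"), new.encode("utf-8"))
        if new_data != data:
            matched.append(path)
            if not dry_run:
                path.write_bytes(new_data)
    return matched
```

Calling `replace_in_tree("/opt/iiab/internetarchive/node_modules/@internetarchive", {"dweb.me": "archive.org"})` first lists the files that would change; passing `dry_run=False` then writes them. The "www-dweb-cors.dev." string can be added to the mapping the same way, after checking what follows it in each file so the resulting hostname is still valid. Backing up the directory first is a good idea, since a later npm update may overwrite these edits anyway.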

After I did that, the UI started working properly and I am able to search collections myself on my iiab box. I now see what it means when it says that it acts as a proxy. There's a search bar that appears after it connects successfully. By using that search bar, I'm able to browse archive.org.

mitra42 commented 1 month ago

Great - glad it's working - and somewhat surprised, though maybe it's only some features that have to go through the "cors" gateway.

Perplexitus commented 1 month ago

I realize now that after clicking "Go" twice to search, it redirected me to archive.org.

So, I reviewed the journalctl (for community members, that's "journalctl -u internetarchive -f"), and found the culprits. There were some advanced searches that didn't like an empty "and" array. https://archive.org/advancedsearch.php?output=json&q=churchofJesusChrist.org&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status

So, I had to remove "&and[]=" from the advanced search query in a few files, and delete a line containing "and[]" that manually built an advanced query.
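
To illustrate what that change does to the URL, here is a small Python sketch that drops only the empty "and[]" parameter from a query string and leaves everything else untouched. The helper name `drop_empty_and` is hypothetical; the actual fix was made by editing the JavaScript files, not with this code.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def drop_empty_and(url):
    """Return url without any `and[]` query parameter that has an empty value."""
    parts = urlsplit(url)
    kept = [
        param
        for param in parts.query.split("&")
        # unquote() so the percent-encoded form and%5B%5D= is caught too
        if unquote(param) not in ("and[]=", "and[]")
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))


if __name__ == "__main__":
    url = (
        "https://archive.org/advancedsearch.php?output=json"
        "&q=churchofJesusChrist.org&rows=30&page=1"
        "&sort[]=-downloads&and[]=&save=yes"
    )
    print(drop_empty_and(url))
```

A non-empty filter such as "and[]=mediatype%3Atexts" is kept as-is, so only the blank entry that broke the searches is removed.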

After doing that, my searches started being successful.

internetarchive / dweb-mirror

Unable to Crawl Collections or Websites #385