Open guitarscape opened 5 years ago
Hi @guitarscape, that error usually means that the version you are looking for was found in the index, as you say, but the application is unable to access the content from the WARC file where it expects to find it. Are you using the default configuration with BDB indexing etc.?
Hi Lauren, thanks for your response! We only modified wayback.xml: we changed wayback.url.host.default, added wayback.url.context, and changed accessPointPath to "/wayback" so that OpenWayback is installed in a non-root Tomcat context (based on this: https://github.com/iipc/openwayback/issues/160).
I presume you put your .warc file in one of the default directories where OWB looks for those files, either ${wayback.basedir}/files1/ or ${wayback.basedir}/files2/, and that it was indexed there. Is the .warc file still there? Are the file permissions still readable? Has the file been altered since indexing (e.g. it started as a warc.gz and was then unzipped)?
Sorry if these questions seem obvious. I am accustomed to seeing the Resource Not Available message arise for people (such as me) using CDX indexes and path-index.txt; I can't remember it coming up with BDB.
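For anyone else working through those three questions (is the file there, is it readable, has it been unzipped or re-zipped since indexing), here is a minimal Python sketch of the checks. The function name check_warc is made up for illustration, and the gzip test only looks at the first two bytes of the file, so treat it as a quick sanity check rather than a full validation.

```python
import os

# First two bytes of any gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def check_warc(path):
    """Return a list of problems found for one WARC file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path}: file not found"]
    if not os.access(path, os.R_OK):
        return [f"{path}: not readable by the current user"]

    problems = []
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC

    # A mismatch between name and content suggests the file was
    # compressed or decompressed after it was indexed.
    if path.endswith(".gz") and not is_gzip:
        problems.append(f"{path}: named .gz but content is not gzip-compressed")
    if not path.endswith(".gz") and is_gzip:
        problems.append(f"{path}: gzip-compressed but name does not end in .gz")
    return problems
```

The readability check reflects the user running the script, so it is only meaningful when run as the same user that runs Tomcat.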
We used the default wayback.basedir (/tmp/openwayback) for testing purposes and put a WARC file in /tmp/openwayback/files1. We also made sure that all files and folders are owned by tomcat:tomcat before starting Tomcat. We also tried removing the file-db, index, and index-data folders in /tmp/openwayback so that they are recreated when Tomcat is restarted, and the same outcome was observed. Is there a need to modify any other XML files besides wayback.xml? Thanks for your help!
It sounds like you are minimally changing the configuration and have followed the steps given at these wiki pages, in which case you should not need to change other xml files: https://github.com/iipc/openwayback/wiki/How-to-configure https://github.com/iipc/openwayback/wiki/Deploying-OpenWayback-in-non-ROOT-Context
As far as troubleshooting goes:
Have you looked at the Tomcat log to see if there are any related messages?
You could try launching OpenWayback without any configuration changes from the default (using ROOT context) to see if it works.
To see if the issue is with the WARC file versus the setup, have you tried a different WARC file to see if you get the same result? You could also try your WARC file with a containerized instance of OpenWayback: https://github.com/iipc/openwayback/wiki/How-to-build-and-run-in-Docker
We stopped Tomcat and removed the existing OWB and ROOT contexts. We renamed the OWB .war file to ROOT.war and removed the file-db, index, and index-data folders in /tmp/openwayback, but kept the .warc file in the files1 folder. Then we started Tomcat. Exactly the same result. The Tomcat log shows code 200 when visiting the OWB home page and running a search, but code 503 (resource not available) when clicking on a version (date).
We could play the same .warc file in pywb.
With a 503 there should be some error logged to your Tomcat log. Have you looked in your Tomcat's logs/catalina.out?
Unfortunately there is no other log entry related to the error. Our setup uses vanilla CentOS 7.6 + OpenJDK 1.8, Tomcat 7.0.
Hi,
have you been able to find a solution to this issue? We seem to be experiencing the same problem after an OS upgrade and a new BDB index for OpenWayback 2.3.0. Although the different captures are listed, OpenWayback returns Resource Not Available when requesting a specific capture.
In the logs we see for these requests:
tomcat: WARNING:(1)LOADFAIL: Unable to locate(XYZ-2739857.warc) /20171111130122/http://www.xyz.com/
tomcat: WARNING: Runtime Error
It would be great if you could give us some pointers!
Hi @schmika, were the same WARCs working in OpenWayback 2.3.0 before the OS upgrade? What OS did you upgrade to/from? What versions of Java and Tomcat are you running, and did this change from when things were working? Do you see your WARCs listed with the correct paths in ${wayback.basedir}/file-db/incoming/files1 (replace "files1" with the directory name of your WARCs if needed)?
Thank you for your reply, @ldko! We've upgraded from Red Hat 6.10 to Red Hat 7.8. The Java (1.8) and Tomcat versions (7.0) have remained the same. We were previously able to display the WARC files and their paths are listed in the file you mentioned. What is curious is that sometimes after several unsuccessful attempts the request will be successful and the archived resource will be displayed correctly. I don't know whether that's related to our issue but I have read in the documentation that a BDB index should only be used for "very small" collections - is there a limit in terms of the number of WARC files or the overall size of the files where you'd recommend switching to a CDX index?
Hi @schmika, I don't know if anyone has determined a rough size limit for a cutoff on BDB index. I pretty much only use BDB for small testing purposes; otherwise I use CDX. I wouldn't think this would be causing your problem though, since the lookups for captures are working.
One thing that comes to mind to check is whether SELinux is enabled and enforcing on the machine, which could cause a problem with serving files. You can check this with the sestatus command. It does seem odd though if the same request will occasionally work...
Dear @ldko,
Thank you very much for your answer. I am a colleague of @schmika. I just ran the sestatus command and got disabled. We have 159,109 WARC files. With this amount, would you recommend switching to a CDX index?
I found out that OpenWayback apparently tries to re-index WARC files that have already been indexed. In the Tomcat messages I find lines like this:
/var/log/messages:Jun 29 00:07:29 tomcat: INFO: Queued ABC.warc for indexing.
/var/log/messages:Jun 29 00:08:28 tomcat: INFO: Indexing ABC.warc from (...)
/var/log/messages:Jun 29 00:08:29 tomcat: WARNING: WARNING: (...)/ABC.warcalready exists!
/var/log/messages:Jun 29 00:27:13 tomcat: INFO: AddedABC.warc (...)/ABC.warc
/var/log/messages:Jun 29 01:27:27 tomcat: WARNING: WARNING: (...)/ABC.warcalready exists!
/var/log/messages:Jun 29 02:20:49 tomcat: INFO: Added ABC.warc (...)/ABC.warc
/var/log/messages:Jun 29 03:43:01 tomcat: INFO: Indexing ABC.warc from (...)
/var/log/messages:Jun 29 03:43:03 tomcat: WARNING: WARNING: (...)/ABC.warcalready exists!
It seems that OpenWayback tries to index ABC.warc at 0:07, 1:27 and 3:43 again!
At 6:55 OpenWayback removes the file from the index queue and at 7:41 it adds the file again! Could this phenomenon be related to using a BDB index with our collection?
Hi @ThomasOsterman ,
I have not seen this behavior before. Looking at the source code, the "already exists!" warning happens when a WARC is indexed, but an index file for it is already where the new one would be copied to (index-data/incoming if it is not changed in BDBCollection.xml) before another process finds it there and moves it again to index-data/merged. I am not sure why your WARCs are getting queued for indexing more than once nor why the index files would be lingering in the index-data/incoming dir long enough to trigger the warning (there is an interval setting in BDBCollection.xml that controls how often that directory is checked). These could be symptoms of whatever is causing the WARC files not to be found when the web app is showing Resource Not Available, but I don't think they would be directly causing it.
One other thing I am wondering, are there duplicate WARC names with different paths in the file-db/state/* files?
Regarding your question about switching to CDX, I would recommend trying CDX files or CDX Server and a path-index.txt configuration. Though you could try it with a small subset of your WARCs to make sure it is working before indexing everything. I also recommend upgrading to OpenWayback 2.4.0.
Dear @ldko,
Thank you very much for your explanations. I just checked some files and their paths in the file-db/state/* files. There are no duplicate WARC names with different paths. However, in file-db/state there are two files named files1 and files2. Some of the WARC files are listed in both files1 and files2; some are only listed in one of the files.
@ThomasOsterman for the duplicate WARC names that are listed in both files1 and files2, are they WARC files that happen to have the same name but different content (could potentially cause a problem as I believe the names are expected to be unique since the index of what content is inside the WARC specifies a WARC name but not path and wouldn't know which to use), are they duplicates of the same WARC that exist in both listed paths (shouldn't cause the problem), or does the duplicate WARC name only exist in one place (might cause a problem)?
@ldko They are unique WARC files that exist only in one place but are listed both in files1 and files2. For example, both of the files contain the following line:
ABC.warc /home/thomas/ABC.warc
There is one file ABC.warc, which is located in /home/thomas, but it is listed in both of the files.
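In case it helps others check for the same symptom, below is a rough Python sketch that reports WARC names listed in more than one file under file-db/state. It assumes each line has the form shown above (WARC name, whitespace, path); the function name is hypothetical and nothing here is part of OpenWayback itself.

```python
import glob
import os
from collections import defaultdict


def names_in_multiple_state_files(state_dir):
    """Map WARC names to the state files listing them, for names in 2+ files."""
    seen = defaultdict(set)
    for state_file in sorted(glob.glob(os.path.join(state_dir, "*"))):
        if not os.path.isfile(state_file):
            continue
        with open(state_file, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                # Assumed line format: "<warc name> <path>".
                parts = line.split(None, 1)
                if parts:
                    seen[parts[0]].add(os.path.basename(state_file))
    return {
        name: sorted(files)
        for name, files in seen.items()
        if len(files) > 1
    }
```

Calling it on the file-db/state directory returns a dictionary such as {"ABC.warc": ["files1", "files2"]} for the situation described here, and an empty dictionary when every name appears in only one state file.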
@ldko I just noticed a configuration error in our BDBCollection.xml: We have two Wayback archive dirs, ${wayback.archivedir.1} and ${wayback.archivedir.2}. Probably due to a careless mistake, the DirectoryResourceFileSource beans for both directories were called "files1". That might explain the phenomenon with the duplicate entries.
@ThomasOsterman That's good to know :). I just tried setting two DirectoryResourceFileSource beans to the same name like that, and that does cause the WARNING: (1)LOADFAIL: Unable to locate... error that @schmika reported. What happens when processing the second DirectoryResourceFileSource with the same name is that the WARC names and their paths from that location get added to the file location db, and whatever entries were added to the file location db for the first bean's path get removed. Thus, whatever WARCs were in the first location called "files1" get removed from the location db and can't be found. The content of the WARCs is still indexed, so searching for the URL returns results, but when the application then tries to look up where the WARC holding that content should be, it does not find the WARC name in the location database and has no path to serve content from. Fixing that configuration error and restarting Tomcat should fix the Resource Not Available message. Please let us know if it does!
Dear @ldko,
The configuration error in the BDBCollection.xml was actually the reason for our problems and the duplicate entries. After renaming the DirectoryResourceFileSource bean for ${wayback.archivedir.2} to "files2" and deleting the old index, the index was built correctly. Thank you very much for your help!
Hi!
I am using OWB2.4.1-SNAPSHOT to index and visualize .warc.gz files crawled with Heritrix and I'm getting the same error (The Resource you have requested is temporarily unavailable. Please try again later) but I'm using CDX instead of BDB.
I am trying with just 4 captures (1 warc.gz each) to find the error but I still can't solve it. What my indexing script does:
- Create .cdx of .warc.gz
- Sort
- Merge to 'index.cdx'
- Restart Tomcat
PS: I checked the content of both 'index.cdx' and 'path-index.txt' and they look OK.
When I try to visualize OWB returns the error. Repeating the process returns different results as somehow one of these 4 can be visualized regularly and another one just sometimes.
I tried indexing and visualizing each of them separately and works just fine, so I thought I was merging and sorting wrong. Then I tried creating a single .cdx for each warc.gz and configuring CDXCollection.xml for multiple CDX to visualize all 4 together, but I still get the error!
Besides, I checked the permissions and ownership of the files and everything looks fine.
catalina.out:
Jan 11, 2021 7:09:28 PM org.archive.wayback.webapp.AccessPoint handleReplay
WARNING: (1)LOADFAIL: Unable to locate(WEB-20210111152504658-00000-14001~1234.warc.gz) /20210111152510/http://www.4set.eus/
Jan 11, 2021 7:09:28 PM org.archive.wayback.webapp.AccessPoint logError
WARNING: Runtime Error
org.archive.wayback.exception.ResourceNotAvailableException: Unable to locate(WEB-20210111152504658-00000-14001~1234.warc.gz)
Thank you.
I hope you can help me.
Hi @pqhais, the error message "Resource you have requested is temporarily unavailable." is usually not an issue with the CDX file. It seems like you have already checked most of the things that should be checked in troubleshooting. Since you say you can successfully index and replay each WARC individually, I am wondering whether, in addition to sorting your index.cdx file, you also sorted your path-index.txt.
Hi @ldko thanks for your quick response!
I did sort the path-index.txt but still got the error. Btw, what are the issues that usually cause this error?
Thank you.
The error occurs when the URI is found in the CDX index, but when trying to access the WARC file given for the resource via lookup of its location in the path-index.txt, something goes wrong such as:
Could you share the content from your path-index.txt?
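To make that lookup concrete, here is a small Python sketch that pulls WARC names out of LOADFAIL log lines like the ones quoted in this thread and reports which of them have no entry in path-index.txt. It assumes each path-index.txt line starts with the WARC name followed by whitespace and the path; the function names are hypothetical.

```python
import re

# Matches the WARC name inside "LOADFAIL: Unable to locate(<name>)".
LOADFAIL_RE = re.compile(r"LOADFAIL: Unable to locate\(([^)]+)\)")


def failed_warc_names(log_text):
    """Return the distinct WARC names mentioned in LOADFAIL warnings."""
    return sorted(set(LOADFAIL_RE.findall(log_text)))


def missing_from_path_index(log_text, path_index_text):
    """Return failed WARC names that have no line in the path index."""
    indexed = {
        line.split(None, 1)[0]
        for line in path_index_text.splitlines()
        if line.strip()
    }
    return [name for name in failed_warc_names(log_text) if name not in indexed]
```

If a name comes back as missing, the path index is the place to look. If nothing is missing, the entry exists, so the next things to check are the path it points to and the sort order of the file.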
It seems to be working now. I forgot to save the sorted 'path-index.txt' so I was just displaying the sorted output in the terminal but using the unsorted version in Wayback.
I was going around in circles, thank you so much for your advice!
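Since the root cause here was a path-index.txt that was sorted on screen but never saved, a sketch like the following may help others avoid the same trap. It checks whether the file on disk is sorted and, if asked, rewrites it in sorted order. It assumes plain bytewise ordering (what LC_ALL=C sort produces), which is my understanding of what OpenWayback expects; the function names are made up for illustration.

```python
def is_bytewise_sorted(path):
    """Return True if the lines of the file are in ascending byte order."""
    prev = None
    with open(path, "rb") as fh:
        for raw in fh:
            line = raw.rstrip(b"\r\n")
            if prev is not None and line < prev:
                return False
            prev = line
    return True


def sort_in_place(path):
    """Rewrite the file with its non-empty lines sorted bytewise."""
    with open(path, "rb") as fh:
        lines = [raw.rstrip(b"\r\n") for raw in fh if raw.strip()]
    lines.sort()
    with open(path, "wb") as fh:
        fh.writelines(line + b"\n" for line in lines)
```

Running is_bytewise_sorted on the file that Wayback actually reads, rather than on terminal output, confirms whether the sorted version was saved.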
Came here through google, and since this seems to be the only issue addressing this, I'll add my solution:
I was using the Docker setup, and even though it indexed the example.com example properly, I could not view it.
In the end, it worked when using the exact same setup as in the wiki. I had tried mapping port 8080 inside Docker to 8082 outside while also setting the env variables accordingly; the website loaded, but it never found the example WARC when asked to display it. Once I freed port 8080 and ran with the original settings, it worked fine. Maybe that's due to some kind of internal routing breaking when the outside port is different from the inside one? But running with
docker container run -it --rm -v /tmp/owb:/data -p 8082:8082 -e WAYBACK_URL_PORT=8082 -e WAYBACK_URL_PREFIX=http://localhost:8082 iipc/openwayback
didn't work either, so I guess the port is hardcoded somewhere.
Hi @bjrne - I think this might be down to the awkward way that parameters like WAYBACK_URL_PORT only refer to how the service is accessed, rather than configuring how it runs. Setting that variable tells OpenWayback that it should be accessed via the given port from the outside, but does not change the port it runs on (which is still 8080).
i.e. this should work:
docker container run -it --rm -v /tmp/owb:/data -p 8082:8080 -e WAYBACK_URL_PORT=8082 -e WAYBACK_URL_PREFIX=http://localhost:8082 iipc/openwayback
Note the -p 8082:8080 part.
We are testing OpenWayback using a .warc file generated by Heritrix. We run OpenWayback on CentOS 7 + Tomcat 7. OWB seems capable of indexing the URLs in the .warc file. However, when we click the version (date) shown in the search results, OWB reports: Resource Not Available. The Resource you have requested is temporarily unavailable. Please try again later.
Any suggestions and help would be appreciated.