iipc / openwayback

The OpenWayback Development
http://www.netpreserve.org/openwayback
Apache License 2.0
openwayback could index a .warc file but can not display it #392

Open guitarscape opened 5 years ago

guitarscape commented 5 years ago

we are testing openwayback using a .warc file generated by heritrix. we run openwayback on centos7+tomcat7. OWB seems capable of indexing the urls in the .warc file. however, when we click the version (date) shown in the search results, OWB reports: "Resource Not Available. The Resource you have requested is temporarily unavailable. Please try again later."

any suggestions and help would be appreciated.

ldko commented 5 years ago

Hi @guitarscape , that error usually means that the version you are looking for was found in the index as you say, but the application is unable to access the content from the WARC file where it is expecting to find it. Are you using the default configuration with BDB indexing etc.?

guitarscape commented 5 years ago

Hi lauren, thanks for your response! we only modified wayback.xml: we changed wayback.url.host.default, added wayback.url.context, and changed accessPointPath to "/wayback" so that openwayback is installed in a non-root tomcat context (based on this: https://github.com/iipc/openwayback/issues/160).

ldko commented 5 years ago

I presume you put your .warc file in one of the default directories where owb looks for those files, either ${wayback.basedir}/files1/ or ${wayback.basedir}/files2/, and that it was indexed there. Is the .warc file still there? Do the file permissions still allow it to be read? Has the file been altered since indexing (e.g. it started as a warc.gz and then was unzipped)?

Sorry if these questions seem obvious--I am accustomed to seeing the Resource Not Available message arise with people (such as me) using CDX indexes and path-index.txt; I can't remember having it coming up with BDB.
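
Two of the checks above (is the file readable, and has it been compressed or decompressed since indexing) can be scripted. The following is a minimal sketch and not part of OpenWayback: `check_warcs` is a hypothetical helper, and the directory passed to it is whichever one holds your WARCs (for example /tmp/openwayback/files1 in this thread). It should be run as the user Tomcat runs as, since readability is tested for the current user.

```python
import os
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def check_warcs(directory):
    """Return (file name, problem) pairs for WARC files found in directory."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        name = path.name
        if not (name.endswith(".warc") or name.endswith(".warc.gz")):
            continue  # ignore anything that is not named like a WARC
        if not os.access(path, os.R_OK):
            problems.append((name, "not readable by the current user"))
            continue
        with open(path, "rb") as handle:
            is_gzip = handle.read(2) == GZIP_MAGIC
        # A mismatch suggests the file was (de)compressed after it was named.
        if name.endswith(".gz") and not is_gzip:
            problems.append((name, "named .gz but not gzip-compressed"))
        elif name.endswith(".warc") and is_gzip:
            problems.append((name, "named .warc but gzip-compressed"))
    return problems
```

Calling `check_warcs("/tmp/openwayback/files1")` returns an empty list when nothing looks wrong. A mismatch between name and content does not prove that the file changed after indexing, but it is a quick hint that it might have.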

guitarscape commented 5 years ago

we used the default wayback.basedir (/tmp/openwayback) for testing purposes and put a warc file in /tmp/openwayback/files1. we also made sure that all files and folders are owned by tomcat:tomcat before starting tomcat. we also tried removing the file-db, index, and index-data folders in /tmp/openwayback so that they are recreated when tomcat is restarted, and the same outcome was observed. is there a need to modify any other xml files besides wayback.xml? Thanks for your help!

ldko commented 5 years ago

It sounds like you are minimally changing the configuration and have followed the steps given at these wiki pages, in which case you should not need to change other xml files: https://github.com/iipc/openwayback/wiki/How-to-configure https://github.com/iipc/openwayback/wiki/Deploying-OpenWayback-in-non-ROOT-Context

As far as troubleshooting goes:

Have you looked at the Tomcat log to see if there are any related messages?

You could try launching OpenWayback without any configuration changes from the default (using ROOT context) to see if it works.

To see if the issue is with the WARC file versus the setup, have you tried a different WARC file to see if you get the same result? You could also try your WARC file with a containerized instance of OpenWayback: https://github.com/iipc/openwayback/wiki/How-to-build-and-run-in-Docker

guitarscape commented 5 years ago

we stopped tomcat and removed the existing OWB and ROOT contexts. we renamed the OWB .war file to ROOT.war and removed the file-db, index, and index-data folders in /tmp/openwayback, but kept the .warc file in the files1 folder. Then we started tomcat. Exactly the same result. The Tomcat log shows code 200 for visiting the OWB home page and for conducting a search, but code 503 (resource not available) when actually clicking on a version (date).

we could play the same .warc file in pywb.

ldko commented 5 years ago

With a 503 there should be some error logged to your Tomcat log. Have you looked in your Tomcat's logs/catalina.out?

guitarscape commented 5 years ago

unfortunately there is no other log entry related to the error. our setup uses vanilla centos 7.6 + openjdk 1.8, tomcat 7.0

schmika commented 4 years ago

Hi,

have you been able to find a solution to this issue? We seem to be experiencing the same problem after an OS upgrade and a new BDB index for OpenWayback 2.3.0. Although the different captures are listed, OpenWayback returns Resource Not Available when requesting a specific capture. In the logs we see for these requests:

tomcat: WARNING:(1)LOADFAIL: Unable to locate(XYZ-2739857.warc) /20171111130122/http://www.xyz.com/ 
tomcat: WARNING: Runtime Error

It would be great if you could give us some pointers!

ldko commented 4 years ago

Hi @schmika, were the same WARCs working in OpenWayback 2.3.0 before the OS upgrade? What OS did you upgrade to/from? What versions of Java and Tomcat are you running, and did they change from when things were working? Do you see your WARCs listed with the correct paths in ${wayback.basedir}/file-db/incoming/files1 (replace "files1" with the directory name of your WARCs if needed)?

schmika commented 4 years ago

Thank you for your reply, @ldko! We've upgraded from Red Hat 6.10 to Red Hat 7.8. The Java (1.8) and Tomcat versions (7.0) have remained the same. We were previously able to display the WARC files and their paths are listed in the file you mentioned. What is curious is that sometimes after several unsuccessful attempts the request will be successful and the archived resource will be displayed correctly. I don't know whether that's related to our issue but I have read in the documentation that a BDB index should only be used for "very small" collections - is there a limit in terms of the number of WARC files or the overall size of the files where you'd recommend switching to a CDX index?

ldko commented 4 years ago

Hi @schmika , I don't know if anyone has determined a rough size limit for a cutoff on BDB index. I pretty much only use BDB for small testing purposes, otherwise I use CDX. I wouldn't think this would be causing your problem though, since the lookups for captures are working.

One thing that comes to mind to check is whether SELinux is enabled and enforcing on the machine, which could cause a problem with serving files. You can check this with the sestatus command. It does seem odd though if the same request will occasionally work...

ThomasOsterman commented 4 years ago

Dear @ldko,

Thank you very much for your answer. I am a colleague of @schmika. I just ran the sestatus command and got "disabled". We have 159,109 warc files. With this amount, would you recommend switching to a CDX index?

I found out that OpenWayback apparently tries to re-index warc files that have already been indexed. In the Tomcat messages I find lines like this:

/var/log/messages:Jun 29 00:07:29 tomcat: INFO: Queued ABC.warc for indexing.
/var/log/messages:Jun 29 00:08:28 tomcat: INFO: Indexing ABC.warc from (...)
/var/log/messages:Jun 29 00:08:29 tomcat: WARNING: WARNING: (...)/ABC.warcalready exists!
/var/log/messages:Jun 29 00:27:13 tomcat: INFO: AddedABC.warc (...)/ABC.warc
/var/log/messages:Jun 29 01:27:27 tomcat: WARNING: WARNING: (...)/ABC.warcalready exists!
/var/log/messages:Jun 29 02:20:49 tomcat: INFO: Added ABC.warc (...)/ABC.warc
/var/log/messages:Jun 29 03:43:01 tomcat: INFO: Indexing ABC.warc from (...)
/var/log/messages:Jun 29 03:43:03 tomcat: WARNING: WARNING: (...)/ABC.warcalready exists!

It seems that OpenWayback tries to index ABC.warc at 0:07, 1:27 and 3:43 again!

At 6:55 OpenWayback removes the file from the index queue and at 7:41 it adds the file again! Could this phenomenon be related to using a BDB index with our collection?

ldko commented 4 years ago

Hi @ThomasOsterman , I have not seen this behavior before. Looking at the source code, the "already exists!" warning happens when a WARC is indexed, but an index file for it is already where the new one would be copied to (index-data/incoming if it is not changed in BDBCollection.xml) before another process finds it there and moves it again to index-data/merged. I am not sure why your WARCs are getting queued for indexing more than once nor why the index files would be lingering in the index-data/incoming dir long enough to trigger the warning (there is an interval setting in BDBCollection.xml that controls how often that directory is checked). These could be symptoms of whatever is causing the WARC files not to be found when the web app is showing Resource Not Available, but I don't think they would be directly causing it.

One other thing I am wondering, are there duplicate WARC names with different paths in the file-db/state/* files?

Regarding your question about switching to CDX, I would recommend trying CDX files or CDX Server and a path-index.txt configuration, though you could try it with a small subset of your WARCs first to make sure it is working before indexing everything. I also recommend upgrading to OpenWayback 2.4.0.

ThomasOsterman commented 4 years ago

Dear @ldko,

Thank you very much for your explanations. I just checked some files and their paths in the file-db/state/* files. There are no duplicate WARC names with different paths. However, in file-db/state there are two files named files1 and files2. Some of the WARC files are listed in both files1 and files2, some are only listed in one of the files.

ldko commented 4 years ago

@ThomasOsterman for the duplicate WARC names that are listed in both files1 and files2, are they WARC files that happen to have the same name but different content (could potentially cause a problem as I believe the names are expected to be unique since the index of what content is inside the WARC specifies a WARC name but not path and wouldn't know which to use), are they duplicates of the same WARC that exist in both listed paths (shouldn't cause the problem), or does the duplicate WARC name only exist in one place (might cause a problem)?

ThomasOsterman commented 4 years ago

@ldko They are unique WARC files that exist only in one place but are listed both in files1 and files2. For example, both of the files contain the following line: ABC.warc /home/thomas/ABC.warc There is one file ABC.warc, which is located in /home/thomas, but is listed in both of the files.
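
With 159,109 WARCs, checking the file-db/state/* files by hand only covers a sample. Below is a minimal sketch of an exhaustive check. It assumes each line of a state file has the shape shown above (WARC name, whitespace, path); `read_state_files` and `find_suspects` are hypothetical helpers, not part of OpenWayback.

```python
from collections import defaultdict
from pathlib import Path


def read_state_files(state_dir):
    """Map each WARC name to {state file name: set of paths listed there}."""
    listing = defaultdict(lambda: defaultdict(set))
    for state_file in sorted(Path(state_dir).iterdir()):
        if not state_file.is_file():
            continue
        for line in state_file.read_text().splitlines():
            parts = line.split(None, 1)  # assumed layout: "<name> <path>"
            if len(parts) != 2:
                continue
            name, path = parts
            listing[name][state_file.name].add(path.strip())
    return listing


def find_suspects(listing):
    """Return names listed in several state files or with several paths."""
    suspects = {}
    for name, per_file in listing.items():
        paths = set().union(*per_file.values())
        if len(per_file) > 1 or len(paths) > 1:
            suspects[name] = {
                "state_files": sorted(per_file),
                "paths": sorted(paths),
            }
    return suspects
```

Running `find_suspects(read_state_files(".../file-db/state"))` on the case described above would report ABC.warc with a single path but two state files, which is exactly the pattern worth investigating.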

ThomasOsterman commented 4 years ago

@ldko I just noticed a configuration error in our BDBcollection.xml: We have two Wayback Archive dirs, ${wayback.archivedir.1} and ${wayback.archivedir.2}. Probably due to a careless mistake, the DirectoryResourceFileSource beans for both directories were called "files1". That might explain the phenomenon with the duplicate entries.

ldko commented 4 years ago

@ThomasOsterman That's good to know :). I just tried setting two DirectoryResourceFileSource beans to the same name like that, and it does cause the WARNING: (1)LOADFAIL: Unable to locate... error that @schmika reported. When the second DirectoryResourceFileSource with the same name is processed, the WARC names and their paths from that location get added to the file location db, and whatever entries were added for the first bean's path get removed. Thus, whatever WARCs were in the first location called "files1" get removed from the location db and can't be found. The content of the WARCs is still indexed, so searching for the URL returns results, but when the application then tries to look up where the WARC holding that content should be, it does not find the WARC name in the location database and has no path to serve content from. Fixing that configuration error and restarting Tomcat should fix the Resource Not Available message. Please let us know if it does!
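
The behavior described above can be illustrated with a toy model. This is not OpenWayback's code: `LocationDb` and `sync_source` are hypothetical names, and the only assumption is the one stated in the comment, namely that syncing a source replaces whatever that source name contributed before.

```python
class LocationDb:
    """Toy model of a WARC-name -> path lookup fed by named sources."""

    def __init__(self):
        self.paths = {}      # WARC name -> path used for replay
        self.by_source = {}  # source name -> entries from its last sync

    def sync_source(self, source_name, found):
        """Record the WARCs found by a source, dropping its stale entries."""
        previous = self.by_source.get(source_name, {})
        # Entries this source name contributed before, but no longer reports,
        # are treated as deleted files and removed from the lookup.
        for name in previous.keys() - found.keys():
            self.paths.pop(name, None)
        self.paths.update(found)
        self.by_source[source_name] = dict(found)


db = LocationDb()
db.sync_source("files1", {"A.warc": "/archive1/A.warc"})
db.sync_source("files1", {"B.warc": "/archive2/B.warc"})  # same name, other dir
lookup_after_clash = dict(db.paths)  # A.warc is gone, only B.warc remains

db = LocationDb()
db.sync_source("files1", {"A.warc": "/archive1/A.warc"})
db.sync_source("files2", {"B.warc": "/archive2/B.warc"})  # distinct names
lookup_after_fix = dict(db.paths)    # both WARCs can be located
```

In the model, two directories sharing one source name keep evicting each other's entries, while the capture index (not modeled here) still lists both WARCs. That matches the symptom of search results that cannot be replayed.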

ThomasOsterman commented 3 years ago

Dear @ldko,

The configuration error in the BDBCollection.xml was actually the reason for our problems and the duplicate entries. After renaming the DirectoryResourceFileSource bean for ${wayback.archivedir.2} to "files2" and deleting the old index, the index was built correctly. Thank you very much for your help!

pqhais commented 3 years ago

Hi!

I am using OWB2.4.1-SNAPSHOT to index and visualize .warc.gz files crawled with Heritrix and I'm getting the same error (The Resource you have requested is temporarily unavailable. Please try again later) but I'm using CDX instead of BDB.

I am trying with just 4 captures (1 warc.gz each) to find the error but I still can't solve it. What my indexing script does:
-Create .cdx of .warc.gz
-Sort
-Merge to 'index.cdx'
-Restart Tomcat

PS: I checked the content of both 'index.cdx' and 'path-index.txt' and they look OK.

When I try to visualize, OWB returns the error. Repeating the process gives different results: somehow one of these 4 can be visualized regularly and another one just sometimes.

I tried indexing and visualizing each of them separately and it works just fine, so I thought I was merging and sorting wrong. Then I tried creating a single .cdx for each warc.gz and configuring CDXCollection.xml for multiple CDX files to visualize all 4 together, but I still get the error!

Besides, I checked the permissions and ownership of the files and everything looks fine.

catalina.out:
Jan 11, 2021 7:09:28 PM org.archive.wayback.webapp.AccessPoint handleReplay
WARNING: (1)LOADFAIL: Unable to locate(WEB-20210111152504658-00000-14001~1234.warc.gz) /20210111152510/http://www.4set.eus/
Jan 11, 2021 7:09:28 PM org.archive.wayback.webapp.AccessPoint logError
WARNING: Runtime Error
org.archive.wayback.exception.ResourceNotAvailableException: Unable to locate(WEB-20210111152504658-00000-14001~1234.warc.gz)

Thank you.

I hope you can help me.
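
When the log contains many such entries, it helps to see which WARC names fail and how often. The sketch below relies only on the "LOADFAIL: Unable to locate(...)" shape quoted in this thread; `failed_warcs` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the WARC name inside "LOADFAIL: Unable to locate(<name>)".
LOADFAIL = re.compile(r"LOADFAIL: Unable to locate\(([^)]+)\)")


def failed_warcs(log_lines):
    """Count LOADFAIL warnings per WARC name in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = LOADFAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, `failed_warcs(open("logs/catalina.out"))` yields a counter keyed by WARC name. Those names can then be compared against the entries in path-index.txt (CDX setups) or the file-db state files (BDB setups).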

ldko commented 3 years ago

Hi @pqhais, the error message "Resource you have requested is temporarily unavailable." is usually not an issue with the cdx file. It seems like you have already checked most of the things that should be checked in troubleshooting. Since you say you can successfully index and replay each WARC individually, I am wondering whether, in addition to sorting your index.cdx file, you also sorted your path-index.txt?

pqhais commented 3 years ago

Hi @ldko thanks for your quick response!

I did sort the path-index.txt but still got the error. Btw, what issues usually cause this error?

Thank you.

ldko commented 3 years ago

The error occurs when the URI is found in the CDX index, but when trying to access the WARC file given for the resource via lookup of its location in the path-index.txt, something goes wrong such as:

Could you share the content from your path-index.txt?

pqhais commented 3 years ago

It seems to be working now. I forgot to save the sorted 'path-index.txt' so I was just displaying the sorted output in the terminal but using the unsorted version in Wayback.
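
A quick check of the files that Wayback actually reads would catch this kind of slip. The sketch below makes three assumptions that are not confirmed in this thread: path-index.txt holds one "name<TAB>path" entry per line, plain byte-order sorting (as produced by `LC_ALL=C sort`) is what is expected, and the WARC file name is the last field of each CDX line. All function names are hypothetical.

```python
def read_path_index(text):
    """Parse 'name<TAB>path' lines into a dict (assumed path-index layout)."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            name, _, path = line.partition("\t")
            entries[name] = path
    return entries


def is_byte_sorted(text):
    """True if the non-empty lines are in byte order, like LC_ALL=C sort."""
    lines = [line.encode("utf-8") for line in text.splitlines() if line.strip()]
    return lines == sorted(lines)


def warcs_missing_from_path_index(cdx_text, path_index_text):
    """List WARC names used in the CDX that have no path-index entry."""
    known = read_path_index(path_index_text)
    missing = set()
    for line in cdx_text.splitlines():
        fields = line.split()
        if not fields or line.startswith(" CDX"):
            continue  # skip blank lines and the CDX header line
        if fields[-1] not in known:  # assumption: file name is the last field
            missing.add(fields[-1])
    return sorted(missing)
```

Reading the deployed path-index.txt from disk and passing its text to `is_byte_sorted` tests the file itself rather than terminal output, which is the distinction that mattered here.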

I was going around in circles, thank you so much for your advice!

bjrne commented 3 years ago

Came here through google, and since this seems to be the only issue addressing this, I'll add my solution:

I was using the docker setup, and even though it indexed the example.com example properly, I could not view it.

In the end, it worked when using the exact same setup as in the wiki. I had tried mapping port 8080 in docker to 8082 outside while also setting the env variables accordingly, and the website loaded, but it never found the example warc when asked to display it. Once I freed port 8080 and ran with the original settings, it worked fine. Maybe that's due to some kind of internal routing breaking when the outside port is different from the inside one? But running with

docker container run -it --rm -v /tmp/owb:/data -p 8082:8082 -e WAYBACK_URL_PORT=8082 -e WAYBACK_URL_PREFIX=http://localhost:8082 iipc/openwayback

didn't work either, so I guess the port is hardcoded somewhere.

anjackson commented 3 years ago

Hi @bjrne - I think this might be down to the awkward way that parameters like WAYBACK_URL_PORT only refer to how the service is accessed, rather than configuring how it runs. Setting that variable tells OpenWayback that it should be accessed via the given port from the outside, but does not change the port it runs on (which is still 8080).

i.e. this should work:

docker container run -it --rm -v /tmp/owb:/data -p 8082:8080 -e WAYBACK_URL_PORT=8082 -e WAYBACK_URL_PREFIX=http://localhost:8082 iipc/openwayback

Note the -p 8082:8080 part.
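
To keep the two port numbers from drifting apart when scripting this, the command can be assembled from a single host port. This is a minimal sketch: `owb_docker_args` is a hypothetical helper, and the fixed container port 8080 is taken from the explanation above rather than verified against the image.

```python
CONTAINER_PORT = 8080  # per the comment above, the port OpenWayback runs on


def owb_docker_args(host_port, data_dir, image="iipc/openwayback"):
    """Build the docker argument list for exposing OpenWayback on host_port."""
    return [
        "docker", "container", "run", "-it", "--rm",
        "-v", f"{data_dir}:/data",
        # Host port on the left, fixed container port on the right.
        "-p", f"{host_port}:{CONTAINER_PORT}",
        # These only tell OpenWayback how it is reached from the outside.
        "-e", f"WAYBACK_URL_PORT={host_port}",
        "-e", f"WAYBACK_URL_PREFIX=http://localhost:{host_port}",
        image,
    ]
```

`" ".join(owb_docker_args(8082, "/tmp/owb"))` reproduces the command shown above; the list can also be passed directly to `subprocess.run`.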