netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

No search results #192

Closed mbreemhaar closed 2 years ago

mbreemhaar commented 3 years ago

Hi,

I have some warc files created using warcit. Somehow after indexing (without errors or warnings), I can't find any page included in it on SolrWayback. I tested the files using pywb but that all works fine. I also tested some other warc files in SolrWayback and they work fine too. Now I'm not sure if there is a problem with my files or if it is a bug in SolrWayback.

I attached one of my warc files: (removed link)

Can anybody help me? I would really like to know why this is not working. Thanks!

thomasegense commented 3 years ago

Hi. I can confirm there seems to be something fishy with the warc-file. The indexer outputs:

WARC Indexer Finished in 3.373 seconds.
INFO Instrument - Performance statistics
WARCIndexer#content_types(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%) top 20 sort=time
WARCPayloadAnalyzers.analyze#droid(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%) top 5 sort=avgtime
WARCIndexer.extract#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
WARCIndexer.extract#archeaders(#=1, time=258.41ms, avg=0.00#/ms 258.41ms/#, 7.65%)
WARCIndexerCommand.main#total(#=1, time=3378.43ms, avg=0.00#/ms 3378.43ms/#, 99.99%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=3022.82ms, avg=0.00#/ms 3022.82ms/#, 89.46%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=1, time=348.78ms, avg=0.00#/ms 348.78ms/#, 10.32%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=49, time=300.01ms, avg=0.16#/ms 6.12ms/#, 8.88%)
SolrRecord.removeControlCharacters#total(#=507, time=14.66ms, avg=34.58#/ms 0.03ms/#, 0.43%)
SolrRecord.sanitiseUTF8(#=507, time=9.09ms, avg=55.75#/ms 0.02ms/#, 0.27%)

This tells us that 49 documents should have been created in Solr, but it seems that not a single document was created. My suspicion is this line in the WARC headers: WARC-Type: resource

The normal values we have seen in warc files are: 'response' or 'revisit'. I have assigned the issue to @tokee
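A quick way to see which record types a WARC file actually contains is to tally the WARC-Type header of every record. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, and the parser assumes well-formed records with a Content-Length header and no folded header lines. It is not how the warc-indexer reads files.

```python
import gzip
import io
from collections import Counter


def iter_warc_headers(stream):
    """Yield one dict of WARC headers (lower-cased names) per record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.rstrip(b"\r\n")
            if not raw:
                break  # a blank line ends the header block
            name, _, value = raw.partition(b":")
            key = name.decode("utf-8", "replace").strip().lower()
            headers[key] = value.decode("utf-8", "replace").strip()
        # Skip the payload so its bytes are never mistaken for headers.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def count_warc_types(path):
    """Count records per WARC-Type in a .warc or .warc.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return Counter(h.get("warc-type", "?") for h in iter_warc_headers(stream))


# Example call (the file name is only an illustration):
# print(count_warc_types("my-collection.warc.gz"))
```

A file produced from a directory of local files would be expected to show mostly 'resource' records here, while a web harvest would show 'request' and 'response' records.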

thomasegense commented 3 years ago

Oops, closed it by mistake.

tokee commented 3 years ago

Summary: This was due to a shortcoming in webarchive-discovery. It is fixed in the mentioned pull request and will either be merged to the fork used by SolrWayback when the pull request is accepted or @thomasegense gets impatient and does it directly.

Thank you very much for the bug report and the nice sample data, @mbreemhaar

thomasegense commented 3 years ago

It will also require several changes in SolrWayback. Playback depends on information in the HTTP header (status code, encoding, etc.), and all of that information is missing here. But with Toke's fix, 47 documents were generated in Solr (2 documents were discarded for some reason). I hope to get time to look into it next week if it is an easy fix.
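To illustrate what is missing: in a 'response' record the payload starts with the HTTP status line and headers, while in a 'resource' record the payload is the raw file content. The Python sketch below shows one way to tell the two apart and pull out the status code and Content-Type. The function name is hypothetical and this is not the logic used by SolrWayback or the warc-indexer.

```python
import re

# An HTTP response payload begins with a status line such as "HTTP/1.1 200 OK".
STATUS_LINE = re.compile(rb"^HTTP/\d(?:\.\d)? (\d{3})")


def http_info(payload: bytes):
    """Return (status_code, content_type) if the payload starts with an
    HTTP response header block, otherwise None."""
    match = STATUS_LINE.match(payload)
    if match is None:
        return None  # e.g. a 'resource' record: no HTTP headers at all
    head, _, _ = payload.partition(b"\r\n\r\n")
    content_type = None
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            content_type = value.strip().decode("latin-1")
    return int(match.group(1)), content_type


response_payload = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"Location: http://example.org/\r\n\r\n"
)
resource_payload = b"<html><body>hi</body></html>"

print(http_info(response_payload))  # status code and content type
print(http_info(resource_payload))  # None: nothing to take status/encoding from
```

For records without an HTTP header block, status and encoding have to come from somewhere else, which is why playback needs extra handling for this record type.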

thomasegense commented 3 years ago

@mbreemhaar I have a fix ready. But you will need to download a custom SolrWayback software bundle from me, if you are still interested in exploring your nice old 2002 collection. Tell me here if you want it. I can also email you some screenshots of your collection as seen in SolrWayback.

The fix (support for WARC type 'resource') will make it into the next release, but that can take some time. It was also necessary to fix it in the warc-indexer (as Toke did).

thomasegense commented 3 years ago

@mbreemhaar Can I add the WARC file you gave me to the unit-test resources for WARC parsing? I added a test for WARC files of type 'resource'.

mbreemhaar commented 3 years ago

@thomasegense It would be great if you could send me the custom software bundle with the fix. Unfortunately the collection I'm working with is not my own data, so I'll have to ask the owner for permission to have it added to the unit-test resources. I will let you know as soon as possible if this is okay.

thomasegense commented 3 years ago

@mbreemhaar You can download the patch here: https://drive.google.com/file/d/18gya3-vzKCb9xA7OolmsegO72FCVgmRt/view?usp=sharing

Unzip it, then:
1) Replace the files in the indexing folder
2) Replace solrwayback.war (apache-tomcat-8.5.60/webapps/)
3) Replace schema.xml (solr-7.7.3/server/solr/configsets/netarchivebuilder/conf/)

Then restart Solr and Tomcat and index the files again.

Tell me if it works :)

Your WARC file also has some entries (txt files) that seem to be statistics about the WARC files. Maybe warcit creates them. Since they are WARC records, they will be indexed as well. You can see them in the search results, but they do no harm to playback or anything else.

thomasegense commented 3 years ago

@mbreemhaar Was it fixed? Then I can close the issue. (I closed it by mistake earlier.)

mbreemhaar commented 3 years ago

Yes, it's working! Thank you very much!

thomasegense commented 3 years ago

Excellent! Enjoy your collection and all the data visualizations of the collection + playback.

Can I keep the warc file for unittest of warc type 'resource'?

mbreemhaar commented 3 years ago

It's not my data, so I asked the owner. I will let you know as soon as I get his response.

mbreemhaar commented 3 years ago

@thomasegense I'm sorry, unfortunately I am not allowed to publicly spread files from the collection.

thomasegense commented 3 years ago

@mbreemhaar I have deleted my copy. Thanks for sharing it though, it was a great help in adding support for 'resource' WARC files.

mkrzmr commented 2 years ago

Hello, I think I am having the same problem. Could you reupload the patch please?

thomasegense commented 2 years ago

@mkrzmr The latest 4.3.2 release has the fix included. Is it not working? If so, can you share the WARC file? Or give me the start of the WARC file: the WARC metadata plus the first two headers for the first record.
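If the whole file cannot be shared, the header blocks at the start of the file can be extracted on their own. The Python sketch below collects the WARC header block of the first few records without touching the payloads. The function names and the default limit are assumptions for illustration, and the parser assumes each record carries a Content-Length header.

```python
import gzip
import io


def first_record_headers(stream, limit=2):
    """Return the WARC header blocks of the first `limit` records as text."""
    blocks = []
    while len(blocks) < limit:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        lines = [line]
        length = 0
        for raw in iter(stream.readline, b""):
            if not raw.strip():
                break  # a blank line ends the header block
            lines.append(raw)
            name, _, value = raw.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        stream.read(length)  # skip the payload
        text = b"".join(lines).decode("utf-8", "replace")
        blocks.append(text.replace("\r\n", "\n"))
    return blocks


def print_warc_start(path, limit=2):
    """Print the first header blocks of a .warc or .warc.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        for block in first_record_headers(stream, limit):
            print(block)


# Example call (the file name is only an illustration):
# print_warc_start("my-collection.warc.gz")
```

Only the WARC headers are printed, so no page content from the collection ends up in the output.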

mkrzmr commented 2 years ago

Sure, here is the file.

It's just a random warc I made with the instructions from the documentation to see if the issue was with my files. But so far, I have not been able to index a warc.

And here is part of the log:

Parsing Archive File [1/1]:warcs1/www.iana.org.warc.gz
WARN WARCIndexer - Invalid status line: null@298
WARN WARCIndexer - Invalid status line: null@40024
WARN WARCIndexer - Invalid status line: null@40334
WARN WARCIndexer - Invalid status line: null@42113
WARN WARCIndexer - Invalid status line: null@44541
WARN WARCIndexer - Invalid status line: null@46431
WARN WARCIndexer - Invalid status line: null@46813
WARN WARCIndexer - Invalid status line: null@49678
WARN WARCIndexer - Invalid status line: null@52068
WARN WARCIndexer - Invalid status line: null@55623
WARN WARCIndexer - Invalid status line: null@57798
WARN WARCIndexer - Invalid status line: null@65220
WARN WARCIndexer - Invalid status line: null@68046
WARN WARCIndexer - Invalid status line: null@68433
WARN WARCIndexer - Invalid status line: null@70862
WARN WARCIndexer - Invalid status line: null@77926
WARN WARCIndexer - Invalid status line: null@80239
WARN WARCIndexer - Invalid status line: null@83229
WARN WARCIndexer - Invalid status line: null@230412
WARN WARCIndexer - Invalid status line: null@263287
WARN WARCIndexer - Invalid status line: null@294448
WARN WARCIndexer - Invalid status line: null@294795
WARN WARCIndexer - Invalid status line: null@299541

thomasegense commented 2 years ago

@mkrzmr Thanks for the WARC file.

I can confirm that I see the same problem as you. I have reopened the issue.

The WARC type is 'resource' which is what this original issue was about.

I will look into it tomorrow at work. Maybe type 'resource' was not fixed entirely, or there was a merge/regression error.

What documentation did you use to create the WARC file? It has been created from a file system and not from a web harvest (WebRecorder, Heritrix, wget, etc.).

If you just want to try SolrWayback, you can harvest some websites using the wget example described in the README page for SolrWayback and index them.

/Thomas

mkrzmr commented 2 years ago

Thanks Thomas,

the warc file was created with Warcit:


For example, the following example will download a simple web site via wget (for simplicity, this retrieves one level deep only), then use warcit to convert to www.iana.org.warc.gz:

wget -l 1 -r www.iana.org/
warcit http://www.iana.org/ ./www.iana.org/

The WARC www.iana.org.warc.gz should now have been created!

I have a collection made from local files that I want to import. I just made the iana.org file to try and see where the error comes from. It seems that warcit always adds 'resource'.

/Michael

thomasegense commented 2 years ago

@mkrzmr Thanks, that explains why these files are still made. But wget can create WARC files directly. See the SolrWayback README at https://github.com/netarchivesuite/solrwayback and scroll down to: 5) CREATING YOUR OWN WARCS - HARVESTING WITH WGET. This will also give you a lot more options, such as following links one level deep.

The advantage of using a web harvest is that the HTTP headers will be included. They give information about encoding, HTTP status, etc. If resources have moved (3xx codes), this information will also be saved, and it is important for correct playback.

Still, the issue needs to be resolved of course, and I hope it can be solved today.

thomasegense commented 2 years ago

@mkrzmr Hi, there was indeed a merging error. The fix will be part of the coming 3.5.0 release. You can download the new warc-indexer here: https://drive.google.com/file/d/13aza6dO2MXBmxvpxc5fN94mZLW4gGShF/view?usp=sharing

Still, if you make your own WARC files, I would recommend using a web crawler or just wget. Please tell me if it works for you :) Then I can close this again, for good this time :)

mkrzmr commented 2 years ago

Thank you, I can confirm the issue is resolved

thomasegense commented 2 years ago

Thanks, closing again. 3.5.0 will be released in a few weeks, I hope.