ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki
113 stars 24 forks source link

Do we still need source_file_path? #308

Closed anjackson closed 1 month ago

anjackson commented 1 year ago

As far as I can tell, there is not longer any reference to filling out the source_file_path Solr field. Is it not longer used? Can we drop it?

anjackson commented 6 months ago

Looks like I dropped SOURCE_FILE_PATH in https://github.com/ukwa/webarchive-discovery/commit/82f6c9db9a8542d63d1e55cd2fe0bbb08e5d5d5f#diff-6f827fcfa0b427773105ef9a05ccba9498c1b6f1d331624461ee6cf57cad5582L515 on Sep 29, 2022 but this may have been a mistake from the futzing about on the Hadoop 3 branch.

I think the modified line uses the full path and overwrites the value passed in and set at https://github.com/ukwa/webarchive-discovery/blob/9d07645953d252f415b1aad4dc43766258363ea5/warc-indexer/src/main/java/uk/bl/wa/indexer/WARCIndexer.java#L560

anjackson commented 6 months ago

Looks like this arose because the Hadoop version passes in the full path, and only needs the full path. But the CLI version behaves differently, passing in the filename for SOURCE_FILE and then setting SOURCE_FILE_PATH from:

https://github.com/ukwa/webarchive-discovery/blob/9d07645953d252f415b1aad4dc43766258363ea5/warc-indexer/src/main/java/uk/bl/wa/indexer/WARCIndexer.java#L513C48-L513C48

But I'm not sure where ArchiveRecordHeader.getReaderIdentifier is getting its information from or whether it's what we want. Ah, looks like it gets passed in during setup and read at

https://github.com/iipc/webarchive-commons/blob/7cb3784cea86feedae335e7ec45ba46e924099d9/src/main/java/org/archive/io/warc/WARCReaderFactory.java#L236

Where a File gets passed to the setup at:

https://github.com/ukwa/webarchive-discovery/blob/9d07645953d252f415b1aad4dc43766258363ea5/warc-indexer/src/main/java/uk/bl/wa/indexer/WARCIndexerCommand.java#L202

Looks like the Hadoop clients get this backward, only passing in the filename when setting up the readers, e.g.:

https://github.com/ukwa/webarchive-discovery/blob/9d07645953d252f415b1aad4dc43766258363ea5/warc-hadoop-recordreaders/src/main/java/uk/bl/wa/hadoop/mapreduce/WebArchiveRecordReader.java#L185

(which would have made SOURCE_FILE_PATH be just the filename, which was likely why it was overidden.)

thomasegense commented 6 months ago

Source_file_path is used for playback in SolrWayback to load the payloads. The source_file (only file name) is also used for some logic, so I would prefer to keep both as they were in 3.2. I will make a PR soon.

thomasegense commented 1 month ago

@GilHoggarth This can be closed. The PR has been merged

GilHoggarth commented 1 month ago

Given the PR has already been merged, then happy to close. Thanks