Closed anjackson closed 5 months ago
Looks like I dropped SOURCE_FILE_PATH in https://github.com/ukwa/webarchive-discovery/commit/82f6c9db9a8542d63d1e55cd2fe0bbb08e5d5d5f#diff-6f827fcfa0b427773105ef9a05ccba9498c1b6f1d331624461ee6cf57cad5582L515 on Sep 29, 2022, but this may have been a mistake from the futzing about on the Hadoop 3 branch.
I think the modified line uses the full path and overwrites the value passed in and set at https://github.com/ukwa/webarchive-discovery/blob/9d07645953d252f415b1aad4dc43766258363ea5/warc-indexer/src/main/java/uk/bl/wa/indexer/WARCIndexer.java#L560
Looks like this arose because the Hadoop version passes in the full path, and only needs the full path. But the CLI version behaves differently, passing in the filename for SOURCE_FILE and then setting SOURCE_FILE_PATH from:
But I'm not sure where ArchiveRecordHeader.getReaderIdentifier is getting its information from or whether it's what we want. Ah, looks like it gets passed in during setup and read at
Where a File gets passed to the setup at:
Looks like the Hadoop clients get this backward, only passing in the filename when setting up the readers, e.g.:
(which would have made SOURCE_FILE_PATH be just the filename, which was likely why it was overridden.)
The source_file_path field is used for playback in SolrWayback to load the payloads. The source_file field (file name only) is also used for some logic, so I would prefer to keep both as they were in 3.2. I will make a PR soon.
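To illustrate the distinction between the two fields, here is a minimal sketch of deriving both values from one full path. It is not the code from the PR: the class `SourceFields`, its method names, and the example path are hypothetical, and it assumes the caller always has the full path available (as the Hadoop version does).

```java
// Hypothetical helper, for illustration only; not part of webarchive-discovery.
final class SourceFields {

    // Value for source_file: the file name only, with any directory part removed.
    static String sourceFile(String fullPath) {
        int slash = fullPath.lastIndexOf('/');
        return slash >= 0 ? fullPath.substring(slash + 1) : fullPath;
    }

    // Value for source_file_path: the full path, passed through unchanged.
    static String sourceFilePath(String fullPath) {
        return fullPath;
    }

    public static void main(String[] args) {
        String path = "/data/warcs/example.warc.gz"; // made-up example path
        System.out.println(sourceFile(path));     // prints example.warc.gz
        System.out.println(sourceFilePath(path)); // prints /data/warcs/example.warc.gz
    }
}
```

If only the bare file name is passed in, as in the Hadoop reader setup described above, both methods return the same string, which is the symptom discussed in this thread.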
@GilHoggarth This can be closed. The PR has been merged.
Given the PR has already been merged, happy to close. Thanks.
As far as I can tell, there is no longer any reference to filling out the source_file_path Solr field. Is it no longer used? Can we drop it?