sciencehistory / chf-sufia

sufia-based hydra app

PDFs: Full-text search/indexing #1136

Open HKativa opened 6 years ago

HKativa commented 6 years ago

A related, big-picture issue that came up when discussing PDF works with stakeholders (see #994) is whether it would be possible to enable full-text searching and indexing of the PDFs.

Specific use case for Jim: ideally, a search for the title of a work included in the Neville catalog would bring up the Neville catalog in search results.

Specific use case for Lee: full-text keyword searching for content in an oral history transcript that isn't necessarily reflected in the metadata fields (a current feature of the oral histories microsite).

With regard to the oral histories, we discussed what purpose including them in the Digital Collections serves, namely discoverability relative to the rest of our collections. In that vein, a deep-dive search into the content of an oral history (and across transcripts) may be best served by the microsite, while the Digital Collections enhances discoverability and offers easy access and download.

At this time, I'm not sure whether either use case is compelling enough for us to pursue, depending on the work involved, but I'm documenting it here for purposes of discussion.

jrochkind commented 6 years ago

It's certainly possible, but it's definitely additional non-trivial work.

It may also do weird things to search results if most things in our collection do not have full text indexed but a small minority do: the items with full-text indexing may come up in search results disproportionately, since they have so many more words indexed than the majority of metadata-only records.
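To make that concern concrete, here is a minimal toy sketch in Python. It is not how our actual search stack works: the records, their text, and the queries are all made up for illustration, and it only checks whether a record matches a query at all, with no relevance ranking. It just shows that a record with many more indexed words matches more queries than metadata-only records do.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(records):
    """Map each term to the set of record ids whose indexed text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index


def search(index, query):
    """Return ids of records containing every query term (simple AND match)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = None
    for term in terms:
        hits = set(index.get(term, set()))
        results = hits if results is None else results & hits
    return results


# Hypothetical records: two are metadata-only, one also has full text indexed.
records = {
    "photo_1": "Portrait of a chemist in a laboratory",
    "photo_2": "Glass distillation apparatus",
    "catalog_pdf": (
        "Rare book catalog. Full text: an essay on the chemist and his "
        "laboratory, with notes on distillation, glass apparatus, and furnaces"
    ),
}

index = build_index(records)

# Hypothetical user queries.
queries = ["chemist", "laboratory", "distillation", "glass apparatus"]

# Count how many of the queries each record shows up for.
match_counts = {
    record_id: sum(record_id in search(index, q) for q in queries)
    for record_id in records
}

print(match_counts)
```

In this toy data the full-text record matches every query, while each metadata-only record matches only half of them. A real search engine also ranks results by relevance, which may soften or change the effect, so this is only an illustration of why it would need checking.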

It's definitely possible to do though, with some work.

Does the microsite currently even support full-text searching? I had not realized that, if it does!