govCMS / GovCMS7

Current stable release of the main Drupal 7 GovCMS distribution, with releases mirrored at https://www.drupal.org/project/govcms
https://www.govcms.gov.au/
GNU General Public License v2.0

As a site user I want to be able to search for attachments by keywords so that I can find relevant content #139

Open gollyg opened 8 years ago

gollyg commented 8 years ago

Adding the ability to index file attachments and return results via Solr is a core requirement for government entities that host forms and information sheets on their website.

Currently Solr will index Drupal content, but not the text within the documents.

This can be addressed using https://www.drupal.org/project/search_api_attachments.

fiasco commented 8 years ago

In addition to #140, this should be considered part of how search works on govCMS, so it would be good to be able to enable it with search by default.

gollyg commented 8 years ago

The most recent pull request:

gollyg commented 8 years ago

@fiasco In relation to concerns about the file sizes, the module allows you to configure a file size limit on what will be indexed. As this is stored in the conf array, it can be hard-coded into a settings file, and is therefore not overridable.
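
For illustration, a limit like that could be locked in via Drupal 7's `$conf` array in settings.php. The variable name below is an assumption for the sake of the example; check the module's settings form for the actual key it uses:

```php
<?php
// settings.php (Drupal 7): hard-code the maximum attachment size to index.
// NOTE: 'search_api_attachments_max_file_size' is an assumed variable name;
// verify the actual key exposed by the Search API attachments module.
$conf['search_api_attachments_max_file_size'] = 20 * 1024 * 1024; // 20MB
```

Because values set in `$conf` override anything saved through the admin UI, the limit cannot be changed by site editors.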

fiasco commented 8 years ago

How does the module deny attachments from a UX standpoint?

gollyg commented 8 years ago

As in, does it tell the user that the file will not be included in search due to its size? Or what is the settings form like?

It does not stop people uploading large files - it will just ignore them during indexing.
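
A rough pseudocode sketch of that behaviour (this is not the module's actual code; the limit variable and helper function are illustrative assumptions):

```php
// Pseudocode: during indexing, oversized files are silently skipped.
// The upload itself is never blocked, so publishers get no warning.
foreach ($node_attachments as $file) {
  if ($file->filesize > $max_indexable_bytes) {
    continue; // no extracted text for this file; the node is still indexed
  }
  $extracted[] = tika_extract_text($file); // hypothetical extraction helper
}
```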

aleayr commented 8 years ago

I imagine that while we may not get full-text searching on documents larger than a specified size, the file itself doesn't get ignored entirely, and that if you typed in the file name or similar, it would still show up in the search results?

This won't exclude the file from the results completely, will it? Just the ability to full-text search the contents of the file?

@invisigoth?

invisigoth commented 8 years ago

@aleayr my understanding is that text extraction will not be attempted for files larger than the set limit. Also, not many file formats can be parsed by Tika in a streaming fashion.

Also see https://docs.acquia.com/articles/issues-large-attachments-and-solr-search. It states "The Acquia Search infrastructure's Tika extractor does not allow extracting text from files larger than 20MB."

pandaskii commented 8 years ago

From the Acquia article (https://docs.acquia.com/articles/issues-large-attachments-and-solr-search):

> Drupal's Apache Solr Attachments and Search API attachments modules need to keep all of the data returned by Tika in memory. The data for a single document needs to be kept in PHP memory, which can cause out-of-memory errors if that data is too large. Indexing operations normally work on batches of nodes or entities. Each of these entities could potentially have large amounts of extracted text, increasing the possibility of out-of-memory errors (even if each individual document is small).

Regarding full-text search, per Acquia's guidance, @aleayr:

> Even if all of the preceding steps work, Solr itself has a hard limit on the number of tokens per field, defined in solrconfig.xml, for example:
>
> ```xml
> <maxFieldLength>20000</maxFieldLength>
> ```
>
> If the extracted text has more words than the maxFieldLength, Solr will truncate the indexed data to this amount. Even smaller PDF files (such as a 5MB PDF) can contain far more than 20,000 words.

gavintapp commented 8 years ago

We have an agency using govCMS that would like this function added to govCMS. They have a sizeable body of content in PDF form. It's not cost-effective to convert this content to HTML, but it is still important that it remains available, and discoverable, on their website.

I believe they could accept some PDF files being excluded from search indexing because their file size exceeds 20MB, or indexing being limited to the first 20,000 words.

Initially, we could add these limitations to their user documentation. Eventually, it would be better to warn content publishers when a PDF they are publishing exceeds these limits.

aleayr commented 8 years ago

@invisigoth @jozhao Would like to discuss this one further for future inclusion. Do you have any issues with this?

fiasco commented 8 years ago

The biggest concern with providing this feature is its reliability. The SaaS platform supports three search engines: Google Search (coming soon), Apache Solr (Acquia Search) and Funnelback. We don't have any way of knowing how well this feature would perform across all three search integrations. We do, however, know there are size limitations on Acquia Search that would cause user experience problems when PDFs fail to index.

We've not had continued interest in this topic, so we're going to remove it from triage for now. But if someone does want to pick this up again, it should start with writing some test cases and scenarios to outline the expectations this feature should meet. I suspect it won't be as simple as adding the Search API Attachments module; it will also require validation of uploaded attachments and proving a base level of capability across all three search services.

mgdhs commented 8 years ago

The Funnelback module won't be affected by this. Funnelback crawls the rendered site like a browser, so it picks up linked files. The module is about integrating results into the site via an API. I'd also be surprised if anyone is using the default Funnelback module, as it is quite limited and broken with the current API version. I'd guess the Google Search module is the same.

This is a big issue for us moving to Solr. We've made the decision that we won't include files in our Solr search for the first release.

At a minimum, this limitation should be documented and made clear for any departments looking at using govCMS.

Podgkin commented 8 years ago

We are also definitely still interested in this.

We could potentially add tasks to our backlog to write scenarios and test scripts for our Solr use cases, but setting up test cases for multiple search engines we are never going to use is beyond our scope.

To echo @mgdhs Google CSE already indexes attachments, so it seems like only Solr actually needs this module.

Assuming we did take on all that work of testing uploads, size limits, memory limits and text limits on all search engines with all search modules, what then? If this module doesn't work with Funnelback or Google, does that mean Solr users can never search attachments?