pwhipp / dsnac

This is a highly extendable open online book archive service.

Add search capability to the main search so it can search inside the whole collection #21

Closed: oldrepository closed this issue 9 years ago

oldrepository commented 9 years ago

The main search feature should be able to search for a term inside the books and display the pages where that term appears. When the user clicks on a result link, the book should open at that page with the term highlighted.

pwhipp commented 9 years ago

As discussed, there is a lot of work involved here: getting the OCR working, setting it up as an automated process that runs over the scanned books, and storing the resulting searchable information in a scheme that allows an immediate 'click' through to the page.

I'm not sure how long this will all take - it is days rather than hours. To keep control of the issues, I've moved this issue to its own milestone 'In book search' and will add other issues under that milestone to break this down appropriately.

I'll initially spend a few hours getting tesseract working and building a model that allows the data it generates to be searched and linked to the scanned pages. This will generate the other issues in the milestone, and it should also give you some test information to review so that we can assure ourselves that tesseract's output is of suitable quality.

As discussed, this will all be English only to begin with but I'll make sure that the code can easily support Punjabi as and when we find appropriate tesseract modules for it.
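
To illustrate the kind of model described above, here is a minimal sketch of a page-level index that maps each word in the OCR output to the pages it appears on. It is not the project's actual code: the names `PageRef`, `PageIndex`, `add_page` and `search` are hypothetical, and the OCR text is assumed to have been produced already (for example by Tesseract) and passed in as a plain string.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRef:
    """Hypothetical pointer to one scanned page of one book."""
    book_id: str
    page_number: int


class PageIndex:
    """Hypothetical in-memory inverted index: word -> pages containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_page(self, book_id, page_number, ocr_text):
        # ocr_text is assumed to be the OCR output for a single page image.
        ref = PageRef(book_id, page_number)
        # Simple lower-cased word tokenization; only English text is assumed.
        for word in re.findall(r"\w+", ocr_text.lower()):
            self._postings[word].add(ref)

    def search(self, term):
        # Single-word, case-insensitive lookup, ordered by book then page.
        refs = self._postings.get(term.lower(), set())
        return sorted(refs, key=lambda r: (r.book_id, r.page_number))


index = PageIndex()
index.add_page("book-a", 1, "The history of the Punjab")
index.add_page("book-a", 2, "A second page about history")
index.add_page("book-b", 7, "Nothing relevant here")

for ref in index.search("History"):
    print(ref.book_id, ref.page_number)
```

Because every hit carries a book identifier and a page number, a search result has what it needs to link straight to the scanned page. A real deployment would more likely store this in a database or a search engine than in memory.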

pwhipp commented 9 years ago

As we are using page images and not page text, highlighting the search term will be very difficult and take a great deal of effort.

I've implemented a solution that searches and delivers the page.

A compromise for the highlighting might be to open a 'BookPage' entry or the OCR text. The BookPage could display the image of the page and the OCR text (with the match highlighted when possible). The drawback is that a lot of the OCR text needs manual editing, or it won't look very good.

I suggest that you don't worry about highlighting the text for the time being.
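
For reference, highlighting a match in the OCR text (as opposed to the page image) is comparatively simple. The following is a rough sketch only, assuming the OCR text is shown as HTML. The function name `highlight` and the use of a `<mark>` tag are illustrative choices, not something taken from the project.

```python
import html
import re


def highlight(ocr_text, term):
    """Return HTML-escaped OCR text with each match of term wrapped in <mark>."""
    if not term:
        return html.escape(ocr_text)
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(ocr_text):
        # Escape the text between matches so OCR noise cannot inject markup.
        parts.append(html.escape(ocr_text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(ocr_text[last:]))
    return "".join(parts)


print(highlight("The Guru & the book", "guru"))
```

This only marks literal, case-insensitive matches. It does nothing about OCR errors, so a misread word on the page would simply not be highlighted.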

oldrepository commented 9 years ago

Let's leave highlighting the search term for now. I like the idea of displaying book page images instead of the search term. Is it possible to display a thumbnail of the book page in the search results? Also, clicking on a search result currently takes you to the book details page, where you have to click "Read" to open the book before it goes to the page with the search result. Can we skip the "Read" step so that the result opens the book directly at the page containing the search term?

hqpr commented 9 years ago

Done with 'skipping' the page.
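
One way to skip the intermediate "Read" step is for each search result to link straight to the reader at the matching page. The sketch below shows the idea only: the URL layout `/books/<id>/read?page=N` and the function name `reader_url` are assumptions for illustration, not the routes this project actually uses.

```python
from urllib.parse import quote, urlencode


def reader_url(book_id, page_number, term=None):
    """Build a hypothetical deep link that opens a book at a given page."""
    query = {"page": page_number}
    if term:
        # Carry the search term along so the reader could use it later.
        query["q"] = term
    return f"/books/{quote(str(book_id), safe='')}/read?{urlencode(query)}"


print(reader_url("42", 17))
print(reader_url("my book", 3, "guru nanak"))
```

Keeping the page number in the query string means the same link works from search results, bookmarks, or a thumbnail in the results list.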