internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Add Table of Contents text from IA to book pages #3237

Open tabshaikh opened 4 years ago

tabshaikh commented 4 years ago

When you search for a book and land on its book page, the page does not show the book's table of contents in About the Book, even if the book has one, e.g. https://openlibrary.org/books/OL10295090M/Search_Engine_Optimization

There is therefore an opportunity to improve our website and content by adding table of contents text (we have ~500k) to OpenLibrary.org book pages for every book where we already have it (from archive.org items), using BookReaderGetTextWrapper.php & scandata.xml.

The various forms of IA data are available from https://archive.org/download/searchengineopti00kris and the TOC can be derived from https://archive.org/download/searchengineopti00kris/searchengineopti00kris_scandata.xml (although there's also a likely looking https://archive.org/download/searchengineopti00kris/searchengineopti00kris_toc.xml which is protected, perhaps because this is a copyrighted book).
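The URLs above follow a visible pattern: the item identifier is both the directory name and the file-name prefix. A minimal sketch of that pattern is below. The helper names are hypothetical, and the `_djvu.xml` URL is an assumption extrapolated from the file name mentioned in this issue; nothing here checks whether a file exists or is protected.

```python
# Hypothetical helpers: build archive.org download URLs from an item identifier.
BASE = "https://archive.org/download"


def ia_file_url(identifier: str, suffix: str) -> str:
    """Return the download URL for one derived file of an archive.org item."""
    return f"{BASE}/{identifier}/{identifier}{suffix}"


def toc_source_urls(identifier: str) -> dict:
    """Collect the files discussed in this issue as possible TOC sources."""
    return {
        "scandata": ia_file_url(identifier, "_scandata.xml"),
        "djvu_text": ia_file_url(identifier, "_djvu.xml"),  # assumed location
        "toc": ia_file_url(identifier, "_toc.xml"),  # may be protected
    }


if __name__ == "__main__":
    for name, url in toc_source_urls("searchengineopti00kris").items():
        print(name, url)
```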

The _scandata.xml file has TOC pages tagged with <pageType>Contents</pageType> and that information can be used to extract the appropriate text from the searchengineopti00kris_djvu.xml text file, or, in the case of protected books such as this one, fetched through the closed-source API that the bookreader uses to get its Read Aloud text.

One significant complication is that converting raw page text into structured TOC data, in the syntax used by Open Library, is a non-trivial task. Using the raw text and letting the humans do the parsing cerebrally would be much easier.

Stakeholders

@mekarpeles @tabshaikh

BrittanyBunk commented 4 years ago

@tabshaikh so you're saying you already have the table of contents info in that file, but it's not on the OL site yet - like it's not automated to be added in? Interesting.

tabshaikh commented 4 years ago

Yes, the table of contents info already exists (the book reader reads this text), but nothing automatically shows it on the site.

xayhewalo commented 4 years ago

@tabshaikh Can you elaborate on

the books page does not the table of content of the book in About the Book even if the book has a table of contents.

Specifically, where is the table of contents stored for the link provided in the original issue description?

tfmorris commented 4 years ago

The various forms of IA data are available from https://archive.org/download/searchengineopti00kris and the TOC can be derived from https://archive.org/download/searchengineopti00kris/searchengineopti00kris_scandata.xml (although there's also a likely looking https://archive.org/download/searchengineopti00kris/searchengineopti00kris_toc.xml which is protected, perhaps because this is a copyrighted book).

The _scandata.xml file has TOC pages tagged with <pageType>Contents</pageType> and that information can be used to extract the appropriate text from the searchengineopti00kris_djvu.xml text file, or, in the case of protected books such as this one, fetched through the closed source API that the bookreader uses to get its Read Aloud text.
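As a rough illustration of the first step, here is a sketch that finds the leaves tagged `Contents`. The sample XML is made up for the example, and the layout it assumes (`page` elements carrying a `leafNum` attribute and a `pageType` child) should be checked against a real _scandata.xml before relying on it.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, heavily trimmed stand-in for a real _scandata.xml file.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="5"><pageType>Contents</pageType></page>
    <page leafNum="6"><pageType>Contents</pageType></page>
    <page leafNum="7"><pageType>Normal</pageType></page>
  </pageData>
</book>"""


def contents_leaves(scandata_xml: str) -> List[int]:
    """Return leaf numbers of pages whose pageType is 'Contents'."""
    root = ET.fromstring(scandata_xml)
    leaves = []
    # iter() walks the whole tree, so the nesting depth of <page> does not matter.
    for page in root.iter("page"):
        page_type = (page.findtext("pageType") or "").strip()
        leaf = page.get("leafNum")
        if page_type == "Contents" and leaf is not None:
            leaves.append(int(leaf))
    return leaves


if __name__ == "__main__":
    print(contents_leaves(SAMPLE_SCANDATA))
```

The resulting leaf numbers would then be used to pick the matching pages out of the text file, or to request them from the bookreader's text API for protected books.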

mekarpeles commented 4 years ago

Thanks @tfmorris this is right on. Related: #683 and #2384

tfmorris commented 4 years ago

One significant complication which I meant to mention, but left out, is that converting raw page text into structured TOC data, in the syntax used by Open Library, is a substantially non-trivial task. Using the raw text and letting the humans do the parsing cerebrally would be much easier.
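To give a feel for why, here is a naive sketch that splits raw TOC page text into (title, page number) pairs. It is an assumption-laden toy, not Open Library's TOC syntax: the regex and function name are hypothetical, it ignores nesting levels, and OCR noise, multi-line titles, or roman numerals would all defeat it.

```python
import re
from typing import List, Optional, Tuple

# Title, then optional dot leaders or spaces, then a trailing page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]*(?P<page>\d+)$")


def parse_toc_text(raw: str) -> List[Tuple[str, Optional[int]]]:
    """Split raw TOC text into (title, page) pairs; page is None if not found."""
    entries = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = TOC_LINE.match(line)
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
        else:
            # No trailing number: keep the line so a human can sort it out.
            entries.append((line, None))
    return entries


if __name__ == "__main__":
    sample = "Introduction ........ 1\n1. Keywords and Ranking 23\nPART TWO\n"
    for entry in parse_toc_text(sample):
        print(entry)
```

Lines without a recognizable page number are kept rather than dropped, which fits the suggestion of showing the raw text and leaving the real parsing to people.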

tabshaikh commented 4 years ago

@tfmorris Thank you so much for providing the elaboration, I updated the issue to reflect your comments :)

github-actions[bot] commented 7 months ago

Assignees removed automatically after 14 days.