Open tabshaikh opened 4 years ago
@tabshaikh so you're saying you already have the table of contents info in that file, but it's not on the OL site yet - like it's not automated to be added in? Interesting.
Yes, the table of contents info already exists (the book reader reads this text), but it is not automatically shown on the site.
@tabshaikh Can you elaborate on "the books page does not show the table of contents of the book in About the Book even if the book has a table of contents"? Specifically, where is the table of contents stored for the link provided in the original issue description?
The various forms of IA data are available from https://archive.org/download/searchengineopti00kris and the TOC can be derived from https://archive.org/download/searchengineopti00kris/searchengineopti00kris_scandata.xml (although there's also a likely looking https://archive.org/download/searchengineopti00kris/searchengineopti00kris_toc.xml which is protected, perhaps because this is a copyrighted book).
The _scandata.xml file has TOC pages tagged with <pageType>Contents</pageType> and that information can be used to extract the appropriate text from the searchengineopti00kris_djvu.xml text file, or, in the case of protected books such as this one, the text can be fetched through the closed source API that the bookreader uses to get its Read Aloud text.
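A rough sketch of the first step (finding which leaves are tagged as TOC pages) might look like the following. The sample XML is a hypothetical, trimmed-down stand-in for a real _scandata.xml; the `page` elements and the `leafNum` attribute are assumptions about the file layout, and only the `<pageType>Contents</pageType>` tag comes from the description above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, heavily trimmed stand-in for a *_scandata.xml file.
# The <page leafNum="..."> layout is an assumption for illustration.
SAMPLE_SCANDATA = """\
<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="5"><pageType>Title</pageType></page>
    <page leafNum="7"><pageType>Contents</pageType></page>
    <page leafNum="8"><pageType>Contents</pageType></page>
    <page leafNum="9"><pageType>Normal</pageType></page>
  </pageData>
</book>
"""


def contents_leaves(scandata_xml: str) -> List[int]:
    """Return the leaf numbers of pages tagged <pageType>Contents</pageType>."""
    root = ET.fromstring(scandata_xml)
    leaves = []
    for page in root.iter("page"):
        page_type = (page.findtext("pageType") or "").strip()
        leaf = page.get("leafNum")
        # Skip pages that are not TOC pages or that lack a leaf number.
        if page_type == "Contents" and leaf is not None:
            leaves.append(int(leaf))
    return leaves


print(contents_leaves(SAMPLE_SCANDATA))
```

The resulting leaf numbers would then be used to pick the matching pages out of the _djvu.xml text (or to request them from the bookreader text API for protected books); that step is not shown here.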
Thanks @tfmorris this is right on. Related: #683 and #2384
One significant complication which I meant to mention, but left out, is that converting raw page text into structured TOC data, in the syntax used by Open Library, is a substantially non-trivial task. Using the raw text and letting the humans do the parsing cerebrally would be much easier.
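To illustrate why this is hard, here is a minimal heuristic sketch that tries to split raw TOC page text into (title, page number) pairs. The sample text, the `parse_toc_lines` name, and the regular expression are all made up for illustration; this does not produce Open Library's TOC syntax and is not how the site does or would do it.

```python
import re
from typing import List, Optional, Tuple

# A title, optional dot leaders or spaces, then a trailing page number.
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]*\s(?P<page>\d+)$")


def parse_toc_lines(raw_text: str) -> List[Tuple[str, Optional[int]]]:
    """Split raw TOC text into (title, page) pairs; page is None if not found."""
    entries = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
        else:
            # No trailing number: keep the line as a title with no page.
            entries.append((line, None))
    return entries


# Hypothetical OCR-style text from a contents page.
SAMPLE_TOC_TEXT = """
Contents
1. Introduction ..... 3
2. Keywords    17
Index
"""

for title, page in parse_toc_lines(SAMPLE_TOC_TEXT):
    print(title, page)
```

Even this toy version breaks easily: a heading that ends in a number (say, a chapter called "Search Engines 101") would be read as a title plus page number, and it has no notion of nesting levels, multi-line entries, or OCR errors.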
@tfmorris Thank you so much for providing the elaboration, I updated the issue to reflect your comments :)
Assignees removed automatically after 14 days.
When searching for a book and landing on the book's page, the page does not show the table of contents of the book in About the Book, even if the book has a table of contents, e.g. https://openlibrary.org/books/OL10295090M/Search_Engine_Optimization
Therefore there is an opportunity here to improve our website and content by adding table of contents text (we have ~500k) to OpenLibrary.org book pages for every book where we already have it (from archive.org items), using BookReaderGetTextWrapper.php & scandata.xml
The various forms of IA data are available from https://archive.org/download/searchengineopti00kris and the TOC can be derived from https://archive.org/download/searchengineopti00kris/searchengineopti00kris_scandata.xml (although there's also a likely looking https://archive.org/download/searchengineopti00kris/searchengineopti00kris_toc.xml which is protected, perhaps because this is a copyrighted book).
The _scandata.xml file has TOC pages tagged with <pageType>Contents</pageType> and that information can be used to extract the appropriate text from the searchengineopti00kris_djvu.xml text file, or, in the case of protected books such as this one, the text can be fetched through the closed source API that the bookreader uses to get its Read Aloud text.
One significant complication is that converting raw page text into structured TOC data, in the syntax used by Open Library, is a non-trivial task. Using the raw text and letting the humans do the parsing cerebrally would be much easier.
Stakeholders
@mekarpeles @tabshaikh