w3c / publ-cg

EPUB 3 Community Group Repository

Official community for EPUB? #103

Open hmltn-0 opened 2 years ago

hmltn-0 commented 2 years ago

I’d like to write a script to render EPUB with Python.

Is there an official EPUB community mailing list where I can learn more about how EPUBs are rendered?

Thank you very much

GeorgeKerscher commented 2 years ago

Hello, this issue was sent to the Publishing at W3C Community Group. I would think that all you need to do is join that community group, and we could then help you in your efforts.

Best, George

dauwhe commented 2 years ago

Perhaps you could tell us more about what you want to do? EPUB consists largely of HTML, so rendering an EPUB would involve presenting HTML to the end user.

hmltn-0 commented 2 years ago

Thanks for your message.

I want to write my own Python script which extracts the text from an EPUB.

To do this I need to understand how the files are arranged. Are they in order in the directory?

Is it as simple as getting all “p” and “h” tags or are there certain kinds of tags that contain text and certain ones that do not?

Thank you

dauwhe commented 2 years ago

> To do this I need to understand how the files are arranged. Are they in order in the directory?

Not necessarily. The components of an EPUB and their reading order are described in the XML package document (usually a .opf file). A basic introduction to EPUB can be found in the EPUB 3 Overview.

The file format itself is defined in the EPUB 3 specification.
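
As a rough illustration of the point about the package file, here is a minimal sketch that reads the reading order from an EPUB using only the Python standard library. It assumes a well-formed, unencrypted EPUB: `META-INF/container.xml` points to the package document, the manifest maps ids to file paths, and the spine lists those ids in reading order. The function name `spine_paths` is made up for this example and is not part of any library.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import unquote

# XML namespaces used by container.xml and the package document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_paths(epub_file):
    """Return the zip paths of the content documents in reading order.

    `epub_file` may be a filesystem path or a binary file-like object.
    (Hypothetical helper for illustration only.)
    """
    with zipfile.ZipFile(epub_file) as zf:
        # Step 1: container.xml says where the package document lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")

        # Step 2: parse the package document itself.
        package = ET.fromstring(zf.read(opf_path))

    # Step 3: the manifest maps item ids to hrefs (relative to the package file).
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    # Step 4: the spine lists item ids in reading order.
    base = posixpath.dirname(opf_path)
    ordered = []
    for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
        href = unquote(manifest[itemref.get("idref")])
        ordered.append(posixpath.normpath(posixpath.join(base, href)))
    return ordered
```

Calling `spine_paths("book.epub")` (the file name is a placeholder) should give a list of paths that can be read from the zip one after another, regardless of how the files are named or ordered in the archive.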

> Is it as simple as getting all “p” and “h” tags or are there certain kinds of tags that contain text and certain ones that do not?

I would highly recommend using an HTML parsing library in Python. I have personal experience with Beautiful Soup.
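
To give a feel for what a parser-based approach looks like, here is a small sketch. Note that it uses the standard library's `html.parser` rather than Beautiful Soup, purely so that it runs without installing anything; the idea is the same. It assumes that readable text can appear in any element (not only "p" and "h" tags), and that the contents of `head`, `script`, and `style` should be ignored. The class name, the function name, and the list of block-level tags are choices made for this example, not anything defined by EPUB.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping head, script, and style content."""

    SKIP = {"head", "script", "style"}
    # Assumed set of elements after which a line break is inserted.
    BLOCK = {"p", "div", "li", "tr", "blockquote", "section",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br" and not self._skip_depth:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK and not self._skip_depth:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(markup):
    """Return the visible text of one content document, one block per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = "".join(parser.chunks).splitlines()
    # Collapse runs of whitespace and drop empty lines.
    cleaned = (" ".join(line.split()) for line in lines)
    return "\n".join(line for line in cleaned if line)
```

Each content document read from the EPUB can be decoded to a string and passed to `extract_text`. With Beautiful Soup, the rough equivalent would be parsing the markup and asking the resulting object for its text.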

Good luck!