Use existing xhtml files for syncabook

Audun97 commented 1 year ago

Hello, I managed to get it to work when using text files. However, to maintain formatting, I tried writing a Python script that adds id attributes to existing xhtml files. It seems to partly work: when I change the page in Colibrio Reader or change position by pressing on a word, playback goes to that part of the audio. The only thing missing is the highlighting of text. Do you know of any requirement for it to work? If we manage to get it working, I think it would be a nice addition to syncabook.

Here is the script:

from bs4 import BeautifulSoup, NavigableString
import re

def add_ids_to_spans(html_file):
    """
    This function takes a given HTML file and wraps each sentence clause in a span with a unique ID. The IDs are generated by
    incrementing a counter starting from 1 and formatted as "fXXX", where "XXX" is the zero-padded count. If a string of text in a
    p element contains at least one letter and no period, colon, comma, semicolon, exclamation mark or question mark, a new
    span is created to wrap around the text and assigned the next ID. If the text contains any of those punctuation marks,
    it is split into separate clauses at them and each clause is wrapped in a
    separate span with the next ID. The processed HTML file is saved with "_processed.xhtml" replacing the original file extension.

    Parameters:
    html_file (str): The path to the HTML file to be processed.

    Returns:
    None
    """

    # Define the output file name by splitting the input file name and adding "_processed.xhtml" to it
    outputfile = html_file.rsplit(".", 1)[0] + "_processed.xhtml"

    # Open the input file in read mode with UTF-8 encoding
    with open(html_file, "r", encoding="utf-8") as file:
        # Use BeautifulSoup to parse the file
        soup = BeautifulSoup(file, "lxml")

    # Get all the p elements in the file
    p_elements = soup.find_all("p")

    # Initialize a counter for the "id" values
    id_counter = 1

    # Loop through each p element
    for p in p_elements:

        # Get all the descendants of the current "p" element. Not feeding it directly into the next loop to prevent an endless loop
        children = list(p.descendants)

        # Initialize a span for the case that multiple child.strings will share a span
        current_span = None

        # Loop through each child of the "p" element
        for child in children:

            # Skip if the child is a NavigableString
            if isinstance(child, NavigableString):
                continue

            # Skip if the child has no string value. I.e. the child has multiple children of its own
            elif child.string is None:
                continue

            elif child.string is not None:

                # If there is no punctuation and there are alphabetical characters in the string, wrap the child in a new "span" tag
                if not re.search(r'[:;!,?.]', child.string) and re.search(r'[a-zA-Z]', child.string):
                    if current_span is None:
                        current_span = soup.new_tag("span")
                        current_span["id"] = f"f{str(id_counter).zfill(3)}"
                        id_counter += 1
                    child.wrap(current_span)

                # If there are alphabetical characters in the string, split the string into clauses and wrap each clause in a new "span" tag
                elif re.search(r'[a-zA-Z]', child.string):
                    current_span = None
                    sentences = re.split(r'(?<=[.?!;:,])\s+(?=[A-Za-z])', child.string)
                    child.clear()
                    for sentence in sentences:
                        span = soup.new_tag("span")
                        span["id"] = f"f{str(id_counter).zfill(3)}"
                        span.string = sentence + " "
                        child.append(span)
                        id_counter += 1

                else:
                    # Printing what was not caught
                    print(str(child.string))

    with open(outputfile, "w", encoding="utf-8") as file:
        # Write the processed soup object to the output file with no extra formatting
        file.write(soup.decode(formatter=None))

add_ids_to_spans(r"inputfile.xhtml")
print("process has completed successfully")
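For anyone trying to follow the splitting step, here is a minimal sketch that runs the same regular expression and ID format as the script above on a plain string, without BeautifulSoup. The helper names `split_clauses` and `make_id` are made up for this illustration and are not part of the script or of syncabook.

```python
import re

# Same pattern as in the script: split on whitespace that follows
# . ? ! ; : or , and is followed by an ASCII letter.
CLAUSE_BOUNDARY = re.compile(r'(?<=[.?!;:,])\s+(?=[A-Za-z])')


def split_clauses(text):
    """Split text into clauses, keeping the punctuation with each clause (hypothetical helper)."""
    return CLAUSE_BOUNDARY.split(text)


def make_id(counter):
    """Format a counter as the script does: 'f' plus a zero-padded number (hypothetical helper)."""
    return f"f{counter:03d}"


clauses = split_clauses("Hello, world. This is a test; it works!")
for number, clause in enumerate(clauses, start=1):
    print(make_id(number), clause)
```

Because the lookahead only accepts ASCII letters, a clause that starts with a digit, a quotation mark or an accented letter is not split off from the previous one.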

Audun97 commented 1 year ago

Ahhh, it is because my xhtml already references a CSS style sheet, so I presume a new one containing the highlighting style is not created. Does syncabook support custom CSS? If so, where should it go in the file structure?

Edit: I managed to get highlighting working by editing the CSS style sheet that the finished epub references.

The CSS style sheet just needs to contain this:

.-epub-media-overlay-active {
    background-color: #FFFF00;
}
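If you would rather patch an existing style sheet than edit it by hand, something like the following sketch might do. It only works on the CSS text; `ensure_highlight_rule` is a hypothetical helper, not something syncabook provides. It also assumes the reading system applies the `-epub-media-overlay-active` class, which as far as I know depends on the `media:active-class` metadata in the epub's package document.

```python
# The rule from the comment above, as a string.
HIGHLIGHT_RULE = """.-epub-media-overlay-active {
    background-color: #FFFF00;
}
"""


def ensure_highlight_rule(css_text):
    """Return css_text with the highlight rule appended, unless the class is already mentioned.

    Hypothetical helper: the check is a plain substring test, not a CSS parser.
    """
    if ".-epub-media-overlay-active" in css_text:
        return css_text
    # Make sure the appended rule starts on its own line.
    separator = "" if (not css_text or css_text.endswith("\n")) else "\n"
    return css_text + separator + HIGHLIGHT_RULE


print(ensure_highlight_rule("p { margin: 0; }"))
```

Running it twice on the same text leaves the style sheet unchanged the second time, so it should be safe to apply to every CSS file in the epub.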

r4victor commented 1 year ago

@Audun97 nice! We can leave the issue open so that people who want to use their xhtmls can find your script. I think this functionality can be integrated into syncabook, but it would require some more consideration. The script will work for xhtmls structured in a particular way – with text contents directly inside paragraphs. What if they are nested in spans? If I'm going to add this, I'd need to explore what the common xhtml epub structures are and whether it's possible at all to cover most cases.

This is only an issue if you want to preserve the xhtml structure/formatting. If we just want to use existing xhtml to produce a synced ebook, we can extract the text and produce new xhtml according to syncabook's structure. This is easy. But I don't think this is necessary, since people can probably find plaintext files with the same content as well.
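A rough sketch of the "just extract the text" route, using only the standard library's ElementTree rather than BeautifulSoup. The function name `paragraph_texts` and the sample document are invented for illustration. It assumes the xhtml is well-formed XML in the XHTML namespace, and it will fail on named entities such as `&nbsp;` that the XML parser does not know. What to do with the text afterwards is left out.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def paragraph_texts(xhtml_source):
    """Return the text of every non-empty <p>, with inline markup dropped (hypothetical helper)."""
    root = ET.fromstring(xhtml_source)
    paragraphs = []
    for p in root.iter(XHTML_NS + "p"):
        # itertext() yields the text of the element and all nested elements in document order.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>This is <em>formatted</em> text.</p>
<p><span>Second</span> paragraph.</p>
</body></html>"""

print("\n".join(paragraph_texts(sample)))
```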

So I suggest we leave the issue open and see if there are more people who want to use xhtml with formatting preserved or not. You can rename it to something like "Use existing xhtml files for syncabook".

Audun97 commented 1 year ago

"The script will work for xhtmls structured in a particular way – with text contents directly inside paragraphs. What if they are nested in spans?"

That is what current_span is supposed to keep track of. Upon testing more files, though, it hangs up. The intended check is this: if the children are in the same p and their strings contain no period, comma, semicolon, etc., they should be wrapped inside the same span.

At least every html/xhtml I have seen uses paragraph tags (p tags), so that is the basis the script works from. The descendants property, as in "p.descendants", is nice because it lets me go through all the children recursively.

dhouck commented 1 year ago

I canʼt get this to work with text contents directly inside paragraphs; it completely misses such text because it consists of NavigableStrings. It seems to work if the text is in at least one element nested inside the p and text is only in leaf elements (so a paragraph like "This example will not work." with only the word "not" inside a nested element will not work, or only the "not" will be picked up).

Does that match what youʼre seeing or am I missing something?

I think the way it should work is as follows:

1. Detect sentence boundaries in the concatenation of p.strings.
2. For each sentence (from the start to the first boundary, from the first boundary to the second boundary, etc., ending with the last boundary to the end of the paragraph, looking at only those bits that are non-empty):
- If the start of the sentence and the boundary punctuation have the same parent, then wrap a span around them.
- If not, use the minimum number of spans possible to exactly cover the sentence.
- Note that it may be difficult to handle some edge cases correctly. For example, both "This sentence should be doable in one element." and "This sentence should too." should work without splitting, even though in each the start and end might not seem to have the same parent.
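The boundary-detection part of the steps above can be sketched without BS4 at all, by treating the paragraph as a plain list of text-node strings (standing in for `list(p.strings)`). This is only a rough illustration under my own assumptions: the boundary rule is a guess, the name `sentence_segments` is made up, and it neither wraps anything nor merges slices that share a parent.

```python
import re

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
BOUNDARY = re.compile(r'(?<=[.?!])\s+')


def sentence_segments(strings):
    """Map each sentence of "".join(strings) to the text-node slices that cover it.

    Returns one list per sentence; each entry is (node_index, start, end),
    meaning strings[node_index][start:end] belongs to that sentence.
    """
    text = "".join(strings)

    # Sentence ranges in the concatenated text.
    ranges = []
    position = 0
    for match in BOUNDARY.finditer(text):
        ranges.append((position, match.start()))
        position = match.end()
    ranges.append((position, len(text)))

    # Offset of each text node within the concatenated text.
    offsets = []
    total = 0
    for s in strings:
        offsets.append(total)
        total += len(s)

    result = []
    for begin, end in ranges:
        if not text[begin:end].strip():
            continue  # skip empty bits, e.g. after the last boundary
        segments = []
        for index, s in enumerate(strings):
            low = max(begin, offsets[index])
            high = min(end, offsets[index] + len(s))
            if low < high:
                segments.append((index, low - offsets[index], high - offsets[index]))
        result.append(segments)
    return result


nodes = ["This example will ", "not", " work. Another one."]
for segments in sentence_segments(nodes):
    print(segments, repr("".join(nodes[i][a:b] for i, a, b in segments)))
```

Each slice would then need its own span unless neighbouring slices share a parent element, which is the part that still has to be worked out against the real tree.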

I donʼt know enough about BS4 to actually code that, although I might try over the weekend and get farther than I expect. There are probably edge cases Iʼm not thinking of, too, although probably most of them are things aeneas would also choke on.

Audun97 commented 1 year ago

Hey @dhouck, sorry for responding so late. The problem I had was that I tried to wrap the same string twice. For example <a>sentence</a>. Here the a tag has the string sentence as span. I have managed to fix it now. I have also made a fix for your case. Take a look at my repository: https://github.com/Audun97/audio-ebook-id-inserter

If you find any more edge cases let me know

sjabsr commented 1 year ago

Just a +1 for such a feature: we are a non-profit foundation lending audiobooks to blind people and others with conditions that prevent them from reading. We started producing epubs this year, syncing them to our human-read audiobooks, and two of our use cases would greatly benefit from this feature: 1) ebooks modified for accessibility, where we need to use the specific xhtml files for the sync, and 2) ebooks with syllable colorization (for dyslexic readers), where again we need to use the specific existing xhtml files.

I'm looking forward to trying this code!

dakomi commented 10 months ago

@Audun97 hey! Do you still have that fixed script available? The link you shared is dead, and I couldn't find it elsewhere on your github repos

Thanks!

r4victor / syncabook

Use existing xhtml files for syncabook #21