p0n1 / epub_to_audiobook

EPUB to audiobook converter, optimized for Audiobookshelf
MIT License

Create a sample search replace file #89

Open haydonryan opened 1 month ago

haydonryan commented 1 month ago

Loving this app. Thank you all for the great work.

It would be good to crowdsource some of the word replacements.

There are a bunch of clear ones based on the books I read: "$1 million" reads as "dollar one million", and "2010" reads as "two thousand ten".

I'm currently doing these changes on the command line. Happy to contribute mine; I just need to confirm the format.

haydonryan commented 1 month ago

I also wonder if we should consider having two files: an included one that has been thoroughly vetted, and one for custom replacements.

p0n1 commented 1 month ago

Hey @haydonryan. Thanks for reaching out. I'm not sure whether the word replacement you mentioned is similar to this PR we have merged: https://github.com/p0n1/epub_to_audiobook/pull/80

> There are a bunch of clear ones based on the books I read. $1 million reads as dollar one million 2010 reads as two thousand ten.

Besides, I'm just curious: which TTS engine are you using?

haydonryan commented 1 month ago

Oh yes, good point: it's definitely going to be specific to the TTS engine. I'm currently using piper, but have been thinking about trying https://github.com/coqui-ai/TTS. As this isn't currently a supported option, I'd export the text files before passing them on.

Still looking for the best free TTS system. I like piper but the lack of GPU acceleration is frustrating.

haydonryan commented 1 month ago

So the readme is helpful, but what regular expression syntax is it using? E.g. in my script to run epub_to_audiobook I have:

# numbers will be in the form:
# 19 20 or 19o4
ls *.txt | xargs sed -i 's/2000/two thousand/g'
ls *.txt | xargs sed -i 's/200\([1-9]\)/two thousand and \1/g'
ls *.txt | xargs sed -i 's/\([0-9]\{2\}\)0\([0-9]\)/\1o\2/g'
ls *.txt | xargs sed -i 's/\([0-9]\{2\}\)\([0-9]\{2\}\)/\1 \2/g'

And some involve punctuation, e.g.:

ls *.txt | xargs sed -i 's/Jr\.’s/juniors/g'
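
For comparison, the sed commands above translate to Python's `re` syntax roughly as follows. This is only a sketch: the `RULES` list and the `apply_rules` helper are hypothetical names for illustration, not anything from epub_to_audiobook. The main syntactic difference is that Python writes groups and repetition counts without backslashes, i.e. `( )` and `{2}` rather than sed's `\( \)` and `\{2\}`.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the sed commands above.
# Order matters: the specific year rules must run before the generic split.
RULES = [
    (r"2000", "two thousand"),
    (r"200([1-9])", r"two thousand and \1"),
    (r"([0-9]{2})0([0-9])", r"\1o\2"),         # 1904 -> 19o4
    (r"([0-9]{2})([0-9]{2})", r"\1 \2"),       # 2010 -> 20 10
    (r"Jr\.’s", "juniors"),                    # literal dot, escaped
]


def apply_rules(text):
    """Apply every rule in order to the whole text (hypothetical helper)."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_rules("In 1904, 2000, 2005 and 2010, Jr.’s sales grew."))
```

Like the sed version, these patterns have no word boundaries, so they would also fire inside longer digit runs such as page counts or prices.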

haydonryan commented 1 month ago

I dug into the code. It seems to be calling re.sub, so it's using Python regex syntax.
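
A minimal sketch of how a file of `search==replace` lines could be fed to `re.sub`, to make the syntax question concrete. The `==` separator is the one used in this thread; the function names `load_rules` and `replace_all`, and the choice to skip blank lines and `#` comments, are assumptions for illustration and not the project's actual code.

```python
import re


def load_rules(text):
    """Parse 'pattern==replacement' lines into compiled rules (assumed format)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed convention: skip blanks and comments
        pattern, sep, replacement = line.partition("==")
        if not sep:
            continue  # no separator on this line, ignore it
        rules.append((re.compile(pattern), replacement))
    return rules


def replace_all(rules, text):
    """Run every rule over the text using Python regex semantics."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


SAMPLE = r"""
# dollar amounts
\$([0-9]+) billion==\1 billion dollars
"""

print(replace_all(load_rules(SAMPLE), "$70 billion"))
```

Backreferences such as `\1` in the replacement are interpreted by `re.sub` itself, so the replacement side can be passed through unchanged.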

# Search and replace from books I'm listening to:
\$([0-9]+.[0-9])\sbillion==\1 billion dollars

This didn't work as a search and replace file.

However, this did:

import re
test="$70 billion"
re.sub(r"\$([0-9]+) billion", r"\1 billion dollars", test)
re.sub(r"\$([0-9]+.*[0-9]*)\sbillion", r"\1 billion dollars", test)
'70.3  billion dollars'

Also, I don't think it should be one regex per line; it's highly likely that you'll get more than one match.

E.g.:

"Carls Jr spent $3.1 Billion on advertising." has two items that would not get spoken right.

Better to run the search and replace over the whole file.
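
A rough sketch of what running the replacements over the whole text could look like, using the sentence above. The two patterns and the `normalize` helper are illustrative assumptions, not rules from the project; the dollar pattern accepts an optional decimal part and ignores case so that "Billion" is matched too.

```python
import re

# Hypothetical rules: each is applied to the entire text, so one sentence
# can be changed by several rules.
RULES = [
    (re.compile(r"\bJr\b"), "Junior"),
    (
        re.compile(r"\$([0-9]+(?:\.[0-9]+)?)\s+(million|billion)", re.IGNORECASE),
        r"\1 \2 dollars",
    ),
]


def normalize(text):
    """Apply every rule in turn to the full text, not line by line."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize("Carls Jr spent $3.1 Billion on advertising."))
```

Because each rule sees the output of the previous one, the order of the rules can matter when patterns overlap.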

p0n1 commented 1 month ago

> '70.3  billion dollars'

Why would this lead to '70.3 billion dollars'?

haydonryan commented 1 month ago

Sorry, I fudged the example; the example above would give '70 billion dollars'.