teleshoes / ebook-audiobook-wordtiming


Trying to parse russian book... #1

Closed plotn closed 1 year ago

plotn commented 1 year ago

Hi Elliot! I am trying to parse my first book in Russian, and so far it goes very badly ))))

I finally changed code from:

my @words = grep {/\w/} split /[^a-z0-9']/, lc $sentence;

to:

my $sent = $sentence;
$sent =~ s/[[:punct:]]//g;
my @words = split /\s+/, $sent;

But I am not good at Perl and I think this can be done better. I mean the "..." symbol, the long and medium "-" dashes and so on should be eliminated too. Maybe the matching should also be case-insensitive.

So it got better, but then I ran into:

die "ERROR: could not parse diff line:\n$diffLine";

I changed it to:

print("ERROR: could not parse diff line:\n$diffLine");

and went on.

So if you want to try, the MP3 files and the book are here: https://drive.google.com/file/d/1M9COdWa9ZWRgLWKGF-LfuCOZMD_J5gSh/view?usp=sharing

The vosk model I used is: vosk-model-ru-0.42

plotn commented 1 year ago

I finally got a resulting wordtiming file, but I cannot test it: CR hangs at allSentences = mReaderView.getAllSentences();

so I decided to go to sleep (

plotn commented 1 year ago

Pushkin_Dubrovsky.zip

plotn commented 1 year ago

(this is the wordtiming file)

plotn commented 1 year ago

Cannot sleep, let's continue.

So I added:

p->_docview->savePosition();
p->_docview->clearSelection();
p->_docview->goToPage(0);
p->_docview->SetPos(0, false); //this line
p->_docview->onSelectionCommand( DCMD_SELECT_FIRST_SENTENCE, 0 );
while(p->_docview->nextSentence()){

to Java_org_coolreader_crengine_DocView_getAllSentencesInternal

and this block to OnSelectCommand:

if ( currSel.isNull() || currSel.getStart().isNull() ) {
    // select first sentence on page
    if ( pos.isNull() ) {
        clearSelection();
        return 0;
    }
    if ( pos.thisSentenceStart() )
        currSel.setStart(pos);
    moved = true;
}
if ( currSel.getStart().isNull() ) { //this block
    if (cmd == DCMD_SELECT_FIRST_SENTENCE) {
        bool res;
        currSel.setStart(pos);
        res = currSel.getStart().nextVisibleWordStart();
    } else {
        clearSelection();
        return 0;
    }
}
if (cmd==DCMD_SELECT_MOVE_LEFT_BOUND_BY_WORDS || cmd==DCMD_SELECT_MOVE_RIGHT_BOUND_BY_WORDS) {
    if (cmd==DCMD_SELECT_MOVE_RIGHT_BOUND_BY_WORDS)

and it doesn't hang now. But the reading and the selected sentences behave chaotically ((( too difficult for me (((

teleshoes commented 1 year ago

cool! skipping malformed diff lines is not a good sign for getting good output either, so im gonna take a look. gotta download the model and such, although i wont actually know how close it works.

when you get the *.wordtiming file, you should pick a line at random from the file and play that MP3 file at that spot, and see if the words in the file match up with the audio being read. if that part is working, then the error youre seeing is probably in my coolreader code, not in my perl processing code.

also, did you generate the *.sentenceinfo file using QT coolreader standalone app? that will ENORMOUSLY speed up initial processing time, which is probably why it was hanging.

teleshoes commented 1 year ago

ok, i downloaded the big ru model and vosk is running. ill let you know how it goes

teleshoes commented 1 year ago

biggest problem is that im matching against utf8-encoded bytes, instead of unicode code points. something like:

$sentence = decode_utf8($sentence);
my @words = grep {/\p{L}/} split /\W+/, $sentence;

but i have to actually do the intersection of \p{L} and single-quote char for contractions in english, and then NEGATE that set or do negative lookahead or something. ill look into it when i get a chance
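
A minimal sketch of the idea in Python rather than the project's Perl, so it is not the fix that went into the script. It first shows why matching on UTF-8 bytes misbehaves for Cyrillic (one letter is two bytes), then splits on anything that is not a Unicode word character or an apostrophe, keeping only tokens that contain a letter. The function name `split_words` is made up for this illustration, and typographic apostrophes are not handled.

```python
import re


def split_words(sentence: str) -> list:
    """Split a sentence into lowercase words for any alphabet.

    A word is a run of Unicode word characters (letters, digits,
    underscore) or ASCII apostrophes that contains at least one letter.
    """
    # In Python 3, \w is Unicode-aware for str patterns.
    tokens = re.split(r"[^\w']+", sentence.lower())
    # str.isalpha() is Unicode-aware too, so Cyrillic letters count.
    return [t for t in tokens if any(ch.isalpha() for ch in t)]


raw = "сокол".encode("utf-8")
# Byte length versus code point length of the same word.
print(len(raw), len(raw.decode("utf-8")))
print(split_words("Гол как сокол... Don't stop!"))
```

The byte/code-point mismatch is why a pattern such as `[^a-z0-9']` treats every byte of a Cyrillic word as a separator.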

teleshoes commented 1 year ago

ok, wordtiming perl script is fixed to support non-latin alphabets. i tested with the book you sent, and the wordtiming file produced matches perfectly with all the words i tried. i have not yet tried loading it into coolreader, but ill do that soon too.

(here is how i checked: i select a random line in *.wordtiming, get the word and the next few words, get the timing and the audio file, sound out the cyrillic like a very illiterate child, and then compare it to what he reads when i play it with mpv -ss TIMING AUDIO_FILE. every time, its exactly the word i expect him to say, even though i have ZERO idea what it is that he is saying)
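
The spot check above could be scripted roughly as follows. This is a sketch, assuming each wordtiming line has the form SECONDS,WORD,AUDIO_FILE, as in the sample line "304.92,сокол,D0101.mp3" quoted later in this thread. The helper names are hypothetical, and the code only builds the mpv command (using mpv's `--start` option) instead of running it.

```python
import random


def parse_wordtiming_line(line: str):
    """Parse 'SECONDS,WORD,AUDIO_FILE' into (float, str, str)."""
    # maxsplit=2 keeps any commas inside the file name intact.
    start, word, audio_file = line.rstrip("\n").split(",", 2)
    return float(start), word, audio_file


def spot_check_command(lines, rng=random):
    """Pick a random entry and build an mpv command that starts there."""
    start, word, audio_file = parse_wordtiming_line(rng.choice(lines))
    return word, ["mpv", f"--start={start}", audio_file]


lines = ["304.92,сокол,D0101.mp3"]
word, cmd = spot_check_command(lines)
print("expect to hear:", word)
print("run:", cmd)
```

Passing the returned list to `subprocess.run` would start playback at that word.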

note: it now requires perl 5.18+ (5.36+ to remove the warnings).

edit: also, be sure to undo BOTH changes you made. the diff one is causing a lot of troubles in the wordtiming.

teleshoes commented 1 year ago

also: this is the order of the audiofiles in the wordtiming file you sent here.

Дубровский. Том 1. Глава 1.mp3
Дубровский. Том 1. Глава 2.mp3
Дубровский. Том 1. Глава 3.mp3
Дубровский. Том 1. Глава 4.mp3
Дубровский. Том 1. Глава 5.mp3
Дубровский. Том 1. Глава 6.mp3
Дубровский. Том 1. Глава 7.mp3
Дубровский. Том 1. Глава 8.mp3
Дубровский. Том 2. Глава 10.mp3
Дубровский. Том 2. Глава 11.mp3
Дубровский. Том 2. Глава 12.mp3
Дубровский. Том 2. Глава 13.mp3
Дубровский. Том 2. Глава 14.mp3
Дубровский. Том 2. Глава 15.mp3
Дубровский. Том 2. Глава 16.mp3
Дубровский. Том 2. Глава 17.mp3
Дубровский. Том 2. Глава 18.mp3
Дубровский. Том 2. Глава 19.mp3
Дубровский. Том 2. Глава 9.mp3

i THINK this is wrong, though i am not certain. anyway, this is the lexicographic sort order you would get from doing * in bash, but it looks like the first part of book2 is at the end. (it is only by good luck that the files sorted MOSTLY correctly). be sure when you invoke ebook-audiobook-wordtiming that the order of the audio files is correct on the command line.
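
To illustrate the ordering problem, here is a small Python sketch (not part of the script) that compares plain string sorting with a "natural" sort that treats digit runs as numbers. The `natural_key` helper is a made-up name for this example.

```python
import re


def natural_key(name: str):
    """Sort key that compares digit runs as numbers, not as text."""
    # re.split with a capturing group alternates text and digit runs,
    # so every odd index holds a run of digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


files = [
    "Дубровский. Том 2. Глава 10.mp3",
    "Дубровский. Том 2. Глава 9.mp3",
    "Дубровский. Том 1. Глава 8.mp3",
]
print(sorted(files))                   # plain order: "Глава 10" before "Глава 9"
print(sorted(files, key=natural_key))  # numeric order: 8, 9, 10
```

Renaming the files with zero-padded numbers avoids the issue without any custom sorting.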

plotn commented 1 year ago

> also, did you generate the *.sentenceinfo file using QT coolreader standalone app? that will ENORMOUSLY speed up initial processing time, which is probably why it was hanging.

No, I didn't. KnownReader failed to do this (though I copied your code). It failed with the message "cannot find ttf fonts", so I made it not die there, but I think something else was wrong as well. I did not test your version, because there are too many forks of CR and I wanted to do it with mine. So, no. I decided to get a working case with pandoc first.

But! KR itself generated the sentenceinfo (after the fix I described above). I attach it: Pushkin_Dubrovsky.zip

plotn commented 1 year ago

About the fix. I noticed that TTS does not always start correctly, because the engine cannot select a sentence, e.g. when there is a picture at the top of the page. So I made the fix and it seems to work well: the engine selects the sentence right after the picture.

I think this fix helped "getAllSentences" too, because it all revolves around the same selection / next / previous sentence logic. So I got the sentenceinfo and it looks correct (it took quite a short time to generate).

plotn commented 1 year ago

> i THINK this is wrong

Yes, it is! The files need renaming; I'll do that. Do you sort the files when I choose them with a mask? I mean "Дубровский. Том*". Or does Linux do that for us? Can we rely on it?

plotn commented 1 year ago

So the main problem is that the whole thing fails: sound is not synced with book text (I'll get your fresh version, rename the files and check again). I'll do this tomorrow. Let me know if you get to it earlier.

teleshoes commented 1 year ago

> TTS does not always start correctly, because the engine cannot select a sentence

yes, this is a long-standing bug in TTS. its especially true for the first sentence of the book. my code did not introduce this bug, or fix it. the actual sentence-select behavior should be the same for actual google TTS and for audiobook-tts

> Do you sort the files when I choose them with a mask?

no. if you do a glob pattern like * on a command line, the script receives a list of files exactly the same as if you carefully typed an order in. the script does not change that carefully selected order, and it should not.

> sound is not synced with book text

try checking the wordtiming file without using coolreader, the way i described. its possible that vosk does not work as well with russian as english, or that there is another latin-script dependency that i need to replace somewhere in coolreader itself or something that is breaking the actual TTS part.

if the wordtiming file is good, the bug is probably not in perl or vosk. if the wordtiming file is NOT good, the bug is DEFINITELY in my perl script, or in vosk.

if the wordtiming file is good, and coolreader still is not good but is kinda CLOSE, then the MOST LIKELY problem is that seeking in MP3 files does not work well on android. convert the files to flac (which will have the same quality as the MP3s but take up more room) and re-generate the wordtiming file, and try again.

plotn commented 1 year ago

> yes, this is a long-standing bug in TTS.

See my code (DCMD_SELECT_FIRST_SENTENCE); possibly I fixed it (maybe not very elegantly, but it seems to work).

> try checking the wordtiming file without using coolreader

I need to test your new code, then I'll add some clarifications. I'll be back

teleshoes commented 1 year ago

ok, cool. btw, my fix removed underscores, so im going to add them back in. im in the process of re-generating all my wordtiming files to ensure that supporting multiple scripts didnt change latin in any way

teleshoes commented 1 year ago

there is at least one latin-specific heuristic in coolreader that could change, though it is only for slight improvements in alignment between sentenceinfo and wordtiming, not for generating wordtimings or anything.

WordTimingAudiobookMatcher.wordsMatch():

            //expensive calculation, but relatively rarely performed
            if(word1.matches(".*[a-z].*") || word2.matches(".*[a-z].*")){
                //if there is at least one letter in the word: compare only letters
                word1 = word1.replaceAll("[^a-z]", "");
                word2 = word2.replaceAll("[^a-z]", "");
            }else{
                //otherwise: compare only numbers
                word1 = word1.replaceAll("[^0-9]", "");
                word2 = word2.replaceAll("[^0-9]", "");
            }
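
With the quoted `[a-z]` patterns, two Cyrillic words contain no matching letters, so both are reduced to empty strings by the digits-only branch. A script-agnostic version of the same heuristic might look like the Python sketch below. It is not the author's fix. The lowercasing and the final equality check are assumptions, since the quoted fragment does not show them, and `words_match` is a made-up name.

```python
def words_match(word1: str, word2: str) -> bool:
    """Compare two words by letters only, or by digits only if neither has letters."""
    w1, w2 = word1.lower(), word2.lower()
    # Mirror the quoted logic: if EITHER word has a letter, compare letters.
    has_letter = any(ch.isalpha() for ch in w1 + w2)
    # str.isalpha / str.isdigit are Unicode-aware, unlike [a-z].
    keep = str.isalpha if has_letter else str.isdigit

    def strip(word: str) -> str:
        return "".join(ch for ch in word if keep(ch))

    return strip(w1) == strip(w2)


print(words_match("сокол,", "Сокол"))  # punctuation and case are ignored
print(words_match("сокол", "кол"))     # different words stay different
print(words_match("1833.", "1833"))    # number-only words compare by digits
```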

plotn commented 1 year ago

> latin-specific heuristic

will you fix it?

teleshoes commented 1 year ago

when i get a chance, sure, but its not actually important

plotn commented 1 year ago

So :) I have downloaded all the fresh code and renamed the audio files to a better structure (D0101.mp3 and so on). I did all the recognition again. I opened the book and all is fine at the start, but when reading starts, the sentences switch one after another without any pause. And this is reproducible.

teleshoes commented 1 year ago

im not quite sure i understand what you're saying is wrong, but try this first: convert your MP3 files to FLAC. there is NO support for accurate MP3 seeking in android media libraries. in order to fix this in coolreader, we would need to read+decode the entire file before playback, which would greatly slow down playback and use a lot of RAM. some fixed-bitrate MP3 files work perfectly fine, but most do not.

have you checked the wordtiming file WITHOUT coolreader? that's the most important diagnostic step to isolate what is wrong

plotn commented 1 year ago

> convert your MP3 files to FLAC

not needed, all is correct

> and use a lot of RAM

not a problem for an algorithm, but sounds good

> have you checked the wordtiming

Yes, I did. See for example "304.92,сокол,D0101.mp3": I seeked to this position and heard "гол как сокол". The vosk model is very accurate and clear.

So why don't you try to run everything yourself? I gave you all the files and explanations. I think the problem is in the coolreader / knownreader sentence / wordtiming matching: the algorithm switches sentences quickly one after another. That is the problem, not Perl or Python.

I can provide any help with the files I gave, and any explanations.

teleshoes commented 1 year ago

> not needed, all is correct

the reason to use FLAC is a problem in android, with seeking, not in perl. the MP3 issue can definitely cause the problem where it rapidly moves to the next sentence, because it believes the current sentence to be in the future.

> I think the problem is in...sentence / wordtiming matching

this is probably true. i still need to load it up on the actual app and try it, but ive been a bit busy recently. ill take a look soon. EDIT: haha, yes, this is almost definitely the problem, i forgot i split sentences in java as well. the fix is non-trivial. (although you should definitely still convert to FLAC, as that is likely to be another problem)

teleshoes commented 1 year ago

(im pushing code, but this is not done yet. i will post here when its ready)

teleshoes commented 1 year ago

i updated this project, and branch audiobook_in_tts in teleshoes:coolreader. both are necessary, and you must re-generate wordtiming files AND sentenceinfo files AND build a new coolreader apk (you do NOT need to regenerate the cached vosk JSON files, which is the most time-consuming part)

ive been testing on linux with the java directly, because its hard to test on android. im gonna try your sample ebook now directly on coolreader on android, and see if it seems to work

teleshoes commented 1 year ago

@plotn ok, tested on my actual phone, works perfectly. unfortunately, you DO have to use FLAC and not MP3, or it will always be a half-sentence off. in flac, its exactly precise.

to avoid regenerating the vosk json, you can just use the mp3 to generate the wordtiming+sentenceinfo, then do a find/replace in the wordtiming to replace s/mp3$/flac/, and copy just the fb2+flacs+wt+si (and not mp3s) to your device
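
The find/replace step can be sketched in Python as below, assuming the audio file name is the last field on each wordtiming line and the file uses plain "\n" line endings. The second sample line is made up to show that only the trailing extension changes.

```python
import re


def retarget_to_flac(wordtiming_text: str) -> str:
    """Rewrite a trailing 'mp3' on every line to 'flac'."""
    # re.MULTILINE makes $ match at the end of each line, like s/mp3$/flac/.
    return re.sub(r"mp3$", "flac", wordtiming_text, flags=re.MULTILINE)


sample = "304.92,сокол,D0101.mp3\n305.40,mp3,D0101.mp3\n"
print(retarget_to_flac(sample))
```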

teleshoes commented 1 year ago

pushkin_dubrovsky.sentenceinfo.txt pushkin_dubrovsky.wordtiming.txt

for comparison after you generate it. if you convert to flac and name your flac files pushkin_dubrovsky_01.flac (or modify the filenames in my wordtiming to match your filenames), you should get exactly the same output.

plotn commented 1 year ago

Ok! Thank you. Now it is my turn, I will be back.

plotn commented 1 year ago

So, the result. I converted to FLAC, ran recognition again, generated wordtiming and sentenceinfo; all is good. Reading starts fine. Close to the end of Chapter 1, the word matching fails starting from this sentence: [screenshot: Screenshot_20230804_000455_KnownReader premium]

When I switch to Chapter 2, 3 and so on, reading always starts from this sentence and the sentences begin to "jump". When I go back to an earlier position, everything is OK.

teleshoes commented 1 year ago

chances are good that the problem is in the wordtiming. there are three stages where the problem can occur:

1) vosk - the actual speech-to-text is never perfect, although it doesnt need to be perfect to get perfect results. if it, however, fails on enough words, close enough together, you can get bad behaviors
2) ebook vs audiobook - usually, the ebook includes things the audiobook does not, like table of contents, and sometimes the audiobook includes things the ebook does not have. occasionally, a book will have footnotes which the audiobook reads in-line, and the ebook puts at the end.
3) sentenceinfo - this is the bit of coolreader that matches wordtiming to the TTS sentence lists. if you use the same version of coolreader on your PC to generate the wordtiming file that you use to read the ebook on your phone, this SHOULD be 100% correct, since the wordtiming file is actually generated FROM the sentenceinfo. however, i did just make major changes to it to support cyrillic, so it could easily have a bug.

i will look at that point in the file now and see if i notice anything wrong

teleshoes commented 1 year ago

@plotn actually, it works perfectly for me, from the page you put. i checked a bunch of sentences, back 10 pages and forward 10 pages (middle of chapter 2), and also every 30 pages or so from the start to the end. everything seems perfect. so, my new hypotheses are:

1) you dont have cr3 installed on your command path, so wordtiming is generated using pandoc (pandoc gives wildly different results compared to using coolreader)
2) the version of cr3 you have installed on your command path was not rebuilt since my fix yesterday, and so does not include the fix where i left out numbers (there is a number-only word immediately after your sentence, and so this may be the cause)
3) the order of the files is wrong again somehow

paste your wordtiming and sentenceinfo file and ill take a look

EDIT: oh crap, is the executable name from cr3qt different in knownreader? i can add that to the path-checker. if it is different, that is 100% the problem.

EDIT2: in case that IS the problem, im going to remove pandoc as a fallback, and only use pandoc if the user explicitly types --pandoc on the command line. please paste the executable that you use for the desktop version of knownreader (its cr3 in cr3qt)

EDIT3: knownreader DOES use cr3 as the exec name. still, make sure its on your path and that youre not using pandoc
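
The behaviour proposed in EDIT2 (no silent fallback, pandoc only when requested with --pandoc) could look roughly like this Python sketch. It is not the script's actual Perl code. `choose_extractor` is a hypothetical name, and the cr3 path in the demo is made up.

```python
import shutil


def choose_extractor(use_pandoc: bool, which=shutil.which) -> str:
    """Return the word extractor to use, never falling back silently."""
    if use_pandoc:
        # Only reached when the user explicitly asked for pandoc.
        return "pandoc"
    if which("cr3") is None:
        raise RuntimeError(
            "cr3 not found on PATH; install it or pass --pandoc explicitly"
        )
    return "cr3"


print(choose_extractor(True))
# Inject a fake lookup so the demo does not depend on the local PATH.
print(choose_extractor(False, which=lambda name: "/usr/local/bin/cr3"))
```

Failing loudly when cr3 is missing makes it obvious which extractor produced a given wordtiming file.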

plotn commented 1 year ago

Yes, I am using pandoc. I tried to use CR3 from KR (with your changes), but it failed and I decided to use the simpler way. Why not? Look, the whole mechanism looks unstable. It is OK if:

If these situations are not frequent and can be explained by the differences between them, then, I repeat, it is OK. But! There should not be the whole unusable situations like mine: switching at some point to endless sentence jumping, and always starting the audio from one single "incorrect" moment. We shouldn't assume we live in greenhouse conditions; I think we should add some more heuristics to the process.

plotn commented 1 year ago

my files are: timings.zip

teleshoes commented 1 year ago

1) i just compiled and used knownreader CLI in cr3qt to generate sentenceinfo file. it worked perfectly.

2) "There should not be the whole unusable situations": haha, if you dont want "whole unusable situations", do not use pandoc.

let me explain:

so, this process already has many tolerances and heuristics added, or it would not work at all. the trick is to balance accuracy with fault tolerance.

in this process, there are three difficult variables:

HOWEVER, matching the EBOOK to the EBOOK should not be one of those variables, and THAT is what is failing for you. it is also the hardest variable to adjust for, because if it doesnt align VERY VERY closely, you can get very bad results. of course, if you use coolreader to produce the list of ebook words, then match against the list of words in coolreader, it will always be 100% correct.

pandoc parses the words very differently from coolreader. i used pandoc at first because i did not realize that the process of extracting words from an EPUB was actually hard, and i also did not realize how easy it would be to add this feature to coolreader.

im going to leave the option for pandoc, but only if you use --pandoc arg, because this program, in theory, should be able to work with other ebook readers. i will NOT support using pandoc WITH coolreader, however, as it is just a waste of time.

but if you need any help making coolreader/knownreader work with this script, ill be more than happy to assist

teleshoes commented 1 year ago

also, i looked at your wordtiming file. you are DEFINITELY using the OLD version of the ebook-audiobook script. make sure you update to the newest, AND switch to cr3

EDIT: by old, i mean you are using the version from july 31st, not the version from august 2nd.

plotn commented 1 year ago

Okay, could you do one thing for me? Can you make it possible to generate the parsed book (instead of using pandoc) not from CR-QT but from within the Android Java code, and to specify this file with a command-line parameter? I really don't want to compile and use the QT version; I'd rather add a button to the book's About dialog that calls your function to save this file. Then I'll try again with the updated script.

teleshoes commented 1 year ago

1) how would you invoke the android app from the perl script? i dont have any idea how to do this.
2) cd knownreader; mkdir build; cd build; cmake ..; make -j8; sudo make install

building coolreader on linux is SO much easier than building the android app..

note: the relevant code that parses the EPUB is in crengine C++, not in android java or in cr3qt C++. the cmdline interface i added for WordTimingAudiobookMatcher.java runs pure java code, and nothing like that will work to run the C++

note2: just to make sure; you are aware that i dont use the QT app right? i just use a tiny little cmdline interface that i stuck on the QT version because that is what compiles by default

plotn commented 1 year ago

Ok, I'll try. Give me some more time, please. Thank you for the help.

plotn commented 1 year ago

So cr3 (kr edition, not yours) says to me:

plotn@plotn-thinkbook:~/github/ebook-audiobook-wordtiming/src$ cr3 --loglevel=DEBUG --get-sentence-info /home/plotn/github/knownreader/CRAudioBook/Pushkin_Dubrovsky.fb2 /home/plotn/.cache/ebook-audiobook-wordtiming/ebook-coolreader-sentenceinfo/pushkin_dubrovsky-922e1b62934103cd783b658006cc1907.sentenceinfo
2023/08/06 15:25:22.0537 WARN Changing log level from 3 to 4
2023/08/06 15:25:22.0541 INFO main()
2023/08/06 15:25:22.0575 DEBUG 0 font files found
Fatal Error: Cannot open font file(s) .ttf
Cannot work without font
Continuing...
Warning: Ignoring XDG_SESSION_TYPE=wayland on Gnome. Use QT_QPA_PLATFORM=wayland to run on Wayland anyway.
2023/08/06 15:25:22.1501 INFO Using translation file cr3_ru_RU from dir /usr/local/share/cr3/i18n/
Ошибка сегментирования (образ памяти сброшен на диск)

(Segmentation fault (core dumped))

plotn commented 1 year ago

But!!! I generated the sentenceinfo file from within the Android app, then used it in parsing, and I could proceed! The book is working well!!!!!!

So I decided to write an article on habr.com (I already have a couple of articles there) giving a detailed explanation, and we could continue to make the whole process better.

teleshoes commented 1 year ago

interesting, your kr fork works fine for me. maybe its because i still use X and not wayland, or maybe the font is an issue. maybe in the future i can add a way to export the sentenceinfo from android as a workaround for not being able to use cr3qt, or add a new target that uses just the bits i need from crengine and is easier to compile and less brittle (though i have never had the issues you seem to have with qt)

anyway, im glad its working!

plotn commented 1 year ago

Can you add a new option for cr3 mode? If it is set, do not generate the .sentenceinfo file but read it from the specified path. I'll download it from Android and use it in recognition.