martinreynaert / TICCL

Text-Induced Corpus Clean-up
GNU General Public License v3.0

Running the system #1

Closed jmokoistinen closed 8 years ago

jmokoistinen commented 8 years ago

I am trying to run the method on Finnish text files, but I still have some configuration to do. Where do I get TICCL-stats and the other missing files (see the output below)? I have TICCL-stats.cxx and the other .cxx files in TICCLtools/src.

perl TICCLops.PICCL.pl TICCL.Template.fin.config
TICCL_OPTSin: abcmdef TXT ticcl/ data/int/fin/fin.aspell.dict.c10.d2.confusion empty.txt xml 100000000 ticcl/data/int/fin/fin.aspell.dict.lc.chars ticcl/input/ data/int/fin/fin.aspell.dict 2 output/fin TICCLtest 3 fin ticcl/ticcltools/src 12 5 50
TICCL_OPTSin2:
MODE: abcmdef
TEXTTYPE: TXT
ROOTDIR: ticcl/
CHARCONFUS: data/int/fin/fin.aspell.dict.c10.d2.confusion
KHC: empty.txt
EXT: xml$
ARTIFRQ: 100000000
ALPH: ticcl/data/int/fin/fin.aspell.dict.lc.chars
INPUTDIR: ticcl/input/
DIR:
LEX: data/int/fin/fin.aspell.dict
LD: 2
OUTPUTDIR: output/fin
PREFIX: TICCLtest
RANK: 3
LANG: fin
TOOLDIR: ticcl/ticcltools/src
THREADS: 12
MINLENGTH: 5
MAXLENGTH: 50
OUT1:
OUT2: output/fin/zzz/TICCL/TICCLtest
TICCLops version CLARIN-NL 0.2
RUN_FoLiA-stats1TXT: output/fin/zzz/TICCL/TICCLtest
sh: 1: ticcl/ticcltools/src/TICCL-stats: not found
mv: cannot stat ‘output/fin/zzz/TICCL/TICCLtest.wordfreqlist.tsv’: No such file or directory
RUN_TICCL-unk: output/fin/zzz/TICCL/TICCLtest >> output/fin/zzz/TICCL/TICCLtest.tsv
RUN_TICCL-anahash: output/fin/zzz/TICCL/TICCLtest.tsv.clean
RUN_TICCL-indexerNT: output/fin/zzz/TICCL/TICCLtest.tsv.clean
TIME AFTER PROCESSING CORPUS: 1464785898 minus 1464785898 = 0 >> MIN: 0 >> HOURS: 0
RUN_TICCL-LDcalc: output/fin/zzz/TICCL/TICCLtest
RUN_TICCL-rank2: output/fin/zzz/TICCL/TICCLtest

jmokoistinen commented 8 years ago

Actually, I need to compile ticcltools first, but it requires ticcutils, and the current problem when configuring ticcutils is:

...
checking for working vfork... (cached) yes
./configure: line 15990: syntax error near unexpected token `,AC_MSG_ERROR'
./configure: line 15990: `ACX_PTHREAD(,AC_MSG_ERROR([We need pthread support!]))'

martinreynaert commented 8 years ago

Dear Mika,

I hereby CC: my colleague Maarten.

He tells me you are now probably missing autoconf-archive.

He is also working on providing handier installation procedures. I will inform you about these as soon as they are available.

Thank you for trying TICCL!

Best regards,

Martin


proycon commented 8 years ago

I have now added TICCL as an optional extra in LaMachine, our software distribution (https://proycon.github.io/LaMachine). It will automatically compile and install all dependencies, including ticcltools, ticcutils, libfolia, etc...

Note that you'll have to explicitly opt-in for TICCL as it downloads some larger dependencies (lexicons for all languages are included automatically). Consult the Updating & Extra Software section of the aforementioned LaMachine website.

jmokoistinen commented 8 years ago

Problem opening: .tsv.clean.confuslist.indexNT. What is still missing?

Current output when running from the command line:

perl TICCLops.PICCL.pl TICCL.Template.fin.config
TICCL_OPTSin: abcmdef TXT ticcl/ data/int/fin/fin.aspell.dict.c10.d2.confusion empty.txt xml 100000000 ticcl/data/int/fin/fin.aspell.dict.lc.chars input/ data/int/fin/fin.aspell.dict 2 output/fin TICCLtest 3 fin ticcltools/src 12 5 50
TICCL_OPTSin2:
MODE: abcmdef
TEXTTYPE: TXT
ROOTDIR: ticcl/
CHARCONFUS: data/int/fin/fin.aspell.dict.c10.d2.confusion
KHC: empty.txt
EXT: xml$
ARTIFRQ: 100000000
ALPH: ticcl/data/int/fin/fin.aspell.dict.lc.chars
INPUTDIR: input/
DIR:
LEX: data/int/fin/fin.aspell.dict
LD: 2
OUTPUTDIR: output/fin
PREFIX: TICCLtest
RANK: 3
LANG: fin
TOOLDIR: ticcltools/src
THREADS: 12
MINLENGTH: 5
MAXLENGTH: 50
OUT1:
OUT2: output/fin/zzz/TICCL/TICCLtest
TICCLops version CLARIN-NL 0.2
RUN_FoLiA-stats1TXT: output/fin/zzz/TICCL/TICCLtest
RUN_TICCL-unk: output/fin/zzz/TICCL/TICCLtest >> output/fin/zzz/TICCL/TICCLtest.tsv
RUN_TICCL-anahash: output/fin/zzz/TICCL/TICCLtest.tsv.clean
unable to open alphabet file: ticcl/data/int/fin/fin.aspell.dict.lc.chars
RUN_TICCL-indexerNT: output/fin/zzz/TICCL/TICCLtest.tsv.clean
problem opening anagram hash file: output/fin/zzz/TICCL/TICCLtest.tsv.clean.anahash
TIME AFTER PROCESSING CORPUS: 1464848876 minus 1464848868 = 8 >> MIN: 0.133333333333333 >> HOURS: 0.00222222222222222
RUN_TICCL-LDcalc: output/fin/zzz/TICCL/TICCLtest
problem opening: .tsv.clean.confuslist.indexNT
RUN_TICCL-rank2: output/fin/zzz/TICCL/TICCLtest

martinreynaert commented 8 years ago

Dear Mika,

The file ticcl/data/int/fin/fin.aspell.dict.lc.chars does not seem to be in place.

You should have these files:

$ ls -l data/int/fin
total 12628
-rw-r--r-- 1 mre ticc_users 10444287 Jul 7 2014 fin.aspell.dict
-rw-r--r-- 1 mre ticc_users 2478604 Jul 7 2014 fin.aspell.dict.c10.d2.confusion
-rw-r--r-- 1 mre ticc_users 732 Jul 7 2014 fin.aspell.dict.lc.chars

and they should be located in the directory where TICCLops.PICCL.pl is run.
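A quick way to verify this before starting a run is to check for the three files from the run directory. The following is only a minimal sketch in Python, not part of TICCL: the function name `missing_files` is hypothetical, and the relative paths are taken from the listing in this comment, so they must be adapted if the config file uses a different prefix (the failing run asked for ticcl/data/int/fin/...).

```python
from pathlib import Path

# Paths copied from the `ls -l data/int/fin` listing; treating them as
# relative to the run directory is an assumption of this sketch.
REQUIRED_FILES = [
    "data/int/fin/fin.aspell.dict",
    "data/int/fin/fin.aspell.dict.c10.d2.confusion",
    "data/int/fin/fin.aspell.dict.lc.chars",
]


def missing_files(run_dir, required=REQUIRED_FILES):
    """Return the required relative paths that are absent or empty under run_dir."""
    base = Path(run_dir)
    missing = []
    for rel in required:
        path = base / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run this from the directory in which TICCLops.PICCL.pl is started.
    for rel in missing_files("."):
        print(f"missing or empty: {rel}")
```

If the script prints nothing, all three files are present and non-empty at the expected relative paths.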

Hope this helps!

Best,

Martin


proycon commented 8 years ago

If you use LaMachine, it installs these files in the TICCL source directory (src/TICCL/data)

martinreynaert commented 8 years ago

Dear Mika,

Never mind!

Do not hesitate to contact me if you want any further information or advice about TICCL!

Yours,

Martin

On 02/06/16 16:25, Mika Koistinen wrote:

Oh sorry wrong forum :)

On Thu, Jun 2, 2016 at 5:22 PM, Mika Koistinen j.m.o.koistinen@gmail.com wrote:

Hello again

This is what my training looks like. Any comments on how to improve it? I currently use 537939 words for (1) initializing the language model and (2) initializing a font for Finnish, and for phase (3), training the font, I currently use just 3 images.

Writing transcription output to

train_output/all_transcriptions/finnish/fk01442_1836-01-01_3_7_iter-1_transcription.txt Writing comparisons to

train_output/all_transcriptions/finnish/fk01442_1836-01-01_3_7_iter-1_comparisons.txt Writing alto output to

train_output/all_transcriptions/finnish/fk01442_1836-01-01_3_7_iter-1.alto.xml Multiple languages being used (1), so an html file is being generated to show language switching. Writing html output to

train_output/all_transcriptions/finnish/fk01442_1836-01-01_3_7_iter-1.html Training iteration 1 of 3, document 2 of 3: sample_images/finnish/fk03945_1861-08-01_8_1.jpg 2016/06/02 15:35:14 Evaluation diplomatic text found at sample_images/finnish/fk03945_1861-08-01_8_1.txt No evaluation normalized text found at sample_images/finnish/fk03945_1861-08-01_8_1_normalized.txt (This is only a problem if you were trying to provide a gold normalized transcription to check accuracy.) Extracting text line images from sample_images/finnish/fk03945_1861-08-01_8_1.jpg Extractor returned 30 line images Batch: 0 Initializing EmissionModel 2016/06/02 15:35:44 Rebuilding cache 2016/06/02 15:35:44 Rebuild emission cache: 889246ms Estimated emission cache size: 0.652gb Done rebuilding cache 2016/06/02 15:50:33 Constructing forwardTransitionModel Using OnlyOneLanguageCodeSwitchLM and CharacterNgramTransitionModel Ready to run decoder Decoding..............................Done running decoder Ready to run increment counts Increment counts: 932ms Decode: 400ms

88Фго) IЛ’ºг„ ⅗

m

( П ( ( П ( ( ( ( (

П

( ( П ( П ( (

(

(

Writing transcription output to

train_output/all_transcriptions/finnish/fk03945_1861-08-01_8_1_iter-1_transcription.txt Writing comparisons to

train_output/all_transcriptions/finnish/fk03945_1861-08-01_8_1_iter-1_comparisons.txt Writing alto output to

train_output/all_transcriptions/finnish/fk03945_1861-08-01_8_1_iter-1.alto.xml Multiple languages being used (1), so an html file is being generated to show language switching. Writing html output to

train_output/all_transcriptions/finnish/fk03945_1861-08-01_8_1_iter-1.html Training iteration 1 of 3, document 3 of 3: sample_images/finnish/fk03945_1868-01-01_1_5.jpg 2016/06/02 15:50:39 Evaluation diplomatic text found at sample_images/finnish/fk03945_1868-01-01_1_5.txt No evaluation normalized text found at sample_images/finnish/fk03945_1868-01-01_1_5_normalized.txt (This is only a problem if you were trying to provide a gold normalized transcription to check accuracy.) Extracting text line images from sample_images/finnish/fk03945_1868-01-01_1_5.jpg Extractor returned 39 line images Batch: 0 Initializing EmissionModel 2016/06/02 15:51:17 Rebuilding cache 2016/06/02 15:51:17 Rebuild emission cache: 939599ms Estimated emission cache size: 0.666gb Done rebuilding cache 2016/06/02 16:06:57 Constructing forwardTransitionModel Using OnlyOneLanguageCodeSwitchLM and CharacterNgramTransitionModel Ready to run decoder Decoding................................Done running decoder Ready to run increment counts Increment counts: 810ms Batch: 1 Initializing EmissionModel 2016/06/02 16:07:03 Rebuilding cache 2016/06/02 16:07:03 Rebuild emission cache: 195331ms Estimated emission cache size: 0.149gb Done rebuilding cache 2016/06/02 16:10:19 Constructing forwardTransitionModel Using OnlyOneLanguageCodeSwitchLM and CharacterNgramTransitionModel Ready to run decoder Decoding.......Done running decoder Ready to run increment counts Increment counts: 253ms Decode: 1869ms

uőд3

(

П

( ( (

(

(

m

( ( ( ( (

tºг

(

(

⅛ (

Writing transcription output to

train_output/all_transcriptions/finnish/fk03945_1868-01-01_1_5_iter-1_transcription.txt Writing comparisons to

train_output/all_transcriptions/finnish/fk03945_1868-01-01_1_5_iter-1_comparisons.txt Writing alto output to

train_output/all_transcriptions/finnish/fk03945_1868-01-01_1_5_iter-1.alto.xml Multiple languages being used (1), so an html file is being generated to show language switching. Writing html output to

train_output/all_transcriptions/finnish/fk03945_1868-01-01_1_5_iter-1.html Update font parameters: 3010ms Writing updated font to train_output/font/retrained_iter-1_batch-1.fontser Clearing font parameter statistics. Completed Batch: Iteration 1, batch 1: avg joint log prob: Infinity 2016/06/02 16:10:30 Iteration 1 avg joint log prob: Infinity

train_output/all_transcriptions/finnish/eval_iter-1_diplomatic.txt
All evals:

Document: sample_images/finnish/fk01442_1836-01-01_3_7.jpg
CER, keep punc: 0.9726345083487941
CER, keep punc, allow f->s: 0.9726345083487941
CER, remove punc: 0.9975272007912958
CER, remove punc, allow f->s: 0.9975272007912958
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 1.0
WER, remove punc, allow f->s: 1.0

Document: sample_images/finnish/fk03945_1861-08-01_8_1.jpg
CER, keep punc: 0.9803695150115473
CER, keep punc, allow f->s: 0.9803695150115473
CER, remove punc: 0.9934406678592725
CER, remove punc, allow f->s: 0.9934406678592725
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 1.0
WER, remove punc, allow f->s: 1.0

Document: sample_images/finnish/fk03945_1868-01-01_1_5.jpg
CER, keep punc: 0.9793866264454499
CER, keep punc, allow f->s: 0.9793866264454499
CER, remove punc: 0.9963215974776668
CER, remove punc, allow f->s: 0.9963215974776668
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 0.996309963099631
WER, remove punc, allow f->s: 0.996309963099631

Macro-avg total eval:
CER, keep punc: 0.9774635499352637
CER, keep punc, allow f->s: 0.9774635499352637
CER, remove punc: 0.9957631553760783
CER, remove punc, allow f->s: 0.9957631553760783
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 0.998769987699877
WER, remove punc, allow f->s: 0.998769987699877

Training iteration: 2 2016/06/02 16:10:30 Training iteration 2 of 3, document 1 of 3: sample_images/finnish/fk01442_1836-01-01_3_7.jpg 2016/06/02 16:10:30 Batch: 0 Initializing EmissionModel 2016/06/02 16:10:32 Rebuilding cache 2016/06/02 16:10:32 Rebuild emission cache: 988541ms Estimated emission cache size: 0.823gb Done rebuilding cache 2016/06/02 16:27:00 Constructing forwardTransitionModel Using OnlyOneLanguageCodeSwitchLM and CharacterNgramTransitionModel Ready to run decoder Decoding................................Done running decoder Ready to run increment counts Increment counts: 1253ms Batch: 1 Initializing EmissionModel 2016/06/02 16:27:08 Rebuilding cache 2016/06/02 16:27:08 Rebuild emission cache: 646383ms Estimated emission cache size: 0.495gb Done rebuilding cache 2016/06/02 16:37:54 Constructing forwardTransitionModel Using OnlyOneLanguageCodeSwitchLM and CharacterNgramTransitionModel Ready to run decoder Decoding...................Done running decoder Ready to run increment counts Increment counts: 566ms Decode: -82ms

Thanks! BR, Mika


jmokoistinen commented 8 years ago

My input is TXT or XML files. How and where do I get the corrected results for each of these files? Below are the current contents of my output/TICCL directory:

TICCLtest.lst
TICCLtest.tsv
TICCLtest.tsv.clean
TICCLtest.tsv.clean.anahash
TICCLtest.tsv.clean.corpusfoci
TICCLtest.tsv.clean.ldcalc
TICCLtest.tsv.clean.ldcalc.ranked
TICCLtest.tsv.punct
TICCLtest.tsv.unk

BR, Mika

martinreynaert commented 8 years ago

Dear Mika,

Great question!

Sorry: we do not currently have an actual 'corrector' for plain text or non-FoLiA XML files.

How we correct FoLiA XML files is well explained in the FoLiA manual at https://proycon.github.io/folia/ . I could easily send you an example file, if you wish. The module that does this is FoLiA-correct, which you have in the ticcltools directory.

The file TICCLtest.tsv.clean.ldcalc.ranked now holds the final result of the TICCL OCR post-correction process. It should contain mainly non-words on the left, each linked to ranked correction candidates on the right, with the best correction ranked first. In my recent LREC paper I explain the rest of the annotations.

Correcting your texts should then be a matter of finding the words on the left and replacing them with the first candidate on the right. Or perhaps retaining the 'non-word' and adding up to x correction candidates in some annotation, say: ... NON-WORD [ correctioncandidate1 | correctioncandidate2 | correctioncandidate3 ]? We could quite easily offer you some such tool.
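The replacement idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a TICCL tool: the thread does not document the layout of the .ranked file, so the `#` separator and the column positions below are hypothetical defaults that must be checked against a real file, and the function names `load_candidates` and `correct_text` are invented for this sketch. It also assumes that candidates for a word appear in the file best first.

```python
import re
from collections import defaultdict


def load_candidates(lines, sep="#", variant_col=0, candidate_col=2):
    """Map each variant (non-word) to its candidates, kept in file order.

    sep, variant_col and candidate_col describe an ASSUMED file layout;
    adjust them after inspecting the actual .ranked file.
    """
    candidates = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) <= max(variant_col, candidate_col):
            continue  # skip lines that do not fit the assumed layout
        variant, candidate = fields[variant_col], fields[candidate_col]
        if candidate not in candidates[variant]:
            candidates[variant].append(candidate)
    return dict(candidates)


def correct_text(text, candidates, annotate=False, max_candidates=3):
    """Replace known variants by their first candidate, or annotate them."""

    def fix(match):
        word = match.group(0)
        options = candidates.get(word)  # exact, case-sensitive lookup
        if not options:
            return word
        if annotate:
            return f"{word} [ {' | '.join(options[:max_candidates])} ]"
        return options[0]

    # \w+ is Unicode-aware for str patterns, so letters such as ä and ö match.
    return re.sub(r"\w+", fix, text)
```

With `annotate=False` each listed non-word is replaced by its first candidate; with `annotate=True` the non-word is kept and followed by up to `max_candidates` candidates in the bracket notation suggested above. The lookup is an exact string match, so differences in case or attached punctuation between the text and the file would need extra handling.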

As for non FoLiA XML... Does your Library perhaps rely on Alto XML? If so, I am currently looking into an Alto XML correction system that has been developed at the National Library here in the Netherlands. I need some more time on this, though.

Thank you!

Yours,

Martin


jmokoistinen commented 8 years ago

Yes, we use the Alto XML format here at the National Library of Finland. We would be interested in hearing more about, and trying, the method from the National Library of the Netherlands.

Another question: is there a way to create our own character confusion file (and fin.aspell.dict.lc.chars), for example from our texts and ground truth?

BR, Mika

martinreynaert commented 8 years ago

Dear Mika,

I will get back to you about the Alto. I have not yet had the chance to take a look at it all myself.

Yes, you can create your own character confusion files. For this you use the following tool: TICCL-lexstat.

$ /exp/sloot/usr/local/bin/TICCL-lexstat -h
Usage: TICCL-lexstat [options] dictionary
TICCL-lexstat will create a lowercased character frequency list from a dictionary file.
-h this message
--diac produces an extra diacritics confusion file (extension .diac)
--clip 'clip' truncates the character file at frequency 'clip'
--LD depth 1, 2 or 3. (default 2): The characterlength of the confusions. When LD=0 only a frequency list is generated.
--all : full output. Show ALL variants in the confusions file. Normally only the first is shown.
-V show version

This should be included in your package.

The idea is that first you do a trial run without using any of the parameters. This will give you a frequency list of the characters in your lexicon.

Some characters may have entered through loan words and may have a very low frequency. You can exclude these from your alphabet by using the --clip option. They will still be handled properly by TICCL, but the overall search space and number of character confusions will be reduced.
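To make the effect of clipping concrete, here is a small Python sketch of a lowercased character frequency list with a frequency cut-off. It is not the TICCL-lexstat implementation: the function name `char_frequencies` is hypothetical, and it assumes that a character is kept when its frequency is at least `clip`, which may differ in detail from what the tool does.

```python
from collections import Counter


def char_frequencies(words, clip=0):
    """Count lowercased characters over all words.

    Returns (character, frequency) pairs, most frequent first, keeping only
    characters whose frequency is at least `clip` (assumed semantics).
    """
    counts = Counter()
    for word in words:
        counts.update(ch for ch in word.lower() if not ch.isspace())
    return [(ch, n) for ch, n in counts.most_common() if n >= clip]


if __name__ == "__main__":
    lexicon = ["talo", "tali", "quiz"]  # toy lexicon with one loan word
    print(char_frequencies(lexicon))          # full character list
    print(char_frequencies(lexicon, clip=2))  # rare characters clipped
```

In the toy lexicon the characters q, u and z occur only once, through the loan word, so a cut-off of 2 removes them from the alphabet while leaving the frequent characters in place.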

You obtain the character confusion list by specifying the LD.

Good luck!

Best regards,

Martin


jmokoistinen commented 8 years ago

Hi, is there a maximum size or row count for the confusion file? I tried running with a much larger confusion file than before and got empty results at the final step.

Thanks!

martinreynaert commented 8 years ago

Dear Mika,

I am not aware of any such limit.

Can you please give us some more information?

What output files did you get? The final step before correction is TICCL-rank. Was that output file simply empty?

How much bigger is your character confusion file now? How many lines is it? You were trying Levenshtein distance 3, or even 4 maybe?
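One way to gather this information is to list every output file with its size and line count, and to count the lines of the confusion file the same way. The Python sketch below is only a suggestion and not part of TICCL: `summarize_outputs` is a hypothetical helper, and the directory and prefix in the usage example are taken from the run logs earlier in this thread.

```python
from pathlib import Path


def summarize_outputs(output_dir, prefix):
    """Return (name, size_in_bytes, line_count) for each file starting with prefix."""
    rows = []
    for path in sorted(Path(output_dir).glob(prefix + "*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            line_count = sum(1 for _ in handle)
        rows.append((path.name, path.stat().st_size, line_count))
    return rows


if __name__ == "__main__":
    # Directory and prefix as they appear in the earlier run logs.
    for name, size, lines in summarize_outputs("output/fin/zzz/TICCL", "TICCLtest"):
        print(f"{name}\t{size} bytes\t{lines} lines")
```

The first file in the pipeline order that reports 0 lines shows at which step the run stopped producing output.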

Thank you!

Greetings,

Martin
