vanlummelhuizen / ASL-ELAN

Enhance corpus updater script #1

Closed ghost closed 7 years ago

ghost commented 8 years ago

Every night, Signbank creates a fresh ECV to reflect any changes in the lexical database. Glosses in the corpus, however, are only updated when eaf files are opened and ELAN downloads the new ECV. This can lead to outdated search results and frequency counts.

elan-ecv-updater.zip (source code with Han) does two things:

Elan-ecv-updater.jar depends on xercesImpl-2.11.0.jar and xml-apis-1.4.01.jar. It is called by

java -Xmx600m -cp elan-ecv-updater.jar;xercesImpl-2.11.0.jar;xml-apis-1.4.01.jar mpi.eudico.client.annotator.util.CorpusECVUpdater -L nld Inputfolder Outputfolder

where -Xmx600m sets the maximum Java heap size to 600 MB, and -L nld enforces Dutch as the content language.

[ works well, but there is room for improvement. Ideally this script should run unsupervised every night, so that the whole corpus reflects the current state of the information in the lexical database. There already is an svn-checkout on the Signbank server. --> moved to separate issue, OC 10 Jun 16]

Issues/enhancements:

ghost commented 8 years ago

A few more remarks:

vanlummelhuizen commented 8 years ago

Now only changes in annotations are considered saveable. The changes are now written to stdout. Example:

CHANGE: TIER: GlossR S2 - ANNOTATION_ID: a210 !! ANNOTATION_VALUE changed: LATER-C => VAN-TOT

Use a pipe (|) or redirection (>) for further processing. I could implement a way to save this info to a file, with the filename given as a command-line argument.
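As an illustration of such further processing, here is a minimal sketch that parses the value-change line shown above into fields. It is not part of ELAN: the function names and the regular expression are assumptions based only on the single example line in this comment, and any other line shape is ignored.

```python
import re

# Pattern derived from the one example line in this thread (an assumption,
# not a format specification). Lines that do not match are skipped.
CHANGE_RE = re.compile(
    r"^CHANGE: TIER: (?P<tier>.+?) - ANNOTATION_ID: (?P<id>\S+) "
    r"!! ANNOTATION_VALUE changed: (?P<old>.+?) => (?P<new>.+)$"
)


def parse_change(line):
    """Return a dict for a value-change line, or None for anything else."""
    match = CHANGE_RE.match(line.strip())
    return match.groupdict() if match else None


def parse_stream(lines):
    """Yield one dict per recognised change line in an iterable of lines."""
    for line in lines:
        record = parse_change(line)
        if record is not None:
            yield record
```

Passing `sys.stdin` as `lines` would let a script built on this sit at the receiving end of the pipe.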

You can test all this by downloading the latest 4.9.3 jar and calling it with the command

java -Xmx600m -cp elan-4.9.3.jar:xercesImpl-2.11.0.jar:xml-apis-1.4.01.jar mpi.eudico.client.annotator.util.CorpusECVUpdater -L nld Inputfolder Outputfolder

Next I will try to find a convenient place in the ELAN GUI where this function can reside. Suggestions?

ocrasborn commented 8 years ago

File > Multiple File Processing?

Dr. O.A. Crasborn Department of Linguistics & Centre for Language Studies Radboud University, The Netherlands

http://www.ru.nl/sign-lang http://www.ru.nl/gebarentaal http://www.gebareninzicht.nl

vanlummelhuizen commented 8 years ago

Seems like a cosy spot. I will do that.

ghost commented 8 years ago

Alas, I get the following error:

Error: Could not find or load main class mpi.eudico.client.annotator.util.CorpusECVUpdater

vanlummelhuizen commented 8 years ago

Oops, I did not upload the jar properly. Sorry. Please try again.

ghost commented 8 years ago

Still no luck, same error message. Elan 4.9.3 jar, right?

vanlummelhuizen commented 8 years ago

Last quick idea: the classpath after -cp uses : (colon) as separator on Linux and OS X, while Windows uses ; (semicolon). So if you use my command on Windows, replace each : with ;
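A wrapper script can sidestep this difference by letting the platform supply the separator. The sketch below is only an illustration: the helper names are made up, and the jar names, class name and options are copied from the command quoted in this thread. Python's `os.pathsep` is `:` on Linux and OS X and `;` on Windows.

```python
import os


def build_classpath(jars, sep=os.pathsep):
    """Join jar names with the classpath separator of the current platform."""
    return sep.join(jars)


def updater_command(jars, language, input_dir, output_dir, sep=os.pathsep):
    """Assemble the argument list for the updater call quoted in this thread."""
    return [
        "java", "-Xmx600m",
        "-cp", build_classpath(jars, sep),
        "mpi.eudico.client.annotator.util.CorpusECVUpdater",
        "-L", language, input_dir, output_dir,
    ]
```

The resulting list can be handed to `subprocess.run` unchanged; the `sep` parameter exists only so the behaviour of the other platform can be checked.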

ghost commented 8 years ago

Good catch. I compared your command with Han's and saw what had changed, but I overlooked the (semi-)colons.

Running now, will post back with findings.

ghost commented 8 years ago

Very nice, seems to run more smoothly than the previous version. A few remarks though:

vanlummelhuizen commented 8 years ago

Your first point was easily fixable: instead of letting ELAN write an unchanged EAF to disk (with changed DATE) I simply let the program copy the file from source folder to destination folder. Just removing this will fix it. Could be a commandline argument perhaps.

Your second point is also fixed now.

I also incorporated this tool in the ELAN GUI: File > Multiple File Processing > Update transcriptions for ECVs... Only the language setting is not yet available here. Coming soon.

vanlummelhuizen commented 8 years ago

First attempt at making the language setting available is now in the jar. It is simply a textfield in which e.g. 'eng' or 'nld' should be typed (similar to the -L switch on the commandline). I will think about the possibility to make it a dropdown list, although it either involves a complete list of languages or a list of available languages that is extracted from the files in the chosen folder, which might take a while.

ghost commented 8 years ago

Some comments on the GUI:

ghost commented 8 years ago

I tested the script with the three possibilities:
1) an annotation value is changed in Signbank (update value in eafs)
2) a CVE_REF needs to be added (gloss exists in SB, add ref in eafs)
3) a CVE_REF will be deleted (gloss does not exist anymore in SB, delete ref from eafs)

This is an excerpt of the results file:

CHANGE: TIER: GlossR S2 - ANNOTATION_ID: a428 !! ANNOTATION_VALUE changed: KERMIS => KERMIS-A
Processed file: CNGT0050.eaf
CHANGE: TIER: GlossL S2 - ANNOTATION_ID: a447 - ANNOTATION_VALUE: PT !! CVE_REF added: gloss2109 - PT
CHANGE: TIER: GlossR S1 - ANNOTATION_ID: a12 - ANNOTATION_VALUE: LATER-A !! CVE_REF removed: gloss4903
Processed file: CNGT0120.eaf

All works fine; the only thing missing in the output is the name of the file that was processed after the addition of a CVE_REF (second line, in this case CNGT0116.eaf).

vanlummelhuizen commented 8 years ago

Download the latest jar to test.

Now up:

vanlummelhuizen commented 8 years ago

In the latest jar the recursion setting is working. There is now also a progress bar. The bar itself is not really informative, because the tool walks recursively through a folder tree without knowing how many files there are yet to process. So it only shows that something is happening.
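One possible way to make such a bar informative is to collect the file list in a first pass, so the total is known before processing starts. The sketch below illustrates that idea only; it is not how the ELAN tool works, and `handle` and `report` are hypothetical callbacks standing in for the per-file update and the progress display.

```python
import os


def find_eaf_files(root, recursive=True):
    """Return the sorted paths of all .eaf files under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        found.extend(
            os.path.join(dirpath, name)
            for name in filenames
            if name.lower().endswith(".eaf")
        )
        if not recursive:
            dirnames.clear()  # stop os.walk from descending into subfolders
    return sorted(found)


def process_with_progress(root, handle, report, recursive=True):
    """Process every file and report (done, total) after each one."""
    files = find_eaf_files(root, recursive)
    total = len(files)
    for done, path in enumerate(files, start=1):
        handle(path)
        report(done, total)
    return total
```

The trade-off is an extra walk over the folder tree before any file is touched, which is usually cheap compared with parsing the files themselves.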

vanlummelhuizen commented 8 years ago

'specify a language'-> show languages in current transcription

ghost commented 8 years ago

I did another test run. I think Java Heap Size is not large enough, because ELAN became as good as unresponsive, and the 'update transcriptions for ECVs' window closed and reopened a couple of times. In the end there was an empty 'Proces result' (or something) window open, that only showed its contents after mouse-selecting the containing text. It said it stopped with no errors, although it had not processed all directories (I could tell because I wrote changes to a different place). There was another window with what appeared to contain changes to glosses, but I couldn't copy its content (javaw.exe was already using more than 1 GB of memory). No files were written though, only empty directories were made.

ghost commented 8 years ago

Oh, and after closing ELAN I opened and closed it once again, and the log now only contains information from that last opening and closing...

ghost commented 8 years ago

Well now breaks my wooden shoe. I re-ran a test to see if I could salvage a log file, and now everything runs smoothly. Processing speed is good, the process ends with a window containing all changes (note: a save button would be handy), and changed files are written to a different location as specified. Javaw.exe is using 868,212 kB of memory after completion. If I run the process again without closing ELAN first, the whole thing comes to a standstill almost immediately; I suppose this is what happened yesterday, that I somehow started the process twice. (Can't this memory be freed? This can also be a problem when opening too many eaf files.)

My findings:

vanlummelhuizen commented 8 years ago

Improvements:

Download the new jar to test.

Up next: the memory problems

vanlummelhuizen commented 8 years ago

Got a hint:

Error while processing files: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

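GC = garbage collection. It finds and removes (collects) unused stuff (garbage). This error means the JVM is spending almost all of its time on garbage collection while reclaiming very little memory. Apparently it cannot cope with this many files in such a short time. I am going to search for a solution.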

ghost commented 8 years ago

Works now as expected. Tested again with the three possibilities:
1) an annotation value is changed in Signbank (update value in eafs)
2) a CVE_REF needs to be added (gloss exists in SB, add ref in eafs)
3) a CVE_REF will be deleted (gloss does not exist anymore in SB, delete ref from eafs)

Result (excerpt):

FILE: D:\temp\eaf\CNGT0000-CNGT0099\CNGT0018.eaf
TIER: GlossR S1 - ANNOTATION_ID: a183 - ANNOTATION_VALUE: KRANT-B !! CVE_REF added: gloss2073 - KRANT-B
TIER: GlossR S1 - ANNOTATION_ID: a184 - ANNOTATION_VALUE: KRANT-A !! CVE_REF added: gloss1736 - KRANT-A
TIER: GlossR S1 - ANNOTATION_ID: a185 - ANNOTATION_VALUE: GOED-A !! CVE_REF removed: gloss4783
TIER: GlossR S1 - ANNOTATION_ID: a186 !! ANNOTATION_VALUE changed: GRONINGEND => GRONINGEN

A minor cosmetic issue may be that FILE: D:\temp\eaf\CNGT0000-CNGT0099\CNGT0018.eaf is written also for files that did not change. But as far as I'm concerned, we're good to go :-)
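Until that is changed in the tool, the report can be post-processed so that only files with at least one change remain. The following is a minimal sketch assuming the report looks exactly like the excerpt above (a `FILE:` line followed by zero or more `TIER:` lines); the function names are made up for this illustration.

```python
def change_kind(line):
    """Classify a TIER: line by the marker that follows the '!!'."""
    if "!! CVE_REF added:" in line:
        return "ref_added"
    if "!! CVE_REF removed:" in line:
        return "ref_removed"
    if "!! ANNOTATION_VALUE changed:" in line:
        return "value_changed"
    return None


def group_changes_by_file(lines):
    """Map each FILE: path to its change lines, leaving out unchanged files."""
    grouped = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("FILE: "):
            current = line[len("FILE: "):]
        elif line.startswith("TIER: ") and current is not None:
            # An entry is only created once a change line is seen, so a
            # FILE: header without changes never shows up in the result.
            grouped.setdefault(current, []).append(line)
    return grouped
```

Counting the values of `change_kind` over the grouped lines gives a quick per-run summary of the three cases tested above.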

ocrasborn commented 8 years ago

@vanlummelhuizen, what determines the default language if I select none in the updater window, as below? [screenshot 2016-06-10 10 02 55]

vanlummelhuizen commented 8 years ago

@ocrasborn The default language is the language of the tier. If the tier does not have a language specification, nothing happens.

vanlummelhuizen commented 8 years ago

Ok, so far the memory problems when running the updater can only be tackled by enlarging the available memory in proportion to the number of files to be processed, even though the process really goes through all files sequentially. Somehow a lot of unused data is kept in memory, while the Java mechanism for freeing memory ('garbage collection') is not freeing enough. I cannot find out why this happens.

Since it has already taken quite a lot of time, I will let it rest for now.

vanlummelhuizen commented 8 years ago

Revisit. The memory problems have been solved (for a while now). There is also already a cron job on the server, but it was not working correctly. I can probably solve that too by adding one character to the cron job script. Perhaps I will wait to do so until @ocrasborn is back from South Korea.