openstenoproject / plover

Open source stenotype engine
http://opensteno.org/plover
GNU General Public License v2.0

import and convert dictionaries from other programs #39

Closed balshetzer closed 11 years ago

balshetzer commented 12 years ago

Many Plover users have steno experience with other programs and therefore have mature dictionaries in those programs' formats. A tool should exist to easily convert other programs' dictionaries to the Plover dictionary format.

stenoknight commented 12 years ago

Someone's also mentioned the possibility of a steno student starting on Plover and being able to export their Plover dictionary to rtf/cre format so that they can use it with commercial steno software. That's a pretty low priority, in my opinion, but something to consider as an additional feature in the conversion script.

stenoknight commented 12 years ago

Some steps towards converting from DigitalCAT-formatted rtf/cre to Eclipse-formatted rtf/cre, which will make it convertible to Plover's json format. (Thanks to Ed)

//============Optional dual installation comments start============\

My install of ploverwin ver 212 dated approx April 16, 2012 uses a dict format setting of "DCAT" in the config file. It works well for words (but not numbers and not commands).

My next ploverwin install after that was ver 220 dated approx June 4, 2012. Ploverwin ver 220 has errors using the DC format, so it must use the dict format setting of "Eclipse" in the config.json file.

After I installed ver 220, I still had ver 212 installed, so I had 2 separate ploverwin installations, at 2 different locations, using 2 "different Kim dictionaries", using 2 different dictionary configuration settings.

Note that I run only a single version of ploverwin at a time.

Also note that when I say "different Kim dictionaries" I mean different names, different locations, different config settings, but the actual dictionary entries started out being exactly the same (prior to the start of the conversion change process).

This dual installation situation helped me to experiment because I could easily test for word output differences between that DC supported version and the Eclipse only version.

\============Optional dual installation comments end====================//


When Plover refuses to output English due to steno in the wrong format, it often outputs the correctly formatted steno that it wants to see in the dictionary.

This often happens with words and also with numbers.

I looked in your dictionary, Mirabai, to see whether certain character combinations were present. If they were NOT present in your dictionary, I edited them out of Kim's DC format dictionary (the one I was trying to convert from the DC to the Eclipse format).

I am a novice user of TED Notepad. http://jsimlo.sk/notepad/download.php

This is my 1st contact with regular expressions.

My editing method is "find & replace all" with case sensitive on and regex off. Before each "find & replace all", I run "find" with case sensitive on and regex on, to check that the replacement will do what I want without changing anything else (I want the English right side unchanged).

To search only the English (right side only), I use the following Perl regex: ": ".*#.*$

(# is now representing what I am searching for.)

To search only the steno (left side only), I use the following Perl regex: ^.*#.*": (Note that there is a space after the colon.)

(# is now representing what I am searching for.)
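The two one-sided searches described above can be sketched in Python. This is only an illustration, not the TED Notepad procedure itself: the function names are made up for the example, and it assumes each dictionary line looks like "STENO": "English", with the text `": ` appearing only once per line.

```python
import re


def matches_steno_side(line, needle):
    """True if `needle` occurs on the steno (left) side of a dictionary line.

    Hypothetical helper: relies on the left side being followed by `": `.
    """
    pattern = "^.*" + re.escape(needle) + '.*": '
    return re.search(pattern, line) is not None


def matches_english_side(line, needle):
    """True if `needle` occurs on the English (right) side of a dictionary line.

    Hypothetical helper: relies on the right side being preceded by `": "`.
    """
    pattern = '": ".*' + re.escape(needle) + ".*$"
    return re.search(pattern, line) is not None


if __name__ == "__main__":
    sample = '"A-T": "at",'
    print(matches_steno_side(sample, "A-T"))
    print(matches_english_side(sample, "A-T"))
```

Running a check like this before each "find & replace all" shows whether a replacement would also touch the English side. An entry whose English text itself contains `": ` would confuse both helpers.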

I don't know if it matters in what order the changes are done. I usually change one character combination at a time. My 1st change was "*-" to "*". My 2nd change was "-E" to "E". My 3rd change was "-U" to "U".
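The one-combination-at-a-time edits could in principle be collapsed into a single rule. The sketch below assumes, based only on the changes listed in this comment, that a hyphen is redundant in any stroke that contains a vowel key (A, O, E, U) or the asterisk. That rule and the function names are assumptions for illustration, not Plover's actual code, and number strokes are not handled here.

```python
# Keys whose presence is assumed to make the hyphen unnecessary.
IMPLICIT_HYPHEN_KEYS = frozenset("AOEU*")


def drop_redundant_hyphen(stroke):
    """Remove the hyphen from one stroke if a vowel or asterisk is present."""
    if "-" in stroke and any(key in IMPLICIT_HYPHEN_KEYS for key in stroke):
        return stroke.replace("-", "", 1)
    return stroke


def to_eclipse_style(outline):
    """Apply the rule to every stroke of a slash-separated outline."""
    return "/".join(drop_redundant_hyphen(s) for s in outline.split("/"))


if __name__ == "__main__":
    for outline in ("A-T", "-T", "K-E/S-G", "*-T"):
        print(outline, "->", to_eclipse_style(outline))
```

Because the function only ever sees the steno side, the English translations cannot be changed by accident, which was the main risk with a plain "find & replace all".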

As you see below, the backup (BU) files are named after the changes they contain. (If anyone wants the ~654 megs of BU files I will gladly send them.) Here are directory listings of dictionary BU files sorted by time:

Directory of C:\PloverDictBUsForMirabai1

05/26/2012 04:03 AM 7,593,697 dictDcat001.json Used with ver212 DC format
06/06/2012 12:49 PM 7,593,852 dict.jsonDc04 Added PloverToggle for ver220
06/06/2012 02:15 PM 7,535,724 dict.jsonDc05StarHyphenStar ("*-" to "*") 1st edit to convert from dc to eclipse format
06/06/2012 03:21 PM 7,527,938 dict.jsonDc06-EE
06/06/2012 06:16 PM 7,526,967 dict.jsonDc07-UU
06/07/2012 06:27 AM 7,515,531 dict.jsonDc08O-ROR(3) (3 English entries to manually change back after find & replace)
06/07/2012 04:13 PM 7,491,127 dict.jsonDc09O-UOU
06/07/2012 05:02 PM 7,488,026 dict.jsonDc10A-TAT
06/07/2012 05:18 PM 7,466,532 dict.jsonDc11A-PAP(3)
06/07/2012 05:39 PM 7,458,785 dict.jsonDc12A-BAB(2)
06/07/2012 05:54 PM 7,456,626 dict.jsonDc13A-DAD(1)
06/07/2012 07:07 PM 7,419,784 dict.jsonDc14A-EAE
06/07/2012 07:17 PM 7,416,467 dict.jsonDc15A-FAF
06/07/2012 07:56 PM 7,415,002 dict.jsonDc16A-GAG(4) (4 English entries to manually change back after find & replace)
06/07/2012 11:15 PM 7,407,436 dict.jsonDc17A-LAL
06/08/2012 01:17 AM 7,396,034 dict.jsonDc18A-RAR
06/08/2012 01:31 AM 7,393,193 dict.jsonDc19A-SAS(1)
06/08/2012 01:41 AM 7,386,891 dict.jsonDc20A-UAU
06/08/2012 01:53 AM 7,386,011 dict.jsonDc21A-ZAZ
06/08/2012 01:53 AM 7,386,011 dict.jsonA99
06/08/2012 05:18 AM 7,312,143 dict.jsonDc22O-EOE
06/08/2012 05:37 AM 7,305,776 dict.jsonDc23O-B_and_O-F
06/08/2012 12:30 PM 7,290,608 dict.jsonDc24O-POP
06/08/2012 12:40 PM 7,286,116 dict.jsonDc25O-LOL
06/08/2012 12:51 PM 7,284,844 dict.jsonDc26O-GOG
06/08/2012 01:01 PM 7,282,933 dict.jsonDc27O-TOT
06/08/2012 01:10 PM 7,281,163 dict.jsonDc28O-SOS
06/08/2012 03:40 PM 7,280,076 dict.jsonDc30O-DOD
06/08/2012 03:50 PM 7,279,586 dict.jsonDc31O-ZOZ
06/08/2012 06:34 PM 7,269,923 dict.jsonDc32S-ESE
06/08/2012 06:53 PM 7,268,272 dict.jsonDc33S-USU
06/08/2012 07:08 PM 7,260,433 dict.jsonDc34T-ETE
06/08/2012 07:13 PM 7,259,305 dict.jsonDc35T-UTU
06/08/2012 08:37 PM 7,252,035 dict.jsonDc36K-EKE
06/08/2012 08:39 PM 7,249,765 dict.jsonDc37K-UKU
06/08/2012 08:50 PM         0 dict.jsonDc38-EE missed these 1st pass
06/08/2012 09:16 PM 7,240,687 dict.jsonDc39P-EPE
06/08/2012 09:35 PM 7,230,956 dict.jsonDc40W-EWE
06/08/2012 09:41 PM 7,216,056 dict.jsonDc41H-EHE
06/08/2012 09:48 PM 7,186,221 dict.jsonDc42R-ERE
06/08/2012 10:03 PM         0 dict.jsonDc43-UU missed these 1st pass
06/08/2012 10:17 PM 7,184,559 dict.jsonDc44P-UPU
06/08/2012 10:22 PM 7,181,255 dict.jsonDc45W-UWU
06/08/2012 10:24 PM 7,176,346 dict.jsonDc46H-UHU
06/08/2012 10:28 PM 7,171,239 dict.jsonDc47R-U``RU
06/08/2012 10:28 PM 7,171,239 dict.jsonA98
06/08/2012 10:28 PM 7,171,239 dict_json_A98

This completes word conversion.

I worked on the word part of the conversion 1st, and started on the numbers only after the word part was done.

A DigitalCAT format rule for any single stroke containing a number seems to be that the steno will begin with the character "#", so by sorting the entire dictionary I ended up with most (but not all) of the number entries grouped together. I then made a numbers-only dictionary file to experiment with the numbers.

The Eclipse rule for any single stroke containing a number seems to be that the character "#" will NOT be in the steno -- only the number will be in the steno.

So that gives me my 1st number-centric edit: delete every "#" from the steno side, but only in strokes that contain a number.

Eclipse seems to use the "#" character in the steno for a single stroke only when the number bar is used without a number key being pressed. So for any strokes that did NOT have numbers 1234506789 I did not delete the "#" character.
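The "#" rule described above can be written as a small function. This is a sketch under the assumptions stated in this comment only (a stroke that contains a digit loses its "#", a stroke without digits keeps it). The function name is made up, and the rule has not been checked against Eclipse or Plover themselves.

```python
def strip_number_sign(outline):
    """Drop '#' from each stroke that contains a digit; keep it otherwise."""
    converted = []
    for stroke in outline.split("/"):
        if any(ch.isdigit() for ch in stroke):
            # A number is present, so the number bar is implied.
            stroke = stroke.replace("#", "")
        converted.append(stroke)
    return "/".join(converted)


if __name__ == "__main__":
    for outline in ("#1-D", "#", "#-T/#2"):
        print(outline, "->", strip_number_sign(outline))
```

Working stroke by stroke matters for multi-stroke outlines, where one stroke may hold a number and the next may be the number bar on its own.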

06/09/2012 12:29 AM 7,171,227 dict.jsonA97#s
06/09/2012 10:57 PM 7,171,105 dict.jsonA96#doublingDone (11223344550066778899) (edited to make them work)
06/09/2012 11:17 PM 7,171,195 dict.json95BadTopEnd--------------\
06/09/2012 11:23 PM 7,171,129 dict.json94GoodTop!!--------------->just markers
06/09/2012 11:35 PM 7,171,196 dict.json93GoodTop!!AndBottomZZZ--/
06/09/2012 11:44 PM 7,171,196 !!dict92Alpha#Sorted.json <<<<<<<<<<<<<SORTED
06/10/2012 01:28 AM    46,968 #dict#only01.txt
06/10/2012 01:34 AM    44,675 #dict#only02_2293#sDeleted.txt
06/10/2012 02:07 AM    44,724 #dict#only02_2293#sDeletedB.json
06/10/2012 08:31 PM    44,281 !dict#only09.json

I continued to work on numbers, but in a different folder.

To summarize the number work: 1st I sorted the dictionary, then I isolated all the number entries starting with "#" and worked on them in a numbers-only dict (removing the "#" character and some hyphens). There were 75 other number entries that did not start with "#", so I had to work on those as well.
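The sort-and-isolate step could also be done without sorting, by splitting the dictionary into three groups. The sketch below assumes the dictionary has already been loaded as a Python dict of steno to English. The function name and the three-way split are illustrative choices, not part of the procedure described here.

```python
def partition_number_entries(dictionary):
    """Split entries into: steno starting with '#', other number entries, words.

    "Other number entries" are assumed to be outlines that contain a digit
    or a '#' somewhere but do not start with '#'.
    """
    leading_hash, other_numbers, words = {}, {}, {}
    for steno, english in dictionary.items():
        if steno.startswith("#"):
            leading_hash[steno] = english
        elif any(ch.isdigit() or ch == "#" for ch in steno):
            other_numbers[steno] = english
        else:
            words[steno] = english
    return leading_hash, other_numbers, words


if __name__ == "__main__":
    sample = {"#1": "1", "KAT": "cat", "K-5": "x", "TWO/#2": "two 2"}
    for group in partition_number_entries(sample):
        print(group)
```

The middle group corresponds to the entries that sorting alone did not bring together.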

Note that this was to get the numbers to work for the "dictionary defined numbers" only.

At this point the right side numbers bug is present, because the dictionary entries that address that bug are not in the dictionary yet.

Directory of C:\PloverDictBUsForMirabai2

06/08/2012 10:28 PM 7,171,239 cA98_No#sChangedYet.json
06/10/2012 10:13 PM 7,171,330 B88_A98_plus2Lines.json
06/10/2012 10:47 PM 7,171,330 B87_sorted.json
06/10/2012 11:00 PM 7,171,302 B75_sameAs_B86.json
06/10/2012 11:00 PM 7,171,302 B86_oneTestLineRemoved.json
06/10/2012 11:46 PM         0 B86oneLineDelFromFull.json-
06/10/2012 11:51 PM    48,398 B85#sMostly#s.json
06/11/2012 12:17 AM    46,002 B84_DelAll#s.json
06/11/2012 12:26 AM    46,006 B83put4#sBack.json
06/11/2012 12:32 AM    45,564 B82-EE-UU.json I guess I missed these before
06/11/2012 01:35 AM    45,546 B8159-D59D.json
06/11/2012 01:48 AM    45,597 B815-D5Drem.json
06/11/2012 02:01 AM    45,517 B800-D0D.json
06/11/2012 08:48 PM    45,515 B795-G5G5-R5R.json
06/12/2012 04:38 AM    46,399 B800-D0D_rem.json
06/12/2012 06:15 AM    45,937 B78Degree.json
06/12/2012 06:43 AM    45,933 B77_.json
06/12/2012 05:44 PM 7,123,010 B74FullButTop#sDeleted.json
06/13/2012 03:42 AM 7,122,977 B73!WorkgOnThe75#s.json
06/13/2012 04:46 AM 7,122,995 B72!WorkgOnThe75#s.json
06/13/2012 03:43 PM 7,122,988 B71!WorkgOnThe75#s.json
06/13/2012 04:01 PM 7,122,964 B70!WorkgOnThe75#s.json
06/13/2012 04:55 PM 7,122,966 B69!WorkgOnThe75#s.json
06/13/2012 05:48 PM 7,122,966 B68!WorkgOnThe75#sGUD.json (GUD=GOOD)
06/13/2012 06:21 PM 7,122,981 B67!WorkgOnThe75#sBAD.json
06/13/2012 07:35 PM 7,168,982 B66(B76onTopOfB67)BAD.json
06/13/2012 08:23 PM 7,122,939 B65!WorkgOnThe75#sGUD.json
06/13/2012 08:30 PM 7,168,940 B64_(B76onTopOfB65)GUD.json
06/14/2012 12:44 AM 7,137,292 B63_Slash-E_Slash-U.json /-E to /E /-U to /U
06/14/2012 12:59 AM 7,137,289 B62_0thru9-E_0thru9-U.json more -E to E -U to U
06/14/2012 01:09 AM 7,137,288 B61_Line232208_Del1#.json <<< Converted (maybe)

I think the file B61 has all the needed changes to be in the eclipse format for plover, but I am not absolutely sure, because it needs more testing.

The right side numbers bug did not get (maybe) resolved until later, so it is still in B61

The top of this file contains entries that may fix the right side numbers bug:

06/16/2012 12:50 AM 7,142,087 B54_1234with-AllCombosOfRtSide.json

_End_Of_Message_And_End_OfFile-> ->

stenoknight commented 12 years ago

From the Launchpad site (for Eclipse-formatted dictionaries):

A list of (Vim-flavored) regular expressions that will convert a dictionary exported in rtf/cre format into Python dictionary format. Ideally this should be turned into a simple script that new users can run on their dictionaries without prior knowledge of regular expressions. This has only been fully tested with rtf/cre dictionaries exported by Eclipse. Additional formatting is probably necessary for rtf/cre files exported from CAT software other than Eclipse. More testing is required. Note that Plover currently supports two types of steno dictionary: Eclipse format, where hyphens are only made explicit when necessary, and DigitalCAT format, where all hyphens are explicit. Default format is Eclipse, so if you are importing a DigitalCAT dictionary, change the format in Plover's .config file.


escape backslashes

%s/\/\/g

escape "

%s/"/\\"/g

convert double spaces to single spaces

%s/  / /g

Remove lines with court reporter-specific paragraphing commands (this is drastic, but they cause no end of trouble. Will maybe try to support them to some degree in a later version.)

%s/^.*{$}.*$\n//

%s/^.*\\par\\.*$\n//

Convert steno half of entry to Python format

%s/{\\.\\cxs \([^}]\+\)}/"\1": /

Get rid of any lines that don't start with quotes. (i.e., more court reporting formatting residue)

%s/^[^"].*$\n//

Convert infixes.

%s/: \\cxds \(.*\)\\cxds/: {^\1^}/

Convert suffixes.

%s/: \\cxds \(.*\)/: {^\1}/

Convert prefixes.

%s/: \(.*\)\\cxds/: {\1^}/

Delete "force uncap" command (caption-specific command that Plover doesn't need to implement now, if ever.)

%s/{l1}//g

%s/{l0}//g

Delete \cxp, the punctuation marker, since Plover recognizes specific punctuation marks independently.

%s/\\cxp//g

Convert glue strokes.

%s/\\cxfing /\&/g

Convert "cap next" strokes.

%s/\\cxfc /-|/g

Convert "stitch" strokes to suffix with hyphen.

%s/{\\cxstit /{^-/

Search for other cx strokes and deal with them manually.

/cx

Delete spaces at ends of line.

%s/ \n/^M/g - (don't type in the ^M; do control-q, then control-m, and what will display is ^M)

Convert other half of entries.

:%s/^"\([-A-Z0-9\/*]\+\)": \(.*\)$/"\1": "\2",

Put in curly brackets at beginning and end of dictionary

I'm sure there's a way to do this automatically, but I just did it manually.
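As a rough idea of what the "simple script" mentioned above might look like, here is a minimal Python sketch. It assumes each entry sits on one line in the form {\*\cxs STENO}translation, handles only the \cxds prefix/suffix/infix marker, and skips every line that is not an entry. The function names are invented for the example, and real rtf/cre exports contain more control words than this covers.

```python
import json
import re

# One dictionary entry: {\*\cxs STENO}translation (assumed to be on one line).
ENTRY = re.compile(r"\{\\\*\\cxs ([^}]+)\}(.*)")
ATTACH = "\\cxds"  # "delete space" marker, used for prefixes and suffixes


def convert_translation(text):
    """Turn a leading/trailing \\cxds into Plover-style {^...} attachment."""
    text = text.strip()
    leading = text.startswith(ATTACH + " ")
    if leading:
        text = text[len(ATTACH) + 1:]
    trailing = text.endswith(ATTACH)
    if trailing:
        text = text[:-len(ATTACH)]
    if leading or trailing:
        text = "{" + ("^" if leading else "") + text + ("^" if trailing else "") + "}"
    return text


def convert_lines(lines):
    """Collect every recognizable entry into a steno -> translation dict."""
    result = {}
    for line in lines:
        match = ENTRY.search(line)
        if match is None:
            continue  # formatting residue such as \par lines is dropped
        steno, translation = match.groups()
        result[steno] = convert_translation(translation)
    return result


if __name__ == "__main__":
    sample = [r"{\*\cxs KAT}cat", r"{\*\cxs -G}\cxds ing", r"\par"]
    # json.dumps adds the outer curly brackets and escapes quotes/backslashes.
    print(json.dumps(convert_lines(sample), indent=0))
```

Letting `json.dumps` write the output takes care of the quote and backslash escaping and of the curly brackets at the beginning and end of the dictionary, so those steps no longer have to be done by hand.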

You can find a ~9 mb zip file containing several unconverted dictionaries in rtf format and a few converted dictionaries in json format as well, in both Eclipse (only necessary hyphens explicit) and DigitalCAT (all hyphens explicit) flavors of steno here:

http://stenoknight.com/plover/ploverdicts.zip

The DigitalCAT dictionaries will require much more weeding, since they have extra metadata that the regular expressions in the Launchpad blueprint don't account for. Stuff like dictentrydate, which we can just cut out completely, and conflicts, which will require the sacrifice of the entry, since Plover doesn't support conflict differentiation (nor will it ever, if I have anything to say about it). Basically anything starting with cx is steno-specific metadata.

balshetzer commented 11 years ago

I thought I'd put a reference here to the rtf cre spec: http://www.legalxml.org/workgroups/substantive/transcripts/cre-spec.htm

balshetzer commented 11 years ago

I ran my script on the dictionaries in the zip file and it ran into some problems with ab-digitalcat-0528.rtf because it had something in it that wasn't legal RTF. I took a look and that part of the file didn't make sense. Is it possible that there was some kind of copy-paste change in that file, or is it as it was on export?

balshetzer commented 11 years ago

Plover now supports RTF dictionaries natively.