Closed balshetzer closed 11 years ago
Someone's also mentioned the possibility of a steno student starting on Plover and being able to export their Plover dictionary to rtf/cre format so that they can use it with commercial steno software. That's a pretty low priority, in my opinion, but something to consider as an additional feature in the conversion script.
Some steps towards converting from DigitalCAT-formatted rtf/cre to Eclipse-formatted rtf/cre, which will make it convertible to Plover's json format. (Thanks to Ed)
//============Optional dual installation comments start============\
My install of ploverwin ver 212 dated approx April 16, 2012 uses a dict format setting of "DCAT" in the config file. It works well for words (but not numbers and not commands).
My next ploverwin install after that was ver 220, dated approx June 4, 2012. Ploverwin ver 220 has errors using the DC format, so it must use the dict format setting of "Eclipse" in the config.json file.
After I installed ver 220, I still had ver 212 installed, so I had 2 separate ploverwin installations, at 2 different locations, using 2 "different Kim dictionaries", using 2 different dictionary configuration settings.
Note that I run only a single version of ploverwin at a time.
Also note that when I say "different Kim dictionaries" I mean different names, different locations, different config settings, but the actual dictionary entries started out being exactly the same (prior to the start of the conversion change process).
This dual installation situation helped me to experiment because I could easily test for word output differences between that DC supported version and the Eclipse only version.
\============Optional dual installation comments end====================//
When Plover refuses to output English due to steno in the wrong format, it often outputs the correctly formatted steno that it wants to see in the dictionary.
I looked in your dictionary, Mirabai, to see if certain character combinations were present or not. If they were NOT present in your dictionary, then I removed those character combinations from Kim's DC format dictionary.
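For anyone who would rather script that presence check than eyeball it, here is a minimal Python sketch. The function name, the sample keys, and the candidate combinations are all made up for illustration; it only assumes the reference dictionary is available as a collection of steno strings.

```python
def combos_to_remove(reference_keys, candidate_combos):
    """Return the candidate combinations that never occur in any reference steno key."""
    return [
        combo
        for combo in candidate_combos
        if not any(combo in key for key in reference_keys)
    ]


# Hypothetical Eclipse-style steno keys (stand-in for the reference dictionary).
eclipse_keys = ["AT", "KAT", "-T", "PHE", "HR-T"]

# Hypothetical character combinations seen in the DC-format dictionary.
dcat_combos = ["A-T", "-E", "-T", "*-"]

absent = combos_to_remove(eclipse_keys, dcat_combos)
print(absent)
```

A combination such as "-T" that does occur in the reference dictionary is left alone, which is why the hyphens cannot simply all be stripped in one pass.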
I am a novice user of TED Notepad. http://jsimlo.sk/notepad/download.php
My editing method is "find & replace all" with case sensitive on and regex off. But before "find & replace all", I use "find" with case sensitive on and regex on, to test whether the "find & replace all" will do what I want it to do.
To search only the English (right side only), I use the following Perl regex: ": ".#.$
To search only the steno (left side only), I use the following Perl regex: ^.#.": (Note that there is a space after the colon.)
I don't know if it matters in what order the changes are done. I usually change one character combination at a time. My 1st change was "*-" to "*". My 2nd change was "-E" to "E".
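The same one-combination-at-a-time edit can be sketched in Python so that only the steno side is ever touched. This is not what was actually used (that was TED Notepad); the function name and sample entries are hypothetical, and it assumes the dictionary has been loaded as a plain steno-to-English mapping.

```python
def replace_in_steno(dictionary, old, new, dry_run=True):
    """Replace `old` with `new` in the steno keys only.

    Returns (result, changed). With dry_run=True the result keeps the
    original keys, so `changed` can be reviewed before committing.
    Note: if two keys become identical after the edit, the later one wins.
    """
    result = {}
    changed = {}
    for steno, english in dictionary.items():
        fixed = steno.replace(old, new)
        if fixed != steno:
            changed[steno] = fixed
        result[steno if dry_run else fixed] = english
    return result, changed


# Hypothetical entries; the last one has "A-T" on the English side only.
sample = {"A-T": "at", "KA-T": "cat", "-T": "the", "TPH-T": "A-T"}

# Step 1: preview, like using "find" before "find & replace all".
_, preview = replace_in_steno(sample, "A-T", "AT")
print(preview)

# Step 2: apply for real.
converted, _ = replace_in_steno(sample, "A-T", "AT", dry_run=False)
print(converted)
```

Because the translations are never edited, there should be no English entries to change back by hand afterwards.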
As you see below, BU files have names according to their changes. (If anyone wants the ~654 megs of BU files I will gladly send them.) Here are directory listings of dictionary BU files sorted by time:
Directory of C:\PloverDictBUsForMirabai1
05/26/2012 04:03 AM 7,593,697 dictDcat001.json Used with ver212 DC format
06/06/2012 12:49 PM 7,593,852 dict.jsonDc04 Added PloverToggle for ver220
06/06/2012 02:15 PM 7,535,724 dict.jsonDc05StarHyphen Star ("*-" to "*") 1st edit to convert from dc to eclipse format
06/06/2012 03:21 PM 7,527,938 dict.jsonDc06-E E
06/06/2012 06:16 PM 7,526,967 dict.jsonDc07-U U
06/07/2012 06:27 AM 7,515,531 dict.jsonDc08O-R OR(3) (3 English entries to manually change back after find & replace)
06/07/2012 04:13 PM 7,491,127 dict.jsonDc09O-U OU
06/07/2012 05:02 PM 7,488,026 dict.jsonDc10A-T AT
06/07/2012 05:18 PM 7,466,532 dict.jsonDc11A-P AP(3)
06/07/2012 05:39 PM 7,458,785 dict.jsonDc12A-B AB(2)
06/07/2012 05:54 PM 7,456,626 dict.jsonDc13A-D AD(1)
06/07/2012 07:07 PM 7,419,784 dict.jsonDc14A-E AE
06/07/2012 07:17 PM 7,416,467 dict.jsonDc15A-F AF
06/07/2012 07:56 PM 7,415,002 dict.jsonDc16A-G AG(4) (4 English entries to manually change back after find & replace)
06/07/2012 11:15 PM 7,407,436 dict.jsonDc17A-L AL
06/08/2012 01:17 AM 7,396,034 dict.jsonDc18A-R AR
06/08/2012 01:31 AM 7,393,193 dict.jsonDc19A-S AS(1)
06/08/2012 01:41 AM 7,386,891 dict.jsonDc20A-U AU
06/08/2012 01:53 AM 7,386,011 dict.jsonDc21A-Z AZ
06/08/2012 01:53 AM 7,386,011 dict.jsonA99
06/08/2012 05:18 AM 7,312,143 dict.jsonDc22O-E OE
06/08/2012 05:37 AM 7,305,776 dict.jsonDc23O-B_and_O-F
06/08/2012 12:30 PM 7,290,608 dict.jsonDc24O-P OP
06/08/2012 12:40 PM 7,286,116 dict.jsonDc25O-L OL
06/08/2012 12:51 PM 7,284,844 dict.jsonDc26O-G OG
06/08/2012 01:01 PM 7,282,933 dict.jsonDc27O-T OT
06/08/2012 01:10 PM 7,281,163 dict.jsonDc28O-S OS
06/08/2012 03:40 PM 7,280,076 dict.jsonDc30O-D OD
06/08/2012 03:50 PM 7,279,586 dict.jsonDc31O-Z OZ
06/08/2012 06:34 PM 7,269,923 dict.jsonDc32S-E SE
06/08/2012 06:53 PM 7,268,272 dict.jsonDc33S-U SU
06/08/2012 07:08 PM 7,260,433 dict.jsonDc34T-E TE
06/08/2012 07:13 PM 7,259,305 dict.jsonDc35T-U TU
06/08/2012 08:37 PM 7,252,035 dict.jsonDc36K-E KE
06/08/2012 08:39 PM 7,249,765 dict.jsonDc37K-U KU
06/08/2012 08:50 PM 0 dict.jsonDc38-E E missed these 1st pass
06/08/2012 09:16 PM 7,240,687 dict.jsonDc39P-E PE
06/08/2012 09:35 PM 7,230,956 dict.jsonDc40W-E WE
06/08/2012 09:41 PM 7,216,056 dict.jsonDc41H-E HE
06/08/2012 09:48 PM 7,186,221 dict.jsonDc42R-E RE
06/08/2012 10:03 PM 0 dict.jsonDc43-U U missed these 1st pass
06/08/2012 10:17 PM 7,184,559 dict.jsonDc44P-U PU
06/08/2012 10:22 PM 7,181,255 dict.jsonDc45W-U WU
06/08/2012 10:24 PM 7,176,346 dict.jsonDc46H-U HU
06/08/2012 10:28 PM 7,171,239 dict.jsonDc47R-U RU
06/08/2012 10:28 PM 7,171,239 dict.jsonA98
06/08/2012 10:28 PM 7,171,239 dict_json_A98 This completes word conversion.
I worked on the word part of the conversion 1st.
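Looking back at the word edits, every one of them removes a hyphen that sits directly after "A", "O" or "*", or directly before "E" or "U". Assuming that observation is the whole rule for words (it has not been checked against every entry, and it does not cover the number strokes), the separate passes could be collapsed into one regular expression. The names and sample entries below are hypothetical.

```python
import re

# A hyphen directly after A, O or *, or directly before E or U.
REDUNDANT_HYPHEN = re.compile(r"(?<=[AO*])-|-(?=[EU])")


def drop_redundant_hyphens(steno):
    """Remove hyphens that the Eclipse format leaves implicit (assumed rule)."""
    return REDUNDANT_HYPHEN.sub("", steno)


def convert_keys(dictionary):
    """Apply the hyphen rule to the steno side only; translations are untouched."""
    return {drop_redundant_hyphens(k): v for k, v in dictionary.items()}


# Made-up DC-style entries; "A-T" on the English side must survive unchanged.
sample = {
    "A-T": "at",
    "PH-E": "me",
    "KA*-T": "Kat",
    "-T": "the",
    "O-U/-T": "A-T",
}
converted = convert_keys(sample)
print(converted)
```

Hyphens with no vowel or star next to them, as in "-T", are kept because the Eclipse format still needs them.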
A DigitalCAT format rule for any single stroke containing a number seems to be that the steno will begin with the character "#", so by sorting the entire dictionary I ended up with most (but not all) of the number entries grouped together. I then made a numbers-only dict.
The Eclipse rule for any single stroke containing a number seems to be that the character "#" will NOT be in the steno -- only the number will be in the steno.
So that gives me my 1st number-centric edit: delete all #s from the steno side. The "#" stays only as long as a single stroke does not contain a number.
Eclipse seems to use the "#" character in the steno for a single stroke only when the numberbar is used without a number key being pressed. So for any strokes that did NOT have numbers 1234506789
06/09/2012 12:29 AM 7,171,227 dict.jsonA97#s
06/09/2012 10:57 PM 7,171,105 dict.jsonA96#doublingDone (11223344550066778899) (edited to make them work)
06/09/2012 11:17 PM 7,171,195 dict.json95BadTopEnd--------------\
06/09/2012 11:23 PM 7,171,129 dict.json94GoodTop!!--------------->just markers
06/09/2012 11:35 PM 7,171,196 dict.json93GoodTop!!AndBottomZZZ--/
06/09/2012 11:44 PM 7,171,196 !!dict92Alpha#Sorted.json <<<<<<<<<<<<<SORTED
06/10/2012 01:28 AM 46,968 #dict#only01.txt
06/10/2012 01:34 AM 44,675 #dict#only02_2293#sDeleted.txt
06/10/2012 02:07 AM 44,724 #dict#only02_2293#sDeletedB.json
06/10/2012 08:31 PM 44,281 !dict#only09.json
To summarize the numbers work: 1st I sorted, then I isolated all the number entries starting with "#", then I worked on them in a numbers-only dict (removing the "#" character & some hyphens). But there were 75 other entries that were numbers that did not start with "#", so I had to work on those also.
Note that this was to get the numbers to work for the "dictionary defined numbers" only.
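The number steps could be sketched in Python roughly as follows. This assumes the rules are exactly as guessed above (a stroke that contains a digit loses its "#", a stroke with no digit keeps it), and it leaves out the hyphen changes that the numbers also needed. All names and sample entries are made up.

```python
def convert_number_stroke(stroke):
    """Drop '#' from a stroke that contains a digit; leave digit-free strokes alone."""
    if any(ch.isdigit() for ch in stroke):
        return stroke.replace("#", "")
    return stroke


def convert_number_steno(steno):
    """Apply the rule to each stroke of a (possibly multi-stroke) steno key."""
    return "/".join(convert_number_stroke(s) for s in steno.split("/"))


def split_number_entries(dictionary):
    """Split into: keys starting with '#', other keys containing digits, the rest."""
    hash_first, digits_elsewhere, rest = {}, {}, {}
    for steno, english in dictionary.items():
        if steno.startswith("#"):
            hash_first[steno] = english
        elif any(ch.isdigit() for ch in steno):
            digits_elsewhere[steno] = english
        else:
            rest[steno] = english
    return hash_first, digits_elsewhere, rest


# Hypothetical DC-style entries.
sample = {"#1": "1", "#12": "12", "#": "number bar only", "KAT/#2": "cat 2", "KAT": "cat"}

hash_first, digits_elsewhere, rest = split_number_entries(sample)
converted = {convert_number_steno(k): v for k, v in sample.items()}
print(sorted(hash_first), sorted(digits_elsewhere), sorted(rest))
print(converted)
```

The middle group is the scripted equivalent of the entries that sorting did not bring to the top of the file.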
At this point the right side numbers bug is present.
Directory of C:\PloverDictBUsForMirabai2
06/08/2012 10:28 PM 7,171,239 cA98_No#sChangedYet.json
06/10/2012 10:13 PM 7,171,330 B88_A98_plus2Lines.json
06/10/2012 10:47 PM 7,171,330 B87_sorted.json
06/10/2012 11:00 PM 7,171,302 B75_sameAs_B86.json
06/10/2012 11:00 PM 7,171,302 B86_oneTestLineRemoved.json
06/10/2012 11:46 PM 0 B86oneLineDelFromFull.json-
06/10/2012 11:51 PM 48,398 B85#sMostly#s.json
06/11/2012 12:17 AM 46,002 B84_DelAll#s.json
06/11/2012 12:26 AM 46,006 B83put4#sBack.json
06/11/2012 12:32 AM 45,564 B82-E E-U U.json I guess I missed these before
06/11/2012 01:35 AM 45,546 B8159-D 59D.json
06/11/2012 01:48 AM 45,597 B815-D 5Drem.json
06/11/2012 02:01 AM 45,517 B800-D 0D.json
06/11/2012 08:48 PM 45,515 B795-G 5G5-R 5R.json
06/12/2012 04:38 AM 46,399 B800-D 0D_rem.json
06/12/2012 06:15 AM 45,937 B78Degree.json
06/12/2012 06:43 AM 45,933 B77_.json
06/12/2012 05:44 PM 7,123,010 B74FullButTop#sDeleted.json
06/13/2012 03:42 AM 7,122,977 B73!WorkgOnThe75#s.json
06/13/2012 04:46 AM 7,122,995 B72!WorkgOnThe75#s.json
06/13/2012 03:43 PM 7,122,988 B71!WorkgOnThe75#s.json
06/13/2012 04:01 PM 7,122,964 B70!WorkgOnThe75#s.json
06/13/2012 04:55 PM 7,122,966 B69!WorkgOnThe75#s.json
06/13/2012 05:48 PM 7,122,966 B68!WorkgOnThe75#sGUD.json (GUD=GOOD)
06/13/2012 06:21 PM 7,122,981 B67!WorkgOnThe75#sBAD.json
06/13/2012 07:35 PM 7,168,982 B66(B76onTopOfB67)BAD.json
06/13/2012 08:23 PM 7,122,939 B65!WorkgOnThe75#sGUD.json
06/13/2012 08:30 PM 7,168,940 B64_(B76onTopOfB65)GUD.json
06/14/2012 12:44 AM 7,137,292 B63_Slash-E_Slash-U.json /-E to /E /-U to /U
06/14/2012 12:59 AM 7,137,289 B62_0thru9-E_0thru9-U.json more -E to E -U to U
06/14/2012 01:09 AM 7,137,288 B61_Line232208_Del1#.json <<< Converted (maybe)
I think the file B61 has all the needed changes to be in the eclipse format for plover,
but I am not absolutely sure, because it needs more testing.
The top of this file contains entries that may fix the right side numbers bug:
_End_Of_Message_And_End_OfFile-> ->
From the Launchpad site (for Eclipse-formatted dictionaries):
A list of (Vim-flavored) regular expressions that will convert a dictionary exported in rtf/cre format into Python dictionary format. Ideally this should be turned into a simple script that new users can run on their dictionaries without prior knowledge of regular expressions. This has only been fully tested with rtf/cre dictionaries exported by Eclipse. Additional formatting is probably necessary for rtf/cre files exported from CAT software other than Eclipse. More testing is required. Note that Plover currently supports two types of steno dictionary: Eclipse format, where hyphens are only made explicit when necessary, and DigitalCAT format, where all hyphens are explicit. Default format is Eclipse, so if you are importing a DigitalCAT dictionary, change the format in Plover's .config file.
%s/\/\/g
%s/"/\"/g
%s/ / /g
%s/^.{$}.$\n//
%s/^.\par\.$\n//
%s/{\.\cxs ([^}]+)}/"\1": /
%s/^[^"].*$\n//
%s/: \cxds (.*)\cxds/: {^\1^}/
%s/: \cxds (.*)/: {^\1}/
%s/: (.*)\cxds/: {\1^}/
%s/{l1}//g
%s/{l0}//g
%s/\cxp//g
%s/\cxfing /&/g
%s/\cxfc /-|/g
%s/{\cxstit /{^-/
/cx
%s/ \n/^M/g - (don't type in the ^M; do control-q, then control-m, and what will display is ^M)
:%s/^"([-A-Z0-9\/]+)": (.)$/"\1": "\2",
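As a rough idea of what the "simple script" version of the core step might look like, here is a Python sketch that pulls steno/translation pairs out of rtf/cre text and writes them as JSON. It assumes each entry looks like a `{\*\cxs STENO}` group followed directly by plain translation text, and it does not handle `\cxds`, `\cxfc`, or any of the other control words the expressions above deal with. The function name and sample text are hypothetical.

```python
import json
import re

# A {\*\cxs STENO} group followed by plain text up to the next brace or line end.
ENTRY = re.compile(r"\{\\\*\\cxs ([^}]+)\}([^{}\r\n]*)")


def parse_entries(rtf_text):
    """Return a steno -> translation mapping for the simple entries in rtf_text."""
    return {m.group(1): m.group(2).strip() for m in ENTRY.finditer(rtf_text)}


# Made-up two-entry sample in the assumed layout.
sample = "{\\*\\cxs KAT}cat\n{\\*\\cxs TKOG}dog\n"

entries = parse_entries(sample)
print(json.dumps(entries, indent=0, sort_keys=True))
```

Using `json.dumps` also takes care of escaping quotes and backslashes in the translations, which the first two substitutions above appear to do by hand.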
You can find a ~9 mb zip file containing several unconverted dictionaries in rtf format and a few converted dictionaries in json format as well, in both Eclipse (only necessary hyphens explicit) and DigitalCAT (all hyphens explicit) flavors of steno here:
http://stenoknight.com/plover/ploverdicts.zip
The DigitalCAT dictionaries will require much more weeding, since they have extra metadata that the regular expressions in the launchpad blueprint don't account for. Stuff like dictentrydate, which we can just cut out completely, and conflicts, which will require the sacrifice of the entry, since Plover doesn't support conflict differentiation (nor will it ever, if I have anything to say about it). Basically anything starting with cx is steno-specific metadata.
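Cutting out the entry dates could look something like the Python sketch below. The exact control word in the sample is an assumption about how the DigitalCAT export looks; the pattern just removes any brace group whose control word starts with `\cx` and ends in `dictentrydate`. Conflict entries are not handled here.

```python
import re

# Any {\*\cx...dictentrydate...} group with no nested braces inside it.
ENTRY_DATE = re.compile(r"\{\\\*\\cx\w*dictentrydate[^{}]*\}")


def strip_entry_dates(rtf_text):
    """Remove dictentrydate metadata groups and keep everything else as-is."""
    return ENTRY_DATE.sub("", rtf_text)


# Hypothetical entry; the metadata group's exact spelling is assumed.
sample = "{\\*\\cxs KAT}cat{\\*\\cxsvatdictentrydate\\yr2012\\mo6\\dy8}"

cleaned = strip_entry_dates(sample)
print(cleaned)
```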
I thought I'd put a reference here to the rtf cre spec: http://www.legalxml.org/workgroups/substantive/transcripts/cre-spec.htm
I ran my script on the dictionaries in the zip file and it ran into some problems with ab-digitalcat-0528.rtf because it had something in it that wasn't legal RTF. I took a look and that part of the file didn't make sense. Is it possible that there was some kind of copy-paste change in that file, or is it as it was on export?
Plover now supports RTF dictionaries natively.
Many Plover users have steno experience with other programs and therefore have mature dictionaries in those programs' formats. A tool should exist to easily convert other programs' dictionaries to the Plover dictionary format.