sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

fuzzy suggestions for correction submission #44

Closed by drdhaval2785 8 years ago

drdhaval2785 commented 8 years ago

Taking a cue from the numfuzzy effort of @funderburkjim (this is mostly an adaptation of his code there and of https://github.com/funderburkjim/fuzzyalpha-example), I have tried to add suggested corrections to the submission files. The responsible commit is https://github.com/sanskrit-lexicon/PWK/commit/d8d74e8dadcca277a327ffd627fca387a879c384.

The amended code is in stdabbrv.sh and stdabbrv.py, which were changed to accommodate the fuzzy logic.

The logic is

  1. For any cref entry, if there is a fuzzy match in pwbib1.txt, it is shown as ¯ls@key1@key2@lnum:¯suggestion:t: e.g. ¯BURNELL.T@maDvaBAzya@maDvaBAzya@82746:¯BURNELL,T:t:
  2. If there is no fuzzy match, it is shown as ¯ls@key1@key2@lnum:¯suggestion:n: e.g. ¯BHA7G.P.ed.Bomb@anudapAna@anudapAna@4432:¯BHA7G.P.ed.Bomb:n:

With this, the submission file is reasonably improved. If a suggestion is fine, leave it as it is.
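For readers who want to see the two cases in code, here is a minimal sketch. It is not the code in stdabbrv.py: it assumes Python's standard difflib as a stand-in for the fuzzy matcher, the names suggest and format_line and the 0.6 cutoff are hypothetical, and the known list is only a small sample standing in for the abbreviations in pwbib1.txt.

```python
import difflib

MARK = "\u00af"  # the macron character that prefixes ls and the suggestion


def suggest(ls, known, cutoff=0.6):
    """Return (suggestion, flag).

    flag is 't' with the closest known abbreviation when a fuzzy match is
    found, and 'n' with ls itself when nothing is close enough.
    The cutoff value is an assumption, not taken from stdabbrv.py.
    """
    matches = difflib.get_close_matches(ls, known, n=1, cutoff=cutoff)
    if matches:
        return matches[0], "t"
    return ls, "n"


def format_line(ls, key1, key2, lnum, known):
    """Build one submission line in the ls@key1@key2@lnum:suggestion:flag: layout."""
    suggestion, flag = suggest(ls, known)
    return f"{MARK}{ls}@{key1}@{key2}@{lnum}:{MARK}{suggestion}:{flag}:"


# Hypothetical sample standing in for the abbreviations read from pwbib1.txt.
known = ["BURNELL,T", "HEMA7DRI", "A7PAST"]
```

With this sample list, format_line("BURNELL.T", "maDvaBAzya", "maDvaBAzya", 82746, known) yields the first example line above (flag t), and an abbreviation with no close neighbour in the list comes back unchanged with flag n.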

drdhaval2785 commented 8 years ago

Just to give a glimpse of the output, I am copy pasting 20 entries from cmbsub.txt here.

¯BURNELL.T@maDvaBAzya@maDvaBAzya@82746:¯BURNELL,T:t:
¯C2A7N5KH@aGAhan@aGAhan@849:¯C2A7K:t:
¯HEM@cItkfta@cItkfta@40339:¯H:t:
¯K4AMAPAKA@rahitatva@°rahitatva@93133:¯K4AMPAKA:t:
¯BA7G4AN@ajara@aja/ra@1271:¯RA7G4AN:t:
¯Vardh@paryAya@paryAya@64793:¯Va7rtt:t:
¯K4ARARA@udamehin@udamehin@18709:¯K4ARAKA:t:
¯PAN4K4AT.ed.Bomb@antarvAsika@antarvAsika@5371:¯PAN4K4AT.ed.orn:t:
¯R.GORR@aDiyoDa@aDiyoDa@2889:¯GOBH:t:
¯K4D@aravindinI@aravindinI@9028:¯KA7D:t:
¯C2a7n5kh@upasTa@upa/sTa@20180:¯C2A7K:t:
¯BA7DAR.S@anupraveSa@anupraveSa@4632:¯BA7DAR:t:
¯BHA7G.P.ed.Bomb@anudapAna@anudapAna@4432:¯BHA7G.P.ed.Bomb:n:
¯A7PST@aniha@aniha@4217:¯A7PAST:t:
¯A7PAST.GAUT@Atmavant@Atmavant@14243:¯A7PAST.C2R:t:
¯H4MA7DRI@udvaMSa@udvaMSa@19185:¯HEMA7DRI:t:
¯A7RSHBr@pUrvAtiTa@pUrvAtiTa@69228:¯A7RSH.BR:t:
¯GR2HJ@digvyAGAraRa@digvyAGAraRa@50109:¯GOBH:t:
¯MALLIN@aparicita@aparicita@6129:¯LALIT:t:
¯A7RJABH.S@atyazwi@atyazwi@2280:¯A7RJABH:t:
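Lines in this layout are easy to split back into their fields, which may help anyone who wants to filter or sort the suggestions. The following is only a sketch: parse_line and the Suggestion type are hypothetical names, and it assumes that key1, key2 and lnum never contain "@" and that the flag is always t or n.

```python
from typing import NamedTuple

MARK = "\u00af"  # the macron character that prefixes ls and the suggestion


class Suggestion(NamedTuple):
    ls: str           # abbreviation as found in the dictionary text
    key1: str
    key2: str
    lnum: int
    suggestion: str   # proposed replacement (equal to ls when unmatched)
    matched: bool     # True for flag 't', False for flag 'n'


def parse_line(line: str) -> Suggestion:
    """Split one submission line into its six fields."""
    line = line.strip()
    head, sep, tail = line.partition(":" + MARK)
    if not sep or not head.startswith(MARK) or not tail.endswith(":"):
        raise ValueError(f"unexpected line layout: {line!r}")
    # Split from the right so that an '@' inside ls would not shift the keys.
    ls, key1, key2, lnum = head[len(MARK):].rsplit("@", 3)
    suggestion, flag = tail[:-1].rsplit(":", 1)
    if flag not in ("t", "n"):
        raise ValueError(f"unexpected flag {flag!r} in line: {line!r}")
    return Suggestion(ls, key1, key2, int(lnum), suggestion, flag == "t")
```

Applied to the entries above, this gives for instance lnum 93133 and key2 °rahitatva for the K4AMAPAKA line, with matched set to True.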

gasyoun commented 8 years ago

It's impossible for me to work in such a UI. I do not understand where to look. An HTML view would still be desirable, or am I the only one? Excuse me for complaining, a huge amount of work has been done, but I just can't help in such a format. It's too user-unfriendly. I lack IAST, but that's my issue after all.

drdhaval2785 commented 8 years ago

@gasyoun Is http://sanskrit-lexicon.github.io/PWK/cmbsub.html not suitable? I work with this UI and have not faced much trouble. The txt files are for submitting corrections in the standard format, not for viewing.

drdhaval2785 commented 8 years ago

Fuzzy suggestions are now given in the text file, so it is safe to close this documentation issue.

gasyoun commented 8 years ago

@drdhaval2785 what if a last, additional column of the HTML contained the TXT line? In that case I could copy-paste it without looking for the same entry in the TXT. Most fixes are easy. I could fix them in seconds. But the way it is, it takes minutes, or I just abandon submitting altogether.

drdhaval2785 commented 8 years ago

Dear Gasyoun, you need to look at pw.txt without fail. The reason is that I display only one entry that refers to the work (the alphabetically first one, I guess), but there are many occurrences that are not listed. E.g. the submission for 'Calc' referred to at least three different works. If I had gone by the entry displayed in the HTML, I would have wrongly altered the references to the other books. Showing all occurrences of a reference is not an option either: the file would be more than 10000 entries long, and I don't want to discourage people by its size. That's why I am hiding the other occurrences of the same reference.

The way I work is this: I keep the HTML file open in Firefox, and the submission file and pw.txt open side by side in Notepad++. I copy the reference from the submission file and paste it into the Notepad++ search box, then use 'Find All in Current Document'. If there is more than one entry, I click on each of them in Notepad++ and look at its context. If I can't decide on the entry from the text file, I look at the HTML and finally submit.
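For anyone not using Notepad++, the "find all occurrences" step can be approximated with a few lines of Python. This is only a sketch: find_occurrences and search_file are hypothetical helpers, and it assumes pw.txt is a UTF-8 text file in which the reference appears literally.

```python
from typing import Iterable, Iterator, List, Tuple


def find_occurrences(lines: Iterable[str], needle: str) -> Iterator[Tuple[int, str]]:
    """Yield (line number, line text) for every line that contains needle."""
    for number, text in enumerate(lines, start=1):
        if needle in text:
            yield number, text.rstrip("\n")


def search_file(path: str, needle: str) -> List[Tuple[int, str]]:
    """Collect all occurrences of needle in the file at path (assumed UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return list(find_occurrences(handle, needle))
```

Calling search_file("pw.txt", "Calc") would list every line that mentions that reference, so each occurrence can be checked in context before a correction is submitted.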