S(eite) to be removed from 'ls'

sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889


S(eite) to be removed from 'ls' #32

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

Taking a cue from https://github.com/sanskrit-lexicon/PWK/issues/31#issuecomment-165961508 and @gasyoun's comment elsewhere, I see many places where the S. has been included inside the reference and is therefore shown inside the 'ls' tag.

This can be caught by the regex ¯[A-Za-z0-9]*[.]S[.] in pw.txt

But we need to modify the regex to exclude the cases where S. stands for saMhitA or something of that sort e.g.

J.A.O.S. == Journal of the American Oriental Society. (vol. 1)
.MAITR. S.

etc from pwbib0.txt

We need to make the regex a bit more precise to exclude such cases.

@funderburkjim may help with the regex.

Such an 'S.' may then be added with a space before it, i.e. ' S.'
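One way to narrow the match, sketched in Python under stated assumptions: capture the stem before '.S.' and skip stems listed in an exclusion set. The names `CANDIDATE`, `KEEP_S` and `seite_candidates` are hypothetical, and the sample string is made up rather than taken from pw.txt. 'KAP' appears only because KAP.S is the one legitimate '.S' abbreviation named in this thread; a real exclusion list would have to be built from pwbib0.txt.

```python
import re

# Regex quoted above, with the stem captured: '¯', letters/digits, then '.S.'
CANDIDATE = re.compile(r"¯([A-Za-z0-9]*)[.]S[.]")

# Hypothetical exclusion set: stems whose trailing 'S.' belongs to the title
# and is not 'Seite'. Assumption: a real list would come from pwbib0.txt.
KEEP_S = {"KAP"}


def seite_candidates(line):
    """Return the stems in `line` whose '.S.' looks like a page marker."""
    return [
        m.group(1)
        for m in CANDIDATE.finditer(line)
        if m.group(1) not in KEEP_S
    ]


# Made-up sample text, not taken from pw.txt.
sample = "¯XYZ.S.1,2 ¯KAP.S.3,4"
print(seite_candidates(sample))  # prints ['XYZ']
```

Because the character class contains neither a dot nor a space, this pattern on its own already skips forms such as J.A.O.S. and MAITR. S.; the exclusion set only matters for dot-free stems like KAP.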

drdhaval2785 commented 8 years ago

Commit https://github.com/sanskrit-lexicon/PWK/commit/2ea11319d1834856b776177e409519f5d9d67e3d adjusted the abbrv.py code to account for this problem.

Entries with a trailing [.]S$ were removed whenever re.sub('[.]S$','',reference) was present in cleanrefs.

Stats: sortedcrefs.txt decreased from 2332 to 2293 (39 deletions). For the deletions, see the commit.

Any trailing [.]S$ that remains is either due to a typo or because its parent is not in cleanrefs.
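The rule described in this comment can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `drop_seite_variants`, its arguments, and the sample data are all hypothetical.

```python
import re


def drop_seite_variants(candidates, cleanrefs):
    """Drop 'X.S' from `candidates` when the stripped form 'X' is known.

    Assumption: `candidates` and `cleanrefs` are plain lists of reference
    strings; the real abbrv.py works on richer tuples.
    """
    known = set(cleanrefs)
    kept = []
    for ref in candidates:
        stripped = re.sub('[.]S$', '', ref)
        if stripped != ref and stripped in known:
            # 'X' is already a clean reference, so '.S' is taken as 'Seite'.
            continue
        kept.append(ref)
    return kept


# Made-up data: 'ABC.S' goes because 'ABC' is known; 'XYZ.S' stays because
# its parent 'XYZ' is not in cleanrefs.
print(drop_seite_variants(["ABC.S", "XYZ.S", "DEF"], ["ABC", "DEF"]))
# prints ['XYZ.S', 'DEF']
```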

gasyoun commented 8 years ago

Nice one. You'll learn German by the end of this cleanup round.

drdhaval2785 commented 8 years ago

I always wonder why I am cleaning up a dictionary that I can't read. Then I look at Jim, and I feel that I am not alone in that boat.

gasyoun commented 8 years ago

And I look at both of you, reading German perfectly, and understand that I can't stay calm and sit there doing other cleanup jobs. Dhaval, one thing you must know: there is no bigger Sanskrit dictionary in the world than PWG (and PWK as its little brother). There will be no bigger one in the next 100-200 years, so you are not working in a museum. You are working on the real Poona dictionary, one that can be done in a lifetime.

funderburkjim commented 8 years ago

@drdhaval2785

Regarding Seite.

I had some difficulty syncing when I began working on PWK yesterday. Thus, as you've probably noticed, I added a big list of 'Seite' changes in abbrv.py (see #31).

As I now read your posting, it appears that I may have clobbered changes you made to abbrv.py to solve this problem.

Hopefully, my change is functionally the same as yours. If this causes a problem from your point of view, let's discuss further.

drdhaval2785 commented 8 years ago

My logic was to remove a trailing .S whenever the word without .S was in cleanrefs.txt. This decreases the chance of wrongly removing .S, because for abbreviations that legitimately end in .S the stripped counterpart would not be in cleanrefs.txt.

This was the generic way of doing it; yours was the enumeration way. I will compare the output of both and see what should be done.

funderburkjim commented 8 years ago

@drdhaval2785 Just to be clear, here are the lines of abbrv.py that you had written to deal with Seite:

    for (p,q,r,s) in cleanrefs:
        #if p not in ur1 and re.sub('[.]S$','',p) not in ur1:
        # Dec 31, 2015 (ejf). Removed the 'S' logic, as it
        # inhibits matching of 'KAP.S', for instance.
        if p not in ur1 :

As the note says, the specific reason for the change was that this code improperly removes '.S' from 'KAP.S' (there may be other abbreviations that legitimately end in '.S', I'm not sure). That's why I thought the enumeration approach was safer.
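The difference between the two approaches can be illustrated with a short sketch. The mapping `SEITE_CHANGES` and both function names are hypothetical, and 'ABC.S' is a made-up entry; only 'KAP.S' comes from the note above.

```python
import re

# Hypothetical enumeration: only references listed here lose their '.S'.
SEITE_CHANGES = {"ABC.S": "ABC"}


def generic_strip(ref):
    """Generic rule: strip any trailing '.S'."""
    return re.sub('[.]S$', '', ref)


def enumerated_strip(ref):
    """Enumeration rule: change only explicitly listed references."""
    return SEITE_CHANGES.get(ref, ref)


for ref in ["ABC.S", "KAP.S"]:
    print(ref, generic_strip(ref), enumerated_strip(ref))
# prints:
# ABC.S ABC ABC
# KAP.S KAP KAP.S
```

The generic rule turns 'KAP.S' into 'KAP', while the enumeration leaves it alone, at the cost of having to maintain the list by hand.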

drdhaval2785 commented 8 years ago

As long as either approach gives what we want, I agree. I agree with the enumeration approach. Please proceed with it.