non-english mw words

funderburkjim commented 2 years ago

This comment is one branch of #99.

By various means (which I'll describe below tomorrow), a list of 1509 words was developed which

- are composed just of normal a-zA-Z characters
- are confirmed to occur in mw (outside of markup)
- are not found in the consulted English dictionaries (en_US and en_GB of 'enchant')

funderburkjim commented 2 years ago

The end results are in two text files:

words_mw_noneng.txt shows each word and number of instances found in mw.
instance_mw_noneng.txt shows, for the same list, all the lines in MW where the word occurs.

Suggested first task

Although the Enchant dictionaries do not find these words as English words, nonetheless, a large portion of them look to actually be English words. Using Browser 'define: X' and/or https://www.merriam-webster.com/, some of these will be found directly. I suggest that the first task is to separate the words in words_mw_noneng.txt into two piles, depending on whether the word is found (with a plausible definition) in one of these online sources.

This is just a suggestion. What do you think @AnnaRybakovaT ? Good place to start? What else would you need from me to get started?

AnnaRybakovaT commented 2 years ago

Dear Jim, I am very glad to continue working under your guidance! During the day I will read everything and will try to start this task. If I have any questions, of course I will ask for your help.

AnnaRybakovaT commented 2 years ago

Dear Jim, the 1st task is more than clear. Just let me know how you would prefer me to separate the 2 groups of words. Does this way suit ("nf" - not found, "found" - the word exists in online sources)?

funderburkjim commented 2 years ago

@AnnaRybakovaT Your system of identification of the two cases looks consistent, and easy to work with. 👍 When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.

funderburkjim commented 2 years ago

Construction details

This is a documentation summary of the files constructed leading up to the two files mentioned above. As we proceed further with the analysis, some of the small details may be relevant.
The actual programmatic steps are detailed in the readme.

Start with unique words extracted.txt, which has 51298 lines, each containing a 'word' derived somehow from the MW digitization.
The above was separated into three parts:
- 3 words_arabic.txt
- 107 words_nonascii.txt words containing a non-ascii character
  - Some of these words need recoding in mw.txt
- 51188 (temp_words_00.txt) -- the remaining words.
The remaining words were analyzed into two parts:
- 24305 words_01.txt were those words consisting only of alphabetic characters ([a-zA-Z]) except for possible ending punctuation [.,;:?!]. That ending punctuation was removed, and then duplicates were removed (e.g. the words 'Arab' and 'Arab,' both resolve to 'Arab').
- 7825 words_other.txt words with non-alphabetic characters. In these, any ending punctuation was retained. There is room for further examination of sub-categories of these (numbers, hyphenated words, words beginning with a hyphen, probably other subcases).
words_01 was separated into two groups, based on whether the word was found in mw.txt.
- all mw text within markup was EXCLUDED when searching for words.
- specifically, in each line, a space character replaced any occurrence of these regexes:
  - '<ab.*?</ab>', '<s>.*?</s>', '<ls.*?</ls>', '<info.*?/>', '<bot>.*?</bot>','<hom>.*?</hom>', '<etym>.*?</etym>', '<lang.*?</lang>', '<lex.*?</lex>','<s1.*?</s1>'
  - then the resulting text was split into words by `re.split(r'\b',text)`
  - finally, each such word was tested to be in words_01.txt
- 23744 words_mw.txt words from words_01 and in mw.txt
- words_notmw.txt words from words_01 not found in mw.txt
  - These also should be examined further. This is partly where the relation between this definition of 'words in mw' and the definition used to derive starting set unique words extracted.txt is important.
  - For example, why does the starting set of words contain bypersons as a word but this is not found in mw by the current analysis?
The list words_mw.txt is then divided into three pieces, depending on whether a word is identified as English or not. As mentioned, this determination relies on two of the enchant English dictionaries:
- 21954 words_mw_US.txt using the 'en_US' dictionary
- 281 words_mw_GB.txt additional words identified as English using the 'en_GB' dictionary
- 1509 words_mw_noneng.txt the remaining words of words_mw.txt
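
The two splitting steps in the summary above (alphabetic words vs. other, then en_US / en_GB / non-English) can be sketched roughly as follows. This is only an illustration, not the code from the readme: the function names are made up here, the dictionary checks are passed in as plain callables standing in for the enchant lookups, and stripping more than one trailing punctuation character is an assumption.

```python
import re

# Trailing punctuation set taken from the description of words_01.txt.
ENDING_PUNCT = ".,;:?!"
ALPHA_ONLY = re.compile(r"[a-zA-Z]+")


def split_alpha_and_other(raw_words):
    """Return (words_01, words_other) for a list of raw 'words'.

    Hypothetical helper: strips ending punctuation, keeps purely
    alphabetic results (deduplicated), and leaves everything else as-is.
    """
    words_01 = set()  # a set removes duplicates such as 'Arab' and 'Arab,'
    words_other = []
    for raw in raw_words:
        stripped = raw.rstrip(ENDING_PUNCT)
        if ALPHA_ONLY.fullmatch(stripped):
            words_01.add(stripped)
        else:
            words_other.append(raw)  # ending punctuation is retained here
    return sorted(words_01), words_other


def partition_by_dictionary(words, is_us, is_gb):
    """Three-way split: en_US words, additional en_GB words, the remainder.

    is_us and is_gb are stand-in predicates (word -> bool); in the thread
    the check is done with the enchant 'en_US' and 'en_GB' dictionaries.
    """
    us, gb, noneng = [], [], []
    for word in words:
        if is_us(word):
            us.append(word)
        elif is_gb(word):
            gb.append(word)
        else:
            noneng.append(word)
    return us, gb, noneng
```

As a consistency check on the counts quoted above: 3 + 107 + 51188 = 51298 and 21954 + 281 + 1509 = 23744.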

AnnaRybakovaT commented 2 years ago

When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.

Dear Jim, what do you think: might it be better to add some comments on the "not found" words from the beginning? So far I see 3 categories:

- not of English origin: names, plants, geographical names
- misspelled English words (I give the correct spelling in a comment)
- rare cases which need deeper investigation

This is only my suggestion. If it is better to do the first step as you described above (I mean, to add only "found" and "nf"), I will do it that way.

funderburkjim commented 2 years ago

Adding those extra comments to the 'nf' is fine, since it will help in the next step of further analysis of the nf.

gasyoun commented 2 years ago

Seems @AnnaRybakovaT is back where she belongs, thanks @funderburkjim for the guidance.

AnnaRybakovaT commented 2 years ago

Dear Jim, I am still working with the file words_mw_noneng.txt. The third part is ready (you can see the temporary results in the file words_mw_noneng_temp.txt). If you have any comments, please let me know (I will try to take them into account and include them in further analysis).

AnnaRybakovaT commented 2 years ago

https://github.com/sanskrit-lexicon/MWS/blob/master/mws_issue_99/apps/unique_eng/words_mw_noneng_temp.txt

funderburkjim commented 2 years ago

Hi, Anna -- we must have been communicating telepathically, as I was thinking 'Where is Anna?' earlier today! Will take a look at what you've been doing in the next day or two.

gasyoun commented 2 years ago

we must have been communicating telepathically

Indeed. I heard your question one day before you heard and asked Anna to push what she has. The task is big, so I proposed she split it into parts. Let's have our annual call on 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ? Last time it was around 12:00 Moscow time, or?

funderburkjim commented 2 years ago

Noon Moscow would be 8PM in New York (my time zone). That time ok with me.

I suggest one discussion point be how to proceed with less from me. I want to spend considerably more time on (a) improving my Sanskrit literacy, (b) a long-standing mathematics project ignored for almost 4 years now. There is a huge backlog of sanskrit-lexicon tasks that are currently assigned to me. I aim to address these, but at a less intensive pace.* Perhaps others will adopt some of these tasks, or perhaps others may wish to move the sanskrit-lexicon project into new directions. It will be interesting to see how things unfold.

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.

Andhrabharati commented 2 years ago

words_notmw.txt words from words_01 not found in mw.txt

@funderburkjim

Many of these words (if not all) could be traced in the mw text, by regex searching for the word followed by [^.], i.e., xxx[^.]

You seem to have missed some of these, as you had removed the ending punctuation mark!!

As such, you may want to update your lists above after checking.

funderburkjim commented 2 years ago

@AnnaRybakovaT Thoughts looking at 'words_mw_noneng_temp.txt'

You have apparently been looking at instances in mw.txt also. Your marking of 'typo' is excellent, as this will indicate corrections that need to be made. Noticed 'Vallasor' also needs to be marked typo.
You have also marked 20 or so as 'print change'. It is less certain how to handle these, but fine that you marked them. For example, consider Carroway 3; nf; plant "Caraway" (print change) . Maybe we should invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place. [There might be a better name for the tag 'probably']. There are a few different types of print (such as 'cornifex') that apparently have other issues besides spelling.
The brief comments by many 'nf' I find good, such as Catarkot 1; nf; geographical name "Chatarkot". Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found ?
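
For later steps it may help to read the annotated lines mechanically. The sketch below assumes the 'word count; status; comment' layout seen in the examples above (e.g. Cambay 1; found); the names Annotation, parse_annotation and probably_markup are invented for illustration, and the probably tag is only the proposal mentioned above, not existing markup.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    word: str
    count: int
    status: str   # 'found' or 'nf'
    comment: str  # free text after the status, possibly empty


def parse_annotation(line):
    """Parse one line such as 'Carroway 3; nf; plant "Caraway" (print change)'."""
    head, _, rest = line.partition(";")
    word, count = head.rsplit(None, 1)       # 'Carroway 3' -> 'Carroway', '3'
    status, _, comment = rest.partition(";")
    return Annotation(word, int(count), status.strip(), comment.strip())


def probably_markup(word, suggestion):
    """Render the proposed (hypothetical) tag, leaving the printed spelling in place."""
    return f'<probably n="{suggestion}">{word}</probably>'
```

For example, parse_annotation('Cambay 1; found') gives an empty comment, which is one way to list the 'found' words that still lack an explanation.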

We should probably somehow make use of the accepted words (i.e., those whose spelling we decide to leave unchanged) in mw.txt. For example, the word 'Capricornus' appears in AP90.txt and is one that you 'found'. If we do a similar study of words in AP90, then we should build on your work, and therefore accept 'Capricornus' as ok, even though it is not among the Enchant English words. [Note 'Carroway' also in AP90].

It seems you have examined about 56% of the cases. Keep going!

funderburkjim commented 2 years ago

@Andhrabharati re words_notmw.txt

Note that within my analysis (see Construction details note above) all mw text within markup was EXCLUDED when searching for words

For example 'Acacia' appears in words_notmw.txt. Within mw.txt, this word DOES occur 113 times, but always within a 'bot' element, e.g. <bot>Acacia Sirissa</bot>.
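
A rough illustration of why 'Acacia' drops out, assuming the exclusion step works as described in the Construction details note (markup replaced by a space, then a split on word boundaries). The function name is made up here, and the final filter on [a-zA-Z] is a simplification of the test against words_01.txt.

```python
import re

# Markup patterns listed in the Construction details note.
MARKUP_PATTERNS = [
    r"<ab.*?</ab>", r"<s>.*?</s>", r"<ls.*?</ls>", r"<info.*?/>",
    r"<bot>.*?</bot>", r"<hom>.*?</hom>", r"<etym>.*?</etym>",
    r"<lang.*?</lang>", r"<lex.*?</lex>", r"<s1.*?</s1>",
]


def words_outside_markup(text):
    """Return the set of plain a-zA-Z words left after removing markup."""
    for pattern in MARKUP_PATTERNS:
        text = re.sub(pattern, " ", text)
    # Splitting on the empty match r'\b' requires Python 3.7 or later.
    pieces = re.split(r"\b", text)
    return {p for p in pieces if re.fullmatch(r"[a-zA-Z]+", p)}
```

With this sketch, '<bot>Acacia Sirissa</bot>' yields no words at all, while a body such as 'Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls>' would still yield 'Acacia' and 'or'.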

Andhrabharati commented 2 years ago

Yes, checked that they are all marked now; but they weren't at the time of my working those days (during March 2021).

These are the 4 lines from the mw_iast.txt (dt 04.04.21) by you, which was the last one I had considered (after which I stopped tracking the mw, and shifted to other works)-

<L>44900<pc>257,1<k1>karṇamoṭā<k2>kárṇa—moṭā<e>3 <s>kárṇa—moṭā</s> ¦ <lex>f.</lex> Acacia arabica, <ls>L.</ls><info lex="f"/> <LEND>
<L>46461<pc>264,1<k1>kavarī<k2>kavarī<e>1B ¦ Acacia arabica or another plant, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>85434<pc>448,3<k1>tīkṣṇakaṇṭaka<k2>tīkṣṇá—kaṇṭaka<e>3A ¦ Acacia arabica, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>148230<pc>745,3<k1>bhaṇḍila<k2>bhaṇḍila<e>3A ¦ Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls><info lex="inh"/> <LEND>

Anyway, there are only about 500 words in "words_notmw.txt", so it is not a big issue to discuss further. [All of those might have been updated since then.]

AnnaRybakovaT commented 2 years ago

Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found ?

Dear Jim, Many thanks for your comments. Now I am more confident that everything is going well. Regarding the obscure 'found' words - I can double check and write short explanations.

Andhrabharati commented 2 years ago

Let's have our annual call on 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ?

What would be the agenda, @gasyoun? And do you think I have a role to "play"?

gasyoun commented 2 years ago

What would be the agenda

One does not know in advance.

And do you think I have a role to "play"?

Yes, it will increase in 2022-2032.

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG

Sounds like a plan.

long-standing mathematics project ignored for almost 4 years now

Can I send you a mathematician to help out so you can ignore it even longer?

spend considerably more time on (a) improving my Sanskrit literacy

As for Sanskrit literacy - may I know what exactly you want to read?

invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place

Exactly, a kind of ghostword or newEnglish. But since we have German dictionaries with the same issues, could ghostword be used?

accept 'Capricornus' as ok, even though it is not among the Enchant English words

Exactly.

Regarding the obscure 'found' words - I can double check and write short explanations.

So glad @AnnaRybakovaT is back - not only beautiful, but smart and hard working she is.

funderburkjim commented 2 years ago

what do you want to read?

For starters, Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses -- I would like to be able to dip into any of these and sight read with ease.

Andhrabharati commented 2 years ago

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.

@funderburkjim I would like you to include Ramayana and Mahabharata as link targets, which are some of the major ones; and SCH for ls markup, as it goes with pwk and PWG as a set; and then take a break.

I am presently working on SCH and likely to be posting the results, before this month ending.

AnnaRybakovaT commented 2 years ago

words_mw_noneng.txt shows each word and number of instances found in mw.

Dear Jim, Finally I have finished analyzing this file. The results are contained in the file: https://github.com/sanskrit-lexicon/MWS/blob/master/mws_issue_99/apps/unique_eng/words_mw_noneng_1.txt

Addendum to Anna's comment of Jan 24, 2022 (Jim) Anna's file was renamed (01-22-2024) to

words_mw_noneng_1.txt

gasyoun commented 2 years ago

I am presently working on SCH and likely to be posting the results, before this month ending.

May you never feel weakness.

Finally I have finished analyzing this file. The results are contained in the file:

Absolutely impressed.

Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses

It's good you started with Kale. Indishe Spruch are mostly hard to understand, as is sometimes Bhagavad Gita. Peter's Ramopakhyana is interesting, but still more advanced than Lanman reader stories. It's good you started with Kale.

Andhrabharati commented 2 years ago

Good work done, @AnnaRybakovaT; you indeed are a smart worker as @gasyoun mentioned above.

Just seen that there are some missings and errors in your file, and I'm sure @funderburkjim would be reviewing them all over before incorporating them into Cologne files.

Here are a few quick ones-

Galmei 1;   nf  German word for Calamine
Habush 2;   nf  a plant name in Bengali; look at the SKD entry हपुषा.
Mooltan 1;  nf  a place name (Multān)
annumeration 2; nf  Addition to a former number (Webster's)
antiphlegmatic 2;   nf  anti-phlegmatic (used to reduce phlegm)
nonne 1;    nf      a Latin word used in interrogation
-----------------
Chandoiu 1; nf; //looks like a Sanskrit word// this is a typo for Chandom. (abbr. for Chandomanjari)

AnnaRybakovaT commented 2 years ago

Just seen that there are some missings and errors in your file

Thanks a lot for checking and for explaining the missing cases (I had no idea what they could be)!!!

Andhrabharati commented 2 years ago

@funderburkjim

Would you mind regenerating the "latest" iast and deva files for the mw.txt?

I have noticed quite a few issues that need corrections, and thought of doing a complete proofing once for all. This time, I estimate a time-frame of about 6-8 months for the full proofing.

Hope to see your response soon on this.

Andhrabharati commented 2 years ago

@drdhaval2785

Would you be interested to do this [as @funderburkjim is either not interested in this proposal, or did not "see" this above post yet (being busy on PWG ls working)]?

Or else, I will take up some other big work for a long term, starting a few days from now.

drdhaval2785 commented 2 years ago

You want new devanagari files, I can. I am not sure about IAST though.

drdhaval2785 commented 2 years ago

https://github.com/sanskrit-lexicon/csl-devanagari/blob/main/v02/mw/mw.txt is the latest MW Devanagari version.

Andhrabharati commented 1 year ago

In the last file by @AnnaRybakovaT at the https://github.com/sanskrit-lexicon/MWS/issues/127#issuecomment-1020585189, both

Rakshases 2; nf; Rakshasas & Ushases 1; nf; Ushas

are proper in the text, being the plural of Rakshas & Ushas respectively, and no change required in those words. Hope @funderburkjim would take this into account, while he 'works' on this file he has copied elsewhere.

Andhrabharati commented 5 months ago

@funderburkjim

I had seen you copying Anna's work after a gap of 6 months; and now another year-and-half has elapsed. Hope you might consider looking into her file and act upon the same, sometime sooner.

funderburkjim commented 5 months ago

@Andhrabharati Am taking up review of words_mw_noneng_1.txt.

funderburkjim commented 5 months ago

processing of nonenglish words.

Work directory is unique_eng.

words_mw_noneng_2.txt has my annotations of @AnnaRybakovaT's file words_mw_noneng_1.txt.
- My comments are marked ';; xxxx'.
- Items generating a change in mw.txt are indicated by ';; 2024 ...'.
- About 200 lines of mw.txt changed. See also changes_2.txt or the csl-orig commit above.
- About 40 of these were marked as print changes ('PRINT CHANGE') and were posted to mw_printchange.txt (see csl-corrections commit above).
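
A minimal sketch of how lines carrying these markers could be picked out, assuming each ';;' starts one review comment and that a comment beginning with '2024' signals a change to mw.txt. The helper names are hypothetical and not taken from the work directory.

```python
def split_review(line):
    """Split a line into the original annotation and the ';;' review comments."""
    original, *comments = line.split(";;")
    return original.strip(), [c.strip() for c in comments]


def classify(line):
    """Return (changes_mw, is_print_change) for one annotated line."""
    _, comments = split_review(line)
    changes_mw = any(c.startswith("2024") for c in comments)
    is_print_change = any("PRINT CHANGE" in c for c in comments)
    return changes_mw, is_print_change
```

Only the review comments are inspected, so a lower-case "print change" in the original annotation does not count as a 'PRINT CHANGE' marker.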

For a few old words, these were useful:

For Latin words, sometimes this was useful: https://www.online-latin-dictionary.com/latin-english-dictionary.php

funderburkjim commented 5 months ago

Further research and usage

There is a lot of good information in the research by @AnnaRybakovaT and @Andhrabharati. Not clear where to put it so that it may be available when needed another time. Maybe where @drdhaval2785 has put his word studies.

Andhrabharati commented 5 months ago

@funderburkjim

Though you have mentioned that (Anna's and) my 'research' contained some good info, you had ignored/skipped this post above.

Andhrabharati commented 5 months ago

A quick looking into the 40 print-changes prompted me to comment thus--

cerebralisation 1; nf; cerebralization (typo);; 2024 correction L=110300 niveSa PRINT CHANGE

;; AB there are few more cases of such 's-z' variants-- realization (5) vs. realisation (4); cauterization (4) vs. cauterisation (1)
;; AB these American and British spelling variations may be seen throughout the MW text (see for e.g. courtezan, courtesan)
;; AB thus. I feel that this particular 'print-change' correction is to be reverted back.
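
One way such 's-z' pairs could be listed from a word-count table is sketched below. It is a crude illustration under the assumption that the variants differ only by 'iz' vs. 'is'; the function name is invented, and pairs like courtezan/courtesan would need a separate rule.

```python
def s_z_variant_pairs(counts):
    """List (z_word, z_count, s_word, s_count) for words present in both spellings.

    counts maps each word to its number of instances.
    """
    pairs = []
    for word in sorted(counts):
        if "iz" in word:
            variant = word.replace("iz", "is")
            if variant in counts:
                pairs.append((word, counts[word], variant, counts[variant]))
    return pairs


# Counts as quoted in the comment above.
sample = {
    "realization": 5, "realisation": 4,
    "cauterization": 4, "cauterisation": 1,
    "cerebralisation": 1,
}
```

Here s_z_variant_pairs(sample) pairs up the realization and cauterization spellings, while 'cerebralisation' has no 'z' partner in the sample and is left out.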

Andhrabharati commented 5 months ago

Another info, that I wanted to present here--

anum 1; nf; maybe "per annum" (in this case - print change) ;; no change. anum in pw, but otherwise not found

This does not indicate "per annum" as Anna thought; for the context (there are some more places that pw has used "per anum") seems to mean "from/by anus", anum being the inflected form of Anus (Latin word).

funderburkjim commented 5 months ago

@Andhrabharati Revised per your comment(s). For details, see commits above.

Andhrabharati commented 5 months ago

I presumed that these two plurals also would/should be marked, as <ns>Aṅgirases</ns> was.

funderburkjim commented 5 months ago

@Andhrabharati <ns> markup added. See commits above.

sanskrit-lexicon / MWS

non-english mw words #127
