Closed funderburkjim closed 5 months ago
The end results are in two text files:
Although the Enchant dictionaries do not find these words as English words, nonetheless, a large portion of them look to actually be English words. Using Browser 'define: X' and/or https://www.merriam-webster.com/, some of these will be found directly. I suggest that the first task is to separate the words in words_mw_noneng.txt into two piles, depending on whether the word is found (with a plausible definition) in one of these online sources.
This is just a suggestion. What do you think @AnnaRybakovaT ? Good place to start? What else would you need from me to get started?
Dear Jim, I am very glad to continue working under your guidance! During the day I will read everything and will try to start this task. If I have any questions, of course I will ask for your help.
Dear Jim,
The 1st task - everything is more than clear. Only let me know how you would prefer me to separate the 2 groups of words. Does this way suit ("nf" - not found, "found" - the word exists in online sources)?
@AnnaRybakovaT Your system of identification of the two cases looks consistent, and easy to work with. When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.
This is a documentation summary of the files constructed leading up to the two files mentioned above.
As we proceed further with the analysis, some of the small details may be relevant.
The actual programmatic steps are detailed in the readme,
[.,;:?!]
; and that ending punctuation removed, and duplicates removed (e.g. the words 'Arab' and 'Arab,' both resolve to 'Arab')
'<ab.*?</ab>', '<s>.*?</s>', '<ls.*?</ls>', '<info.*?/>', '<bot>.*?</bot>','<hom>.*?</hom>', '<etym>.*?</etym>', '<lang.*?</lang>', '<lex.*?</lex>','<s1.*?</s1>'
unique words extracted.txt
is important. bypersons
as a word but this is not found in mw by the current analysis?
When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.
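A rough sketch of the construction details summarized above (text inside the listed markup is excluded, ending punctuation is removed, duplicates are removed). The function names and the sample line are illustrative assumptions; the actual programs are the ones described in the readme and may differ.

```python
import re

# Markup patterns quoted in the construction details; text inside them is skipped.
EXCLUDED_MARKUP = [
    r'<ab.*?</ab>', r'<s>.*?</s>', r'<ls.*?</ls>', r'<info.*?/>',
    r'<bot>.*?</bot>', r'<hom>.*?</hom>', r'<etym>.*?</etym>',
    r'<lang.*?</lang>', r'<lex.*?</lex>', r'<s1.*?</s1>',
]
# Ending punctuation class quoted in the construction details.
ENDING_PUNCT = re.compile(r'[.,;:?!]+$')


def extract_words(line):
    """Return the words of one line, ignoring text inside excluded markup."""
    for pattern in EXCLUDED_MARKUP:
        line = re.sub(pattern, ' ', line)
    # Assumption: any other leftover tags are dropped as well.
    line = re.sub(r'<[^>]*>', ' ', line)
    words = []
    for token in line.split():
        word = ENDING_PUNCT.sub('', token)
        if word:
            words.append(word)
    return words


def unique_words(lines):
    """Collect each word once, in order of first appearance."""
    seen = set()
    ordered = []
    for line in lines:
        for word in extract_words(line):
            if word not in seen:
                seen.add(word)
                ordered.append(word)
    return ordered


sample = 'an Arab, or <bot>Acacia Sirissa</bot> of the Arab'
print(unique_words([sample]))
```

With this sketch 'Arab,' and 'Arab' resolve to the same word, and 'Acacia' is not collected because it only occurs inside a bot element in the sample.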
Dear Jim, What do you think, maybe better from the beginning to add some comments regarding "not found" words? Now I see 3 categories:
This is only my suggestion. If it is better to make the 1st step as you described above (I mean - to add only "found" and "nf"), I will do it this way.
Adding those extra comments to the 'nf' is fine, since it will help in the next step of further analysis of the nf.
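For later processing, the annotated lines could be read back into the two piles with something like the sketch below. It assumes the layout `word count; status; comment` that appears in the examples quoted in this thread (e.g. `Cambay 1; found`); the names `Annotated`, `parse_line` and `split_piles` are hypothetical, not part of the repository.

```python
from typing import NamedTuple


class Annotated(NamedTuple):
    word: str
    count: int      # number of instances found in mw
    status: str     # 'found' or 'nf'
    comment: str    # optional free-text note, '' if absent


def parse_line(line):
    """Parse one line of the assumed form 'word count; status; comment'."""
    head, _, rest = line.partition(';')
    word, count = head.rsplit(None, 1)
    status, _, comment = rest.partition(';')
    return Annotated(word, int(count), status.strip(), comment.strip())


def split_piles(lines):
    """Group parsed records by status ('found' / 'nf')."""
    piles = {'found': [], 'nf': []}
    for line in lines:
        record = parse_line(line)
        piles.setdefault(record.status, []).append(record)
    return piles


lines = [
    'Carroway 3; nf; plant "Caraway" (print change)',
    'Cambay 1; found',
]
piles = split_piles(lines)
print(len(piles['found']), len(piles['nf']))
```

Lines that omit the second semicolon would need extra handling; the sketch does not attempt that.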
Seems @AnnaRybakovaT is where she belongs to again, thanks @funderburkjim for the guidance.
Dear Jim, I am still working with the file words_mw_noneng.txt. The third part is ready (you can see the temporary results in the file words_mw_noneng_temp.txt). If you have any comments, please let me know (I will try to take them into account and include them in further analysis).
Hi, Anna -- we must have been communicating telepathically, as I was thinking 'Where is Anna?' earlier today! Will take a look at what you've been doing in the next day or two.
we must have been communicating telepathically
Indeed. I heard your question one day before you heard and asked Anna to push what she has. The task is big, so I proposed she splits it into parts. Let's have our annual call on 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ? Last time it was around 12:00 Moscow time, or?
Noon Moscow would be 8PM in New York (my time zone). That time ok with me.
I suggest one discussion point be how to proceed with less from me. I want to spend considerably more time on (a) improving my Sanskrit literacy, (b) a long-standing mathematics project ignored for almost 4 years now. There is a huge backlog of sanskrit-lexicon tasks that are currently assigned to me. I aim to address these, but at a less intensive pace. Perhaps others will adopt some of these tasks, or perhaps others may wish to move the sanskrit-lexicon project into new directions. It will be interesting to see how things unfold.
- words_notmw.txt words from words_01 not found in mw.txt
@funderburkjim
Many of these words (if not all) could be traced in the mw text, by regex searching for the word followed by [^.], i.e., xxx[^.]
You seem to have missed some of these, as you had removed the ending punctuation mark!!
As such, you may update the (above) lists by you, after checking.
@AnnaRybakovaT Thoughts looking at 'words_mw_noneng_temp.txt'
Carroway 3; nf; plant "Caraway" (print change). Maybe we should invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place. [There might be a better name for the tag 'probably'].
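To make the tooltip idea concrete, here is a minimal sketch of how a display could render such a tag. Everything in it is an assumption for illustration: the 'probably' tag is only a proposal, and the HTML output (a span with a title attribute and a class name) is a hypothetical rendering, not how the Cologne displays currently work.

```python
import html
import re

# Proposed (hypothetical) markup: <probably n="Caraway">Carroway</probably>
PROBABLY = re.compile(r'<probably n="([^"]*)">(.*?)</probably>')


def render_probably(text):
    """Keep the printed spelling, expose the suggested reading as a tooltip."""
    def repl(match):
        suggestion = html.escape(match.group(1), quote=True)
        printed = match.group(2)
        return '<span class="probably" title="{}">{}</span>'.format(suggestion, printed)
    return PROBABLY.sub(repl, text)


print(render_probably('seeds of <probably n="Caraway">Carroway</probably>'))
```

The printed text 'Carroway' stays in place and the suggested reading only appears on hover.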
There are a few different types of print (such as 'cornifex') that apparently have other issues besides spelling.
Catarkot 1; nf; geographical name "Chatarkot".
Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found?
We should probably somehow make use of the accepted words (i.e., those whose spelling we decide to leave unchanged) in mw.txt. For example, the word 'Capricornus' appears in AP90.txt and is one that you 'found'. If we do a similar study of words in AP90, then we should build on your work, and therefore accept 'Capricornus' as ok, even though it is not among the Enchant English words. [Note 'Carroway' also in AP90].
It seems you have examined about 56% of the cases. Keep going!
@Andhrabharati re words_notmw.txt
Note that within my analysis (see Construction details note above)
all mw text within markup was EXCLUDED when searching for words
For example 'Acacia' appears in words_notmw.txt. Within mw.txt, this word DOES occur 113 times,
but always within a 'bot' element, e.g. <bot>Acacia Sirissa</bot>.
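The distinction can be checked with a small count of occurrences inside and outside a given element. This is only a sketch: `count_in_and_out` is a hypothetical helper, and the sample string is made up rather than taken from mw.txt.

```python
import re


def count_in_and_out(word, text, tag='bot'):
    """Return (inside, outside): occurrences of word within <tag>...</tag> and elsewhere."""
    element = re.compile(r'<{0}>.*?</{0}>'.format(re.escape(tag)))
    word_re = re.compile(r'\b' + re.escape(word) + r'\b')
    inside = sum(len(word_re.findall(m.group(0))) for m in element.finditer(text))
    total = len(word_re.findall(text))
    return inside, total - inside


sample = 'Acacia or <bot>Mimosa Sirissa</bot>, <bot>Acacia Sirissa</bot>'
print(count_in_and_out('Acacia', sample))
```

A word whose outside count is zero is reported as not found by an analysis that excludes markup, even though a plain text search still finds it.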
Yes, checked that they are all marked now; but they weren't at the time of my working those days (during March 2021).
These are the 4 lines from the mw_iast.txt (dt 04.04.21) by you, which was the last one I had considered (after which I stopped tracking the mw, and shifted to other works)-
<L>44900<pc>257,1<k1>karṇamoṭā<k2>kárṇa—moṭā<e>3 <s>kárṇa—moṭā</s> ¦ <lex>f.</lex> Acacia arabica, <ls>L.</ls><info lex="f"/> <LEND>
<L>46461<pc>264,1<k1>kavarī<k2>kavarī<e>1B ¦ Acacia arabica or another plant, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>85434<pc>448,3<k1>tīkṣṇakaṇṭaka<k2>tīkṣṇá—kaṇṭaka<e>3A ¦ Acacia arabica, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>148230<pc>745,3<k1>bhaṇḍila<k2>bhaṇḍila<e>3A ¦ Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls><info lex="inh"/> <LEND>
Anyways, there are just about 500 words in the "words_notmw.txt", and it is not a big issue to discuss more. [All those might have got updated in the later days.]
- Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found?
Dear Jim, Many thanks for your comments. Now I am more confident that everything is going well. Regarding the obscure 'found' words - I can double check and write short explanations.
Let's have our annual call on 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ?
What would be the agenda, @gasyoun? And do you think I have a role to "play"?
What would be the agenda
One does not know in advance.
And do you think I have a role to "play"?
Yes, it will increase in 2022-2032.
Getting Indishe spruch (boesp) into a link target for PW(K) and PWG
Sounds like a plan.
long-standing mathematics project ignored for almost 4 years now
Can I send you a mathematician to help out so you can ignore it even longer?
spend considerably more time on (a) improving my Sanskrit literacy
As for Sanskrit literacy - may I know what exactly you want to read?
invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place
Exactly, kind of ghostword or newEnglish. But as we have German dictionaries with the same issues, could ghostword be used?
accept 'Capricornus' as ok, even though it is not among the Enchant English words
Exactly.
Regarding the obscure 'found' words - I can double check and write short explanations.
So glad @AnnaRybakovaT is back - not only beautiful, but smart and hard working she is.
what do you want to read?
For starters, Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses -- I would like to be able to dip into any of these and sight read with ease.
- Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.
@funderburkjim I would like you to include Ramayana and Mahabharata as link targets, which are some of the major ones; and SCH for ls markup, as it goes with pwk and PWG as a set; and then take a break.
I am presently working on SCH and likely to be posting the results, before this month ending.
- words_mw_noneng.txt shows each word and number of instances found in mw.
Dear Jim, Finally I have finished analyzing this file. The results are contained in the file: https://github.com/sanskrit-lexicon/MWS/blob/master/mws_issue_99/apps/unique_eng/words_mw_noneng_1.txt
Addendum to Anna's comment of Jan 24, 2022 (Jim) Anna's file was renamed (01-22-2024) to
I am presently working on SCH and likely to be posting the results, before this month ending.
May you never feel weakness.
Finally I have finished analyzing this file. The results are contained in the file:
Absolutely impressed.
Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses
It's good you started with Kale. Indishe Spruch are mostly hard to understand, as is sometimes Bhagavad Gita. Peter's Ramopakhyana is interesting, but still more advanced than Lanman reader stories.
Good work done, @AnnaRybakovaT; you indeed are a smart worker as @gasyoun mentioned above.
Just seen that there are some omissions and errors in your file, and I'm sure @funderburkjim would be reviewing them all over before incorporating them into Cologne files.
Here are a few quick ones-
Galmei 1; nf German word for Calamine
Habush 2; nf a plant name in Bengali; look at the SKD entry चपŕĽŕ¤ˇŕ¤ž.
Mooltan 1; nf a place name (Multān)
annumeration 2; nf Addition to a former number (Webster's)
antiphlegmatic 2; nf anti-phlegmatic (used to reduce phlegm)
nonne 1; nf a Latin word used in interrogation
-----------------
Chandoiu 1; nf; //looks like a Sanskrit word// this is a typo for Chandom. (abbr. for Chandomanjari)
Just seen that there are some omissions and errors in your file
Thanks a lot for your checking and explanation of the missing cases (I had no idea what they could be)!!!
@funderburkjim
Would you mind regenerating the "latest" iast and deva files for the mw.txt?
I have noticed quite a few issues that need corrections, and thought of doing a complete proofing once and for all. This time, I estimate a time-frame of about 6-8 months for the full proofing.
Hope to see your response soon on this.
@drdhaval2785
Would you be interested to do this [as @funderburkjim is either not interested in this proposal, or did not "see" this above post yet (being busy on PWG ls working)]?
Or else, I will take up some other big work for a long term, starting a few days from now.
You want new devanagari files, I can. I am not sure about IAST though.
https://github.com/sanskrit-lexicon/csl-devanagari/blob/main/v02/mw/mw.txt is the latest MW Devanagari version.
In the last file by @AnnaRybakovaT at the https://github.com/sanskrit-lexicon/MWS/issues/127#issuecomment-1020585189, both
Rakshases 2; nf; Rakshasas & Ushases 1; nf; Ushas
are proper in the text, being the plural of Rakshas & Ushas respectively, and no change required in those words. Hope @funderburkjim would take this into account, while he 'works' on this file he has copied elsewhere.
@funderburkjim
I had seen you copying Anna's work after a gap of 6 months; and now another year-and-half has elapsed. Hope you might consider looking into her file and act upon the same, sometime sooner.
@Andhrabharati Am taking up review of words_mw_noneng_1.txt.
Work directory is unique_eng.
For a few old words, these were useful:
For Latin words, sometimes this was useful: https://www.online-latin-dictionary.com/latin-english-dictionary.php
There is a lot of good information in the research by @AnnaRybakovaT and @Andhrabharati. Not clear where to put it so that it may be available when needed another time. Maybe where @drdhaval2785 has put his word studies.
@funderburkjim
Though you have mentioned that (Anna's and) my 'research' contained some good info, you had ignored/skipped this post above.
A quick looking into the 40 print-changes prompted me to comment thus--
cerebralisation 1; nf; cerebralization (typo);; 2024 correction L=110300 niveSa PRINT CHANGE
;; AB there are a few more cases of such 's-z' variants-- realization (5) vs. realisation (4); cauterization (4) vs. cauterisation (1) ;; AB these American and British spelling variations may be seen throughout the MW text (see for e.g. courtezan, courtesan) ;; AB thus, I feel that this particular 'print-change' correction should be reverted.
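Such 's-z' pairs could be listed mechanically before deciding on any print change. The sketch below assumes a dictionary of word counts; the helper name `s_z_variant_pairs` is hypothetical, and the sample counts are just the ones quoted in this comment.

```python
def s_z_variant_pairs(counts):
    """Find words in -isation whose -ization spelling also occurs.

    counts maps word -> number of instances. Returns a sorted list of
    (s_spelling, s_count, z_spelling, z_count) tuples.
    """
    pairs = []
    for word in sorted(counts):
        if 'isation' in word:
            z_form = word.replace('isation', 'ization')
            if z_form in counts:
                pairs.append((word, counts[word], z_form, counts[z_form]))
    return pairs


# Counts as quoted in the comment above (illustrative subset only).
counts = {
    'realization': 5, 'realisation': 4,
    'cauterization': 4, 'cauterisation': 1,
}
for s_form, s_count, z_form, z_count in s_z_variant_pairs(counts):
    print(s_form, s_count, 'vs.', z_form, z_count)
```

When both spellings occur, that suggests an accepted variant rather than a typo.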
Another info, that I wanted to present here--
anum 1; nf; maybe "per annum" (in this case - print change) ;; no change. anum in pw, but otherwise not found
This does not indicate "per annum" as Anna thought; for the context (there are some more places that pw has used "per anum") seems to mean "from/by anus", anum being the inflected form of Anus (Latin word).
@Andhrabharati Revised per your comment(s). For details, see commits above.
I presumed that these two plurals also would/should be marked, as <ns>Aṅgirases</ns> was.
@Andhrabharati <ns> markup added. See commits above.
This comment is one branch of #99.
By various means (which I'll describe below tomorrow), a list of 1509 words was developed which