sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

cf. accord. to some #45

Open gasyoun opened 6 years ago

gasyoun commented 6 years ago

← (H2) hirā b [p= 1299] : f. a vein, artery (cf. hitā ami sirā) AV. VS. ; Gmelina Arborea L. [cf. accord. to some, Lat. haru(-spex).] [L=263148]

accord. is tagged as an abbreviation 36 times (to some, to others, to Dhatup, to Say.), but should be nothing, I guess.

Eng. stop, stump. -> Eng. stop, stump. (now stump non-marked = equals Sanskrit word)

There are 896 entries in MW that contain at least a single etymology. I have checked from end, from hlAd to stuB. When issues noted in

will be fixed, I will check all the entries left, @funderburkjim

funderburkjim commented 6 years ago

accord. [as abbreviation]

It is an abbreviation for 'according'. So seems appropriate to tag as <ab>.

gasyoun commented 6 years ago

It is an abbreviation for 'according'. So seems appropriate to tag as <ab>.

@SergeA , agree?

SergeA commented 6 years ago

Sure. MW: accord. = according.

BTW, is there a special thread for MW's abbreviations? When I took a look at the headword kṛ for a test, I found there are many abbrs without tool-tipping. Some of them are given in MW list, others are not.

&c (everywhere with a typo, the period is lost) for "&c. " (= etc., Lat. et cetera "and other things, and so on")
cl. (MW: = class) + in combination "cl.1 P." the letter "P" is not tool-tipped as Parasmaipada as in other cases -- perhaps some bug of markup.
Class. (MW: Classical)
ff. (= and the following << Lat. folio)
ib. (MW: ibidem or 'in the same place or book or text' as the preceding.)
ind.p. (= indeclinable participle) (MW lists separately abbrs ind. & p.)
Inf. (MW: infinitive mood)
perf. (= perfect)
pr. (= present)
pr. p. (= present participle)
Prec. (MW: precative)
reflex. (MW: Reflexive or used reflexively)
sg. (= singular number)
ss.vv. (MW: s. v. = sub voce, i.e. the word in the Sanskrit order)
Ved. (MW: Vedic or Veda)
viz. (="namely, in other words" << Lat. videlicet)

In the tool-tips for acc. & abl. is written only "accusative" & "ablative" with lost "case", while for other cases it's written with "case". The word "case" should be added for all.

+ A bug with "A1." instead of "Ā." (for Ātmanepada).

+ A bug of changing the word "see" to the quotation mark "»". I think it should be changed back to "see". Text like "manasā (for °si » above )" is not understandable. Also I've noted this change is inconsistent, as both forms appear in the text: "[see Pāṇ. 3-1, 40 " and "(» Pāṇ. 3-1, 42)".

gasyoun commented 6 years ago

BTW, is there a special thread for MW's abbreviations?

https://github.com/sanskrit-lexicon/MWS/issues/15 , but it's different.

there are many abbrs without tool-tipping.

Great catch, we can add them now.

The text like "manasā (for °si » above )" is not understandable

Agree.

People do not understand as well the XML markup:

funderburkjim commented 6 years ago

special thread for MW's abbreviations

If by 'thread' you mean Github repository, then no. No special thread currently. Appropriate to make such comments within this and/or new issues here in MWS repository.

Here's a brief description of what underlies the abbreviation tooltips in MW. Currently literary source abbreviations are handled in a different subsystem than that used for abbreviations.

There is a table mwab_input which contains abbreviations and abbreviation expansions.

The display program disp.php starts with an xml record for MW and transforms it into HTML. For instance under headword 'praC', part of the xml record is

<c><to/>to_ask_,_question_,_interrogate</c> <p><ab>acc.</ab></p> 

As part of the analysis, the program identifies <ab>acc.</ab> ; it then looks up the contents acc. in the mwab_input table.

acc.    <id>acc.</id> <disp>accusative case</disp>

The text within the <disp> tag is then used to generate a tooltip, which is added to the HTML generated for the praC record.
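The lookup just described can be sketched roughly as follows. This is not the code in disp.php; it is a minimal Python illustration, assuming the table line format shown above. The names `MWAB_INPUT`, `load_abbreviations`, `add_tooltips` and the CSS class `ab` are hypothetical, and the `abl.` line is assumed to follow the same pattern as the `acc.` line.

```python
import html
import re

# Hypothetical in-memory stand-in for the mwab_input table. Each line is
# "<key><whitespace><id>..</id> <disp>..</disp>", as in the sample above.
MWAB_INPUT = """\
acc.\t<id>acc.</id> <disp>accusative case</disp>
abl.\t<id>abl.</id> <disp>ablative case</disp>
"""

def load_abbreviations(text):
    """Return {abbreviation: expansion} parsed from mwab_input-style lines."""
    table = {}
    for line in text.splitlines():
        m = re.search(r"<id>(.*?)</id>\s*<disp>(.*?)</disp>", line)
        if m:
            table[m.group(1)] = m.group(2)
    return table

def add_tooltips(xml_fragment, table):
    """Replace each <ab>..</ab> with an HTML span; add title= when known."""
    def render(match):
        abbr = match.group(1)
        tip = table.get(abbr)
        if tip is None:
            # No entry in the table: the abbreviation is shown without a tooltip.
            return "<span class='ab'>%s</span>" % html.escape(abbr)
        return "<span class='ab' title='%s'>%s</span>" % (
            html.escape(tip, quote=True), html.escape(abbr))
    return re.sub(r"<ab>(.*?)</ab>", render, xml_fragment)
```

An abbreviation that is tagged as `<ab>` but has no row in the table falls through to the plain span, which matches the "marked but no tooltip" symptom reported above.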

Note: Based on your suggestion, I've added 'case' to the display text in mwab_input for 'acc.' and 'abl.'

Reasons for no tooltip

Related to the first reason are the cases like 'cl.1.P' or 'A1'. These are typically tagged by <vlex> rather than <ab>.

gasyoun commented 6 years ago

Related to the first reason are the cases like 'cl.1.P' or 'A1'. These are typically tagged by <vlex> rather than <ab>.

Is there any good way out?

funderburkjim commented 6 years ago

Additions to mwab_input

Based on @SergeA examples above, there are several cases where the xml (under 'kf') has appropriate <ab> markup, but there was no corresponding entry in mwab_input. Entries for these have now been added to mwab_input:

ind.p.
Inf.   (lower-case form was present)
perf.
pr.
pr. p.
Prec.
reflex.  (upper case Reflex. was present)
sg.
ss.vv.   (s.v. was present)
viz.

Tooltips now appear for these.

Note: 'viz.' is not properly marked as <ab> . Will deal with this below.

gasyoun commented 6 years ago

Tooltips now appear for these.

Hurray. Let's add the root sign as well

( √kṛ Uṇ. iv, 144)

and L. is unnamed as well

[L=45269]

so is

&c

and

qq. vv. from and gamana, qq. vv.) Bhāshāp. Tarkas. [L=45273]

in

mf(ī)n. not recognised as adjective, it's said only about neuter in tooltip. From

mf(ī)n. (fr. karman ; g. chattrā*di), active, laborious Pāṇ. 6-4, 172. [L=48816]

and

mn. in

mn. ( √ naṭ ; g. ardharcā*di) dancing, acting, a dance L. [L=105293]

and

fr.

mfn. (fr. kṛmi), belonging to a worm Comm. on Uṇ. iv, 121. [L=48836]

in nom. acc. sg. n. the sg. not marked, strange [L=238107] pl. marked, sg. - not, twice

and g.

gara

and pp.

pp. 806 and 807) formed, made, composed (?) RV. v, 45, 6 (others, " fr. √ man ", others, " mātā, mother " ; cf. deva-māta). [L=161699]

SergeA commented 6 years ago

There is a table mwab_input which contains abbreviations and abbreviation expansions.

Some remarks for this list.

acc.    <id>acc.</id> <disp>accusative case</disp>
accord. <id>accord.</id> <disp>according to</disp>

MW: accord. or acc. = according. So we have to change acc. = accusative case or according

Br. <id>Br.</id> <disp>Breton</disp> wrong MW: Br. = Brāhmaṇa

Hind. <id>Hind.</id> <disp>Hindi</disp> MW: Hindī

opp. to = opposed to missed


Pra1k.  <id>Pra1k.</id> <disp>Prakrit</disp>
Pra1kr. <id>Pra1kr.</id> <disp>Prakrit</disp>

MW: Prāk. or Prākr. = Prākṛit

SergeA commented 6 years ago

Tooltips now appear for these.

Thanks.

In the kṛ entry "ib." looks green, but without tool-tip. Why?

A big problem with abbreviation is missing separators in the sources. E.g. in the headword gam. to wish to go, be going Lāṭy. MBh. xvi, 63 ; here there must be a semicolon between Lāṭy. & MBh. to strive to obtain ṠBr. x ChUp. ; must be semicolon between "ṠBr. x" & ChUp. And so on. Perhaps there was a processing error in some step. It would be great to return these semicolons back.

Also in the last example there are 2 spaces before "x" and 1 space after, so it is visually closer to the following ChUp. while it belongs to the previous ŚBr. That's also misleading.

funderburkjim commented 6 years ago

Cases with missing <ab> markup

Change transactions to mw.xml will be generated for these.

These changes to mw.xml have been partially installed; they are visible in Basic, List, Adv. Search displays, but not yet in list-0.2 displays.

SergeA commented 6 years ago

In the file mw_orig_utf8.txt the sources look like ‹¯S3Br.…x› ‹¯ChUp.› There is no semicolon between. Don't understand why semicolons were eliminated. But as sources are marked separately, it is possible to write an algorithm to reintroduce those semicolons for more intelligible output.

funderburkjim commented 6 years ago

Br.

Current markup shows NO instances of <ab>Br.</ab>. Not sure where the "Breton" came from.

I'll go ahead and put Brāhmaṇa in mwab_input, since you find that in MW's abbreviation list.

There are many (1200+) instances of <ls>Br.</ls> Possibly, these should be changed to <ab>Br.</ab> ? Since the literary source for Br. is 073 brāhmaṇa Literary category[BR., Br.], it's probably ok that Br. appears within <ls> tag.

There are 31 instances of <ls>Br.xxx</ls>, such as Br._xiii_,_4_,_11<ab>Sch.</ab></ls>. (example under hw kuRi)

funderburkjim commented 6 years ago

Pra1k.

There are no instances of <ab>Pra1k.</ab> nor of <ab>Pra1kr.</ab> .

Indeed there are no instances of the Pra1k. string at all.

However, there are 7 cases of Pra1kr. These are currently coded as

<as0 type="ns">Pra1kr.</as0><as1>Prakrit</as1>   (6 times)
<ls>Pra1kr.</ls>   (1 time).

Since Pra1kr. is one of the published abbreviations, I'm changing these 7 instances to <ab>Prākr.</ab>, and changing mwab_input accordingly.
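For readers unfamiliar with the letter-number coding seen in strings like Pra1kr., here is a small Python sketch of the conversion to diacritics. The mapping is deliberately limited to pairs that can be read off examples in this thread (Pa1n2. = Pāṇ., Ka1s3. = Kāś., La1t2y. = Lāṭy., A1. = Ā.); the names `AS_TO_IAST` and `as_to_iast` are hypothetical, and the real conversion table is certainly longer.

```python
# Letter-number pairs limited to those evidenced in this thread.
# A complete table would need many more entries.
AS_TO_IAST = {
    "a1": "ā", "A1": "Ā",
    "n2": "ṇ", "t2": "ṭ",
    "s3": "ś", "S3": "Ś",
}

def as_to_iast(text):
    """Replace each known letter-number pair by the letter with its diacritic."""
    for coded, letter in AS_TO_IAST.items():
        text = text.replace(coded, letter)
    return text
```

Plain string replacement is enough here because a coded pair is always a letter followed by a digit; accent marks such as the 4 in ga4mati use other digits and are left untouched by this partial table.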

php parsing problem again

The above changes have been made, but the tooltip doesn't work (example hw = nAgammA ).

The problem is the way the xml parser handles the text content of <ab>Prākr.</ab> -- it breaks it up into two pieces, Pr and ākr. So the table lookup into mwab_input is foiled.

We encountered this previously with the 'ls' abbreviations in PW, after the reference names were changed to IAST Unicode . Got by this problem somehow in PW case; will use that as a model to walk around this obstacle. Will do this, and continue with the other suggestions, another day.
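Event-based XML parsers are allowed to deliver the text of a single element in several pieces, which would explain a split like Pr + ākr. The usual remedy is to collect the pieces and join them when the closing tag arrives. The sketch below shows that pattern in Python with the standard expat binding; it is an illustration of the general technique, not the fix used in disp.php or in the PW case, and the function name `ab_contents` is hypothetical.

```python
import xml.parsers.expat

def ab_contents(xml_text):
    """Return the complete text of every <ab> element in xml_text."""
    results = []
    chunks = None  # list of text pieces while inside <ab>, otherwise None

    def start(name, attrs):
        nonlocal chunks
        if name == "ab":
            chunks = []

    def chars(data):
        # May be called more than once for one text node; just collect.
        if chunks is not None:
            chunks.append(data)

    def end(name):
        nonlocal chunks
        if name == "ab" and chunks is not None:
            # Only now is the text complete and safe to use as a lookup key.
            results.append("".join(chunks))
            chunks = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return results
```

The table lookup then uses the joined string, so it no longer matters how the parser chose to chunk the text.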

funderburkjim commented 6 years ago

Abbreviation page from MW

Added image of the abbreviation page from MW print to this repository: mwabbrev

Unfortunately, this png doesn't open up in a resizeable way. So here is another more useable copy: mwabbreviations

If you click on this image, it opens into a large enough size to read.

gasyoun commented 6 years ago

but not yet in list-0.2 displays

The only one that matters nowadays :o

Don't understand why semicolons were eliminated.

Indeed, everything else is done with such scrutiny.

possible to write an algorithm to reintroduce those semicolons for more intelligible output.

And add some that should not be, but better than nothing

Not sure where the "Breton" came from.

Preface?

We encountered this previously with the 'ls' abbreviations in PW, after the reference names were changed to IAST Unicode .

Got it.

SergeA commented 6 years ago

After comparison of a fragment of the entry gam with the scan, I noticed:

  1. eliminated commas before sources
  2. eliminated punctuation between sources
  3. two kinds of punctuation between sources:
     a. semicolon for two independent sources, e.g. Lāṭy. ; MBh. xvi, 63
     b. comma for connected sources, e.g. Pāṇ. i, 4, 52, Kāṡ. or Pāṇ. vi, 4, 16, Siddh., where the reference actually points not to Pāṇ. but to the explanations of Kāś. and Siddh. on the Pāṇ. sūtras.

So the combination "abbr1 abbr2" in current digitalization corresponds to the original "abbr1 ; abbr2" with two independent sources or to "abbr1 , abbr2", where the second source abbr2 is dependent on the first abbr1. And I'm afraid there is no easy way to restore this lost syntax.

I've tried to quote the corresponding text and somehow to mark the places. And I deeply hate this stupid github interface which allows putting in tons of emojis but does not allow such a simple thing as coloring the text red. :pig: :small_red_triangle_down: = shows place of commas before source ref. :bangbang: = punctuation between sources

In the file mw_orig_utf8.txt.

(causal…of…the…causal)…¸to…cause…a…person› (•acc.) ‹¸to…go…by…means…of…{jigamiSati}…another›:pig: ‹¯Pa1n2.…1-4…,…52›:bangbang: ‹¯Ka1s3.:› •Desid. #{ji4gamiSati} (‹¯Pa1n2.…,…or›¨#{jigAMsate}:pig:¨‹¯Pa1n2.… 6-4…,…16›:bangbang:¨‹¯Siddh.›¨;¨ •impf.¨#{ajigAMsat}:pig:¨‹¯S3Br.…x)› ‹¸to…wish…to…go…,…be…going› ‹¯La1t2y.› :bangbang: ‹¯MBh.…xvi…,…63› {;}

In the web.

(causal of the causal) to cause a person (acc.) to go by means of jigamiśati another:pig: Pāṇ. 1-4, 52:bangbang: Kāṡ. : Desid. jigamiṣati ( Pāṇ. , or jigāṁsate:pig: Pāṇ. 6-4, 16:bangbang: Siddh. ; impf. ajigāṁsat:pig: ṠBr. x ) to wish to go, be going Lāṭy. :bangbang: MBh. xvi, 63 ;

How it should be.

(causal of the causal) to cause a person (acc.) to go by means of :scissors:jigamiśati:scissors: another:small_red_triangle_down:, :small_red_triangle_down: Pāṇ. i, 4, 52:exclamation:, :exclamation: Kāṡ. : Desid. jigamiṣati ( Pāṇ. , or jigāṁsate:small_red_triangle_down:, :small_red_triangle_down: Pāṇ. vi, 4, 16:exclamation:, :exclamation: Siddh. ; impf. ajigāṁsat:small_red_triangle_down:, :small_red_triangle_down: ṠBr. x ) to wish to go, be going Lāṭy. :exclamation:; :exclamation: MBh. xvi, 63 ;

The marked word jigamiśati is a typo and must be cut out.

Also I found that somebody changed the style of numbering of Pāṇ. chapters from Roman_number_with_comma to Arabic_number_with_hyphen.

SergeA commented 6 years ago

Current markup shows NO instances of <ab>Br.</ab>.

I think MW included this Br. in the list because it is widely used in combinations like ŚBr. etc.

Not sure where the "Breton" came from.

From a typo for the next line giving Bret. = Breton. (But it seems that Bret. is not used in the text.)

funderburkjim commented 6 years ago

The only one that matters nowadays

There is a practical reason for the 'partial installation' that you may not be aware of. The Basic and other displays depend on a database which is essentially the same as X.xml; it is quite efficient to put X.xml into a sqlite table.

By contrast, the various list-0.2 displays depend on a different database. An entry in this database contains the HTML form. In other words, the display function used in the Basic displays is applied to every element of X.xml, and the resulting HTML is saved in the database.

In the case of MW, the HTML construction process is quite complex (disp.php is complex), so this HTML construction step takes 15-30 minutes on the Cologne server.

Thus, when, as now, we are making many changes to the disp.php function for MW, it is not practical to rerun the HTML construction step with every change.

When the current flurry of changes to MW display is concluded, then that will be the time to remake the HTML database that the list-0.2 displays depend on.

Please be patient.
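To make the difference between the two databases concrete, here is a toy Python sketch using an in-memory sqlite database. The table names, column names and the `render_html` stand-in are all hypothetical; the real display function is far more complex, which is exactly why the second step is slow.

```python
import sqlite3

def render_html(xml_record):
    # Hypothetical stand-in for the real display function.
    return xml_record.replace("<ab>", "<span class='ab'>").replace("</ab>", "</span>")

def build_databases(records):
    """records: list of (key, xml) pairs. Returns an in-memory connection."""
    con = sqlite3.connect(":memory:")
    # Database 1: essentially the xml itself, cheap to load.
    con.execute("CREATE TABLE xml_records (key TEXT, data TEXT)")
    con.executemany("INSERT INTO xml_records VALUES (?, ?)", records)
    # Database 2: the display function applied to every record, result stored.
    con.execute("CREATE TABLE html_records (key TEXT, html TEXT)")
    rows = con.execute("SELECT key, data FROM xml_records").fetchall()
    con.executemany("INSERT INTO html_records VALUES (?, ?)",
                    [(key, render_html(data)) for key, data in rows])
    con.commit()
    return con
```

Any change to the display function invalidates every row of the second table, so it has to be rebuilt in full, while displays that render from the first table pick up the change immediately.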

funderburkjim commented 6 years ago

more abbreviations and tooltips

Prākr. tooltip now works.

cl., P., A1. now show tooltips and instead of A1 we see Ā. In mw.xml, these appear in <vlex> tag.

Replaced » with See. (no tooltip involved here). In mw.xml this is represented as <see/>. It was » in the early coding.

Tooltips added for

In displays, added semicolons between contiguous literary sources. The mw_orig_utf8.txt file that @SergeA references is from Thomas, and what we started with back 2005-6. Since his markup separates the literary source elements, he must have thought the semicolons were superfluous. Anyway, they are back in the displays now, and it is an improvement.

The more subtle distinctions of comma v. semicolon v. period that SergeA mentions could be done case by case; but I don't see any practical algorithmic solution. However, perhaps the notion of 'dependent sources' could be formalized to some extent. E.g. given <ls>X</ls> <ls>Y</ls>, if X and Y are dependent then insert a comma. One dependency rule would be X = Y (<ls>Pāṇ.</ls> , <ls>Pān.</ls>). A table of other dependent X and Y would probably catch many cases. We could give this a try sometime if it seems worth the effort.
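A rough Python sketch of the idea, under stated assumptions: contiguous `<ls>` elements get a semicolon by default and a comma when the pair is in a dependency table or when both name the same source. The `DEPENDENT` table, `source_name` and `separate_sources` are hypothetical names, the two table entries are taken from examples in this thread, and `<ls>` elements with nested markup are not handled.

```python
import re

# Hypothetical table of (first source, second source) pairs where the second
# comments on the first. Both pairs come from examples in this thread.
DEPENDENT = {("Pāṇ.", "Kāś."), ("Pāṇ.", "Siddh.")}

# An <ls> element followed, after optional whitespace only, by another <ls>.
CONTIGUOUS = re.compile(r"<ls>([^<]*)</ls>\s*(?=<ls>([^<]*)</ls>)")

def source_name(ls_text):
    """First token of an <ls> body, e.g. 'Pāṇ.' from 'Pāṇ. vi, 4, 16'."""
    parts = ls_text.split()
    return parts[0] if parts else ""

def separate_sources(xml_fragment):
    """Insert ', ' or ' ; ' between contiguous <ls> elements."""
    def repl(m):
        first, second = source_name(m.group(1)), source_name(m.group(2))
        dependent = (first, second) in DEPENDENT or first == second
        return "<ls>%s</ls>%s" % (m.group(1), ", " if dependent else " ; ")
    return CONTIGUOUS.sub(repl, xml_fragment)
```

Sources separated by any intervening text are left alone, so this only covers the contiguous case; which pairs belong in the table would still have to be decided by hand.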

The ( Pāṇ. , or jigāṁsate🔻, 🔻 Pāṇ. vi, 4, 16❗️, example shows another wrinkle, in that the missing comma is not separating contiguous references, due to the intervention phrase or jigāṁsate.

I've gotten rid of the gratuitous jigamiśati.

funderburkjim commented 6 years ago

somebody changed the style of numbering of Pāṇ. chapters from Roman_number_with_comma to Arabic_number_with_hyphen.

somebody = Thomas, as mw_orig_utf8.txt indicates. One speculation is that he thought the Arabic number form was more consistent with current references; (e.g. Katre Pāṇ. 6.4.16) It would likely be possible to programmatically change these references back to a form close to that which MW uses.
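A possible shape for that programmatic change, as a Python sketch: it rewrites only the book number of a Pāṇ. reference from "6-4, 16" to "vi, 4, 16" and leaves everything else untouched. The function name `panini_to_roman` is hypothetical, and the pattern assumes the reference is written exactly as "Pāṇ." followed by book-chapter with a hyphen.

```python
import re

# Pāṇini's grammar has eight books, so eight numerals suffice.
ROMAN = ["", "i", "ii", "iii", "iv", "v", "vi", "vii", "viii"]

def panini_to_roman(text):
    """Rewrite 'Pāṇ. 6-4, 16' as 'Pāṇ. vi, 4, 16'; other text is left alone."""
    def repl(m):
        book = int(m.group(2))
        if not 1 <= book <= 8:
            return m.group(0)  # not a plausible book number, keep as is
        return "%s%s, %s" % (m.group(1), ROMAN[book], m.group(3))
    return re.sub(r"(Pāṇ\.\s*)(\d+)-(\d+)", repl, text)
```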

funderburkjim commented 6 years ago

In the kṛ entry "ib." looks green, but without tool-tip. Why?

ib. is systematically marked in mw.xml as <ls>ib.</ls> ; but it isn't included in the table of MW authorities, so it reverts to green like the numbered parts (chapter, verse, etc) of other <ls> .

In the recent changes to disp.php, I programmatically turned this into an abbreviation so it would get a tooltip; this is the current rendering.

funderburkjim commented 6 years ago

make a list of currently marked abbreviations

Since we're focusing on abbreviations and tooltips, I think we need to uncover things like 'opp. to' which appears in the printed list of abbreviations, but previously was not marked as an abbreviation. I'll do this soon and we can then compare the actual list to the printed list, and add markup and tooltips as needed.
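The comparison could be sketched in Python along these lines, assuming the abbreviations are read straight from `<ab>` tags in the xml text and the printed list is available as plain strings. `marked_abbreviations` and `compare_with_printed` are hypothetical names, and the sample data in the usage note is made up for illustration.

```python
import re
from collections import Counter

def marked_abbreviations(xml_text):
    """Count how often each string occurs as the body of an <ab> element."""
    return Counter(re.findall(r"<ab>([^<]*)</ab>", xml_text))

def compare_with_printed(xml_text, printed_list):
    """Report abbreviations on one side only: marked in xml vs. printed list."""
    marked = set(marked_abbreviations(xml_text))
    printed = set(printed_list)
    return {
        "marked_not_printed": sorted(marked - printed),
        "printed_not_marked": sorted(printed - marked),
    }
```

For example, with xml text containing `<ab>acc.</ab>` twice and `<ab>viz.</ab>` once, and a printed list of "acc." and "opp. to", the report would flag "viz." as marked but not printed and "opp. to" as printed but not marked.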

gasyoun commented 6 years ago

I'll do this soon and we can then compare the actual list to the printed list, and add markup and tooltips as needed.

So be it.

HTML construction step takes 15-30 minutes on the Cologne server.

Oh, so it's harder than I thought. Can you make video recordings of what you do, please? So we understand.

E.g. given <ls>X</ls> <ls>Y</ls>, if X and Y are dependent then insert a comma.

Makes sense. Pāṇ. and Siddh. are connected, so we could make list of possible connections.

We could give this a try sometime if it seems worth the effort.

Worth a try.

It would likely be possible to programmatically change these references back to a form close to that which MW uses.

That makes sense.

SergeA commented 6 years ago

The ( Pāṇ. , or jigāṁsate🔻, 🔻 Pāṇ. vi, 4, 16❗️, example shows another wrinkle, in that the missing comma is not separating contiguous references, due to the intervention phrase or jigāṁsate.

In this example Desid. jigamiṣati ( Pāṇ. , or jigāṁsate, Pāṇ. vi, 4, 16, Siddh. ; impf. ajigāṁsat, ṠBr. x ) the source is given as a supplement to the word-form. So it reads as "jigamiṣati by Pāṇ.´s rules or alternative jigāṁsate by Siddh. comm. to Pāṇ." The comma in jigamiṣati ( Pāṇ. , or jigāṁsate separates alternative word-forms jigamiṣati & jigāṁsate. And in jigāṁsate, Pāṇ. vi, 4, 16, Siddh. & ajigāṁsat, ṠBr. x the lost commas after word-forms separate them from the additional source reference. All commas before sources are eliminated. But not all sources have comma before, as there must be no comma in the case of ; abbr or (abbr or in abbr etc.

Here is an example of a case where two independent sources are complicated with additional word-forms in brackets. to visit, RV. x, 41, 1 (p. ganigmat) ; VS. xxiii, 7 (impf. aganīgan) ; Here eliminated comma before sources separate them from the word-meaning. This comma is not very crucial and its absence does not impede the understanding. And the eliminated semicolon separate two sources RV.& VS. And without it, it becomes not clear if in RV. x, 41, 1 (p. ganigmat) VS. xxiii, 7 the word-form is quoted from RV. or from VS.

what we started with back 2005-6

It's a pity we have no files from earlier steps. If there were a file with those original commas and semicolons, perhaps it would be possible to make some wise algorithm which could restore all this punctuation from comparison of two files.

if X and Y are dependent then insert a comma

Yes, it can work, but only partially, for the more frequent dependent sources.

gasyoun commented 6 years ago

Yes, it can work, but only partially, for the more frequent dependent sources.

Let's give the restoration a try. I agree it was an unwise step at that time to kill punctuation.

funderburkjim commented 6 years ago

It is a pity ...

We DO have an earlier version, which I just remembered. It is the one used in this display: http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html

Acc. to the html page, this data is as of 1997. It obviously is in a simpler form, and I think maybe the missing punctuations are present.

This version played no direct part (until perhaps now) in the current mw.xml.

Here is a curl command to download the underlying data file: mwd.txt (about 25MB):

curl -o mwd.txt http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/dat/mwd.txt

You can get to 'gam' by

gasyoun commented 6 years ago

Acc. to the html page, this data is as of 1997. It obviously is in a simpler form, and I think maybe the missing punctuations are present.

Hurray. As far as I can see the punctuation is as should be, see MBh. Ragh. ii , 15 ; xii , 7 ; cf. Pa1n2. 2-3 , 12]

<begin>mwd
<hl>gam</hl>
<st>gam</st>
<en>1 Ved. cl. 1. P. %{ga4mati} (Naigh. ; Subj. %{gamam} , %{ga4mat} [%{gamAtas} , %{gamAtha} AV.] , %{gamAma} , %{gaman} RV. ; Pot. %{game4ma} RV. ; inf. %{ga4madhyai} RV. i , 154 , 6): cl. 2. P. %{ga4nti} (Naigh. ; Impv. 3. sg. %{gantu} , [2. sg. %{gadhi} see %{A-} , or %{gahi} see %{adhi-} , %{abhy-A-} , %{A-} , %{upA7-}] , 2. pl. %{ga4ntA} or %{gantana} RV. ; impf. 2. and 3. sg. %{a4gan} [RV. AV.] , 1. pl. %{a4ganma} [RV. AV. ; cf. Pa1n2. 8-2 , 65] , 3. pl. %{a4gman} RV. ; Subj. [or aor. Subj. cf. Pa1n2. 2-4 , 80 Ka1s3.] 1. pl. %{ganma} , 3. pl. %{gma4n} RV. ; Pot. 2. sg. %{gamyAs} RV. i , 187 , 7 ; Prec. 3. sg. %{gamyA4s} RV. ; pr. p. %{gma4t} , x , 22 , 6): cl. 3. P. %{jaganti} (Naigh. ii , 14 ; Pot. %{jagamyAm} , %{-yAt} RV. ; impf. 2. and 3. sg. %{ajagan} , 2. pl. %{ajaganta} or %{-tana} RV.): Ved. and Class. cl. 1. P. (also A1. MBh. &c.) , with substitution of %{gacch} [= $] for %{gam} , %{ga4cchati} (cf. Pa1n2. 7-3 , 77 ; Subj. %{gA7cchAti} RV. x , 16 , 2 ; 2. sg. %{gacchAs} [RV. vi , 35 , 3] or %{gacchAsi} [AV. v , 5 , 6] ; 2. pl. %{gacchAta} RV. viii , 7 , 30 ; 3. pl. %{ga4cchAn} RV. viii , 79 , 5 ; impf. %{a4gacchat} ; Pot. %{gacchet} ; pr. p. %{ga4cchat} RV. &c. ; aor. %{agamat} Pa1n2. 3-1 , 55 ; vi , 4 , 98 Ka1s3. ; for A1. with prepositions cf. Pa1n2. 1-2 , 13 ; 2nd fut. %{gamiSyati} AV. &c. ; 1st fut. %{ga4ntA} [Pa1n2. 7-2 , 58] RV. &c. ; perf. 1. sg. %{jagamA} [RV.] , 3. sg. %{jagAma} , 2. du. %{jagmathur} , 3. pl. %{jagmu4r} RV. &c. ; p. %{jaganva4s} [RV. &c.] or %{jagmivas} Pa1n2. 7-2 , 68 f. %{jagmu4SI} RV. &c. [347,1] ; Ved. inf. %{ga4ntave} , %{ga4ntavai4} ; Class. inf. %{gantum}: Ved. ind. p. %{gatvAya} , %{gatvI4} ; Class. ind. p. %{gatvA4} [AV. &c.] , with prepositions %{-gamya} or %{-gatya} Pa1n2. 6-4 , 38) to go , move , go away , set out , come RV. &c. ; to go to or towards , approach (with acc. or loc. or dat. [MBh. Ragh. ii , 15 ; xii , 7 ; cf. Pa1n2. 2-3 , 12] or %{prati} [MBh. R.]) RV. &c. ; to go or pass (as time e.g. 
%{kAle@gacchati} , time going on , in the course of time) R. Ragh. Megh. Naish. Hit. ; to fall to the share of (acc.) Mn. &c. ; to go against with hostile intentions , attack L. ; to decease , die Ca1n2. ; to approach carnally , have sexual intercourse with (acc.) A1s3vGr2. iii , 6 Mn. &c. ; to go to any state or condition , undergo , partake of , participate in , receive , obtain (e.g. %{mitratAM@gacchati} , `" he goes to friendship "' i.e. he becomes friendly) RV. AV. &c. ; %{jAnubhyAm@avanIM-gam} , `" to go to the earth with the knees "' , kneel down MBh. xiii , 935 Pan5cat. v , 1 , 10/11 ; %{dharaNIM@mUrdhnA-gam} , `" to go to the earth with the head "' , make a bow R. iii , 11 , 6 ; %{ma4nasA-gam} , to go with the mind , observe , perceive RV. iii , 38 , 6 VS. Nal. R. ; (without %{ma4nasA}) to observe , understand , guess MBh. iii , 2108 ; (especially Pass. %{gamyate} , `" to be understood or meant "') Pa1n2. Ka1s3. and L. Sch. ; %{doSeNa} or %{doSato-gam} , to approach with an accusation , ascribe guilt to a person (acc.) MBh. i , 4322 and 7455 R. iv , 21 , 3: Caus. %{gamayati} (Pa1n2. 2-4 , 46 ; Impv. 2. sg. Ved. %{gamayA} or %{gAmaya} [RV. v , 5 , 10] , 3. sg. %{gamayatAt} AitBr. ii , 6 ; perf. %{gamayA4M@cakAra} AV. &c.) to cause to go (Pa1n2. 8-1 , 60 Ka1s3.) or come , lead or conduct towards , send to (dat. AV.) , bring to a place (acc. [Pa1n2. 1-4 , 52] or loc.) RV. &c. ; to cause to go to any condition , cause to become TS. S3Br. &c. ; to impart , grant MBh. xiv , 179 ; to send away Pa1n2. 1-4 , 52 Ka1s3. ; `" to let go "' , not care about Ba1lar. v , 10 ; to excel Prasannar. i , 14 ; to spend time S3ak. Megh. Ragh. &c. ; to cause to understand , make clear or intelligible , explain MBh. iii , 11290 VarBr2S. L. Sch. ; to convey an idea or meaning , denote Pa1n2. 3-2 , 10 Ka1s3. ; (causal of the causal) to cause a person (acc.) to go by means of {jigamiSati} another Pa1n2. 1-4 , 52 Ka1s3.: Desid. %{ji4gamiSati} Pa1n2. , or %{jigAMsate} Pa1n2. 
6-4 , 16 Siddh. ; impf. %{ajigAMsat} S3Br. x) to wish to go , be going La1t2y. MBh. xvi , 63 ; to strive to obtain S3Br. x ChUp. ; to wish to bring (to light , %{prakA4zam}) TS. i: Intens. %{ja4Gganti} (Naigh.) , %{jaGgamIti} or %{jaGgamyate} (Pa1n2. 7-4 , 85 Ka1s3.) , to visit RV. x , 41 , 1 (p. %{ga4nigmat}) VS. xxiii , 7 (impf. %{aganIgan}) ; [cf. $ ; Goth. {qvam} ; Eng. {come} ; Lat. {venio} for {gvemio}.] </en>
SergeA commented 6 years ago

maybe the missing punctuations are present

Not at all. :( The only thing we can get from this is the fact that 20 years ago punctuation was already spoiled.

As far as I can see the punctuation is as should be, see MBh. Ragh.

Between Mbh. & Ragh. should be semicolon!

gasyoun commented 6 years ago

20 years ago punctuation was already spoiled.

That means it was always there in such a form, we're doomed. In this case, interlinking sources would turn out to be a catastrophe.

SergeA commented 6 years ago

That means it was always there in such a form

Wrong. The first stage was OCR, and only then followed that buggy tagging. The first version of Cologne web-search of 2003 is corrupted, and the soft version of Louis Bontes of 2001 is corrupted too. Perhaps the basic version of 1997 copyrighted by Thomas Malten was already corrupted. But still there is a feeble hope that somewhere are kept files of previous intermediate backups.

interlinking sources would turn out to be a catastrophe

Most sources are given without specification of the referred line. Even without pointing the exact edition of the text. So it is catastrophe in any case.

gasyoun commented 6 years ago

Most sources are given without specification of the referred line.

But the reference in many cases is given before. So there is some hope, still.

funderburkjim commented 6 years ago

Louis Bontes of 2001

This also is based on Thomas's digitization. AFAIK, there is no digitization of MW independent of that done by Thomas and his group.

This article was the first (and only) description that Thomas made of the MW digitization; it was written in 1997. So probably the tamil-website dataset of 1997 mentioned above is quite close to the original form of the digitization.

gasyoun commented 6 years ago

This also is based on Thomas's digitization. AFAIK, there is no digitization of MW independent of that done by Thomas and his group.

Sure. Just a joke: the creator of https://sourceforge.net/projects/sandic/ (similar to what Bontes did - an offline UI, @SergeA still uses Bonte, because it's quicker and has Advanced search features like suffixes) said that there are so many errors in MW in 2016 that he will rescan the book. How foolish can be such a statement only Thomas and Jim know.

SergeA commented 6 years ago

This also is based on Thomas's digitization. AFAIK, there is no digitization of MW independent of that done by Thomas and his group.

Yes, of course. I'm just trying to determine the earliest version of the available digitalization files. The file of Bontes is modified in 2001. While the file MONIER.ALL I've downloaded from the site is modified in 2010. Perhaps it is the same as MONIER.ALL file mentioned in the 1997 report. But there is not 100% certitude.

So probably the tamil-website dataset of 1997 mentioned above is quite close to the original form of the digitization.

On the "tamil" site only copyright 1997 is stated, but nothing is said about whether the files were changed after 1997.

SergeA still uses Bonte, because it's quicker and has Advanced search features

Yeah, despite all the typos, after all this time this application is still the quickest and easiest for me.

SergeA commented 6 years ago

tapas Erroneous tagging of the semicolon between sources, it's treated as word-meaning separator.

[p= 437,1] [L=82702]    religious austerity, bodily mortification, penance, severe meditation, special observance (e.g. " sacred learning " with Brahmans, " protection of subjects " with kṣatriyas, " giving alms to Brahmans " with vaiśyas, " service " with śūdras, and " feeding upon herbs and roots " with ṛṣis Mn.  xi, 236 ) RV.  ix, 113, 2
[p= 437,1] [L=82703]    x (personified, 83, 2 f. & 101, 1 , " father of manyu " RAnukr.  ) AV.  &c.

original: ... RV. ix, 113, 2 ; x (personified, 83, 2 f. & 101, 1 ... Where are referenced RV. ix.113.2 + x.83.2_&_the_following + x.101.1. A very intricate way to write references!

And BTW the last ref. x.101.1 seems to be wrong, I did not find the word in this RV verse. But we can't do anything about, can we?

gasyoun commented 6 years ago

But we can't do anything about, can we?

We can at least note it in the factual mistakes .txt file.

SergeA commented 6 years ago

taporati - 3 problems

(H3) tapo--rati [p= 437,3] [L=82800] mfn. id., i, 1838 (H3B) tapo--rati [p= 437,3] [L=82801] m. N. of a son of manu tāmasa Hariv. 429 (H3B) tapo--rati [p= 437,3] [L=82802] m. = -ravi VP. iii, 2, 34.

  1. The proper name Manu Tāmasa is written all in lower-case. This is a processing bug which should be corrected.

  2. The abbr. "id." refers to the meaning of the previous word.

(H3) tapo--rata [p= 437,3] [L=82799] mfn. rejoicing in religious austerity, pious MBh. i, 36, 3

This is a big problem of the digitalization. In the original continuous text it was simple to read the referred meaning from the previous line. But in the digitalization there is no previous line, because it pertains to another headword and is not shown. So the word meaning becomes meaningless. There is a strong need to provide this lost info.

  3. Here we have a strange source reference i, 1838 without naming the source. I supposed the source is the same as for the referred previous meaning, i.e. MBh.

(H3) tapo--rata [p= 437,3] [L=82799] mfn. rejoicing in religious austerity, pious MBh. i, 36, 3

And bingo! But actually the Calcutta ed. 1st parvan, line 1838 gives the reading ते पारतं (!) , which is a print error for तपोरतं, from the stem taporata (!), not taporati. So here we have a triple error: in the MBh. text, in referring to the wrong word, and in forgetting to give the source name, which should be "MBh." But I do not know if it is a print error or whether MW meant the source MBh. to be taken from the previous line, which of course is not a correct way to refer to sources, but perhaps there is a similar case somewhere.

Yet one more peculiarity here is that the MBh. is quoted from different (!) editions:

gasyoun commented 6 years ago

Calcutta ed. 1st parvan

Please give a link, it's archive.org, I suppose?

triple error

@SergeA adores such cases.

MBh. is quoted from different (!) editions

Yes, I guess we should note it in the abbreviations list. The forms 1.1838 and 1.36.3 make it easy to tell which edition is meant.

SergeA commented 6 years ago

Mahabharata Calcutta ed. vol.1 1834 https://books.google.ru/books?id=tNJCAAAAcAAJ vol.2 1836 https://books.google.ru/books?id=5ye6Eo3J9ywC vol.3 1837 https://books.google.ru/books?id=bDW-NI_EXOIC vol.4 1839 https://books.google.ru/books?id=MI9WPBKoDDkC

SergeA commented 6 years ago

Another example of an inherited source reference.

(H2B) tapta [p= 437,3] [L=82826] n. hot water ṠBr. xiv, 1, 1, 29

(H2C) taptam [p= 437,3] [L=82827] ind. in a hot manner, xi, 2, 7, 32 .

For the headword taptam the source name ŚBr. is omitted, because it was mentioned in previous line for tapta.
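If the lost source names were ever to be supplied automatically, one naive approach is to carry the last named source forward to any reference that begins with a bare Roman numeral. The Python sketch below shows only that idea; `BARE_REF` and `fill_inherited_sources` are hypothetical names, the input shape is an assumption, and the numeral test would give false positives on strings such as "cl.", so real data would need a stricter check.

```python
import re

# A reference that starts with a lower-case Roman numeral carries no
# source name of its own. This is a crude test for illustration only.
BARE_REF = re.compile(r"^[ivxlc]+\b")

def fill_inherited_sources(entries):
    """entries: list of (headword, reference) pairs in dictionary order.
    Bare references are prefixed with the most recent named source."""
    last_source = None
    filled = []
    for headword, ref in entries:
        if BARE_REF.match(ref):
            if last_source:
                ref = "%s %s" % (last_source, ref)
        elif ref.split():
            # Remember the source name (first token) for later bare references.
            last_source = ref.split()[0]
        filled.append((headword, ref))
    return filled
```

Applied to tapta (ŚBr. xiv, 1, 1, 29) followed by taptam (xi, 2, 7, 32), this would yield ŚBr. xi, 2, 7, 32 for taptam. It would equally yield MBh. i, 1838 for taporati, which only shows where the print intended the reader to look, not that the reference itself is correct.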

SergeA commented 6 years ago

taptam - is missing in the list mode. By the word order it should appear between tapta and taptakumbha. A bug.

SergeA commented 6 years ago

Ṡ ṡ vs. Ś ś In the MW source references like Suṡr. , ṠārṅgS instead of Ṡ ṡ (with overdot, Monier style) better be used Ś ś (with acute, according to IAST).
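The replacement itself is mechanical. A minimal Python sketch, assuming the text may contain the overdot letters either as single code points or as S/s plus a combining dot; the name `normalize_sibilant` is hypothetical, and whether the change should apply only inside source references is left open here.

```python
import unicodedata

# U+1E60 / U+1E61 are S / s with dot above; U+015A / U+015B are S / s with acute.
OVERDOT_TO_ACUTE = str.maketrans({"\u1e60": "\u015a", "\u1e61": "\u015b"})

def normalize_sibilant(text):
    """Replace S/s with dot above by S/s with acute."""
    # NFC first, so that 'S' + combining dot above becomes the single
    # code point that the translation table looks for.
    return unicodedata.normalize("NFC", text).translate(OVERDOT_TO_ACUTE)
```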

SergeA commented 6 years ago

tamasvan - why accent & gender marks are reversed? Original:

Támas ... --van (tám°), mf(arī)n. = -vat

Old file (order as in original, but lost comma while tagging):

<H3>101{tamasvan}3{ta4mas--van}¦ (#{ta4m°}) •mf(#{arI})n. »= #{-vat}

Current representation (order reversed):

(H3) támas--van [p= 438,1] [L=82904] mf (arī) n. (tám°) = -vat

gasyoun commented 6 years ago

In the MW source references like Suṡr. , ṠārṅgS instead of Ṡ ṡ (with overdot, Monier style) better be used Ś ś (with acute, according to IAST).

Agree.

gasyoun commented 6 years ago

haṃsa-padī : f. N. of various plants (accord. to L. "a species of Mimosa and Cissus Pedata ") Car. [L=260114]

L. a clear case that L. is not only Lexicographers, but also Linn.

SergeA commented 6 years ago

a clear case that L. is not only Lexicographers, but also Linn.

A clear case where MW for botanical identification refers to the opinion of old Indian sources (L. = Lexicographers!), and not to modern investigations. Wilson even provides the name of this lexicographical text - Ratna Mālā. On the other hand, it's quite improbable that we could receive an explanation of the word haṃsa-padī from the Swedish botanist Carl Linné, who was never in India and took no interest in Sanskrit.

gasyoun commented 6 years ago

Wilson even provides the name of this lexicographical text - Ratna Mālā

Well done, as usual.

guru

In guru there is superl. gariṣṭha, gurutama, where superl. is left without a tooltip.

and comp. gárīyas is not a compound, wrong tooltip.