SKD digitization (Devanagari version)

funderburkjim commented 3 years ago

@Shalu411 Hi!

I've made a version of the digitization of skd that you requested. [Did you think I had forgotten?]

Currently, the version is a sample of the first 10,000 lines, and it is skd_deva_sample.txt

Take a look, and see if this sample is what you were requesting. Or, tell me of any problems.

When you give the go-ahead, I'll generate the whole dictionary in a similar way.

Shalu411 commented 3 years ago

Namaste Jim

[Did you think I had forgotten?]

You and forget!!? Not even in a dream! I am glad we have it at the right time. Thank you so much. I have seen the sample. It should do!

Now, how should I note down an error? For example, I see one right here (starred):

21-001अअ अ¦, व्य, अभा*बः* । अल्पः । There should be अभा*वः* What is the format to make it? Please tell with one example. Thanks --Shalu

drdhaval2785 commented 3 years ago

One piece of friendly advice, @Shalu411. Don't try to change b / v errors. Otherwise you will end up writing SKD and VCP afresh.

gasyoun commented 3 years ago

Don't try to change b / v errors.

Ignoring them is not a good idea either. But there must be thousands of them.

end up writing SKD and VCP afresh.

For our digital purposes it might not be such a bad idea at all, at least at an alternate headword level. Maybe just generate all words with v as b and vice versa, @drdhaval2785 ?
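A minimal sketch of that idea, assuming headwords are spelled in SLP1 (where `b` and `v` are single letters, and `B` for bh is a separate letter that is left alone). The function name `bv_variants` is hypothetical and not part of any existing Cologne tooling.

```python
from itertools import product

def bv_variants(word):
    """Return every spelling of an SLP1 word obtained by swapping
    one or more occurrences of 'b' and 'v' (the original is excluded)."""
    # Each b or v position may be either letter; other letters are fixed.
    options = [("b", "v") if ch in "bv" else (ch,) for ch in word]
    variants = {"".join(choice) for choice in product(*options)}
    variants.discard(word)
    return variants

print(bv_variants("aBAbaH"))
```

The number of variants doubles with each b or v in the word, so for long words it may be more practical to generate only the single fully swapped form.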

Shalu411 commented 3 years ago

Hariom. Ok. Assuming it is a mistake I have to note down, can I note the errors this way? Method 1) Give the whole technical detail of the word: <L>2<pc>1-001<k1>अ<k2> अभाबः >> अभावः

OR Method 2) just with the L-code? L=2 अभाबः >> अभावः Please guide me.

How is Sampada doing it?

@drdhaval2785 Can you please provide me the list of suspicious head-words / words in SKD? Thanks

funderburkjim commented 3 years ago

Please tell with one example.

An error file: skd_error.txt

In preparation, make an 'skd_error.txt' file where the changes are detailed. Within skd_error.txt, make a line for each change. The format of such a line would be, by example, 2:अ:अभाबः अभावः

There are 4 fields separated by colons, almost like your example. The 4 fields are:

L-code: the Cologne record number
k1: the headword
old: the word that needs to be corrected
new: the correction

If you want to make a comment in skd_error.txt file, insert one or more lines after the above 4-field correction line, and start each of the comment lines with a semicolon. You can add extra blank lines if you want.

These formatting details are consistent with the xxx_error1.txt files that Sampada and Anna have been using.
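As an illustration of the format described above, here is a minimal parsing sketch. It is not the script actually used for the xxx_error1.txt files. It assumes all four fields are delimited by the separator character, that comment lines start with a semicolon and belong to the preceding correction, and that blank lines are ignored. The names `Correction` and `parse_error_lines` are hypothetical.

```python
from typing import List, NamedTuple

class Correction(NamedTuple):
    lcode: str            # Cologne record number
    k1: str               # headword
    old: str              # word to be corrected
    new: str              # the correction
    comments: List[str]   # comment lines that follow the correction

def parse_error_lines(lines, sep=":"):
    """Parse correction lines; the separator is a parameter so that
    another character can be tried without changing the code."""
    corrections = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # extra blank lines are allowed
        if line.startswith(";"):
            # A comment is attached to the correction line before it.
            if corrections:
                corrections[-1].comments.append(line[1:].strip())
            continue
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        corrections.append(Correction(*fields, comments=[]))
    return corrections
```

Note that the Devanagari visarga (U+0903) is a different character from the ASCII colon, so splitting on ":" does not break a word ending in visarga, even though the two look alike on screen.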

change the digitization

You should also change the digitization directly (currently, for this preliminary trial, this digitization file is named skd_deva_sample.txt).

So incorporate the changes directly.
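For illustration only, a sketch of what incorporating one change amounts to. It assumes that each entry in the digitization begins with a metadata line of the form `<L>N<pc>...` (as in the `<L>2<pc>1-001<k1>अ<k2>` example earlier in this thread) and runs until the next such line. That layout and the function name `apply_correction` are assumptions, and in practice the change can simply be made by hand in a text editor.

```python
import re

META = re.compile(r"<L>([^<]+)<pc>")

def apply_correction(lines, lcode, old, new):
    """Replace `old` with `new` only inside the entry whose L-code
    is `lcode`. Returns the new lines and the number of lines changed."""
    out = []
    in_entry = False
    changed = 0
    for line in lines:
        m = META.match(line)
        if m:
            # A metadata line starts a new entry.
            in_entry = (m.group(1) == lcode)
        if in_entry and old in line:
            line = line.replace(old, new)
            changed += 1
        out.append(line)
    return out, changed
```

Restricting the replacement to one entry avoids silently altering the same spelling elsewhere in the file, where it may or may not be an error.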

funderburkjim commented 3 years ago

अभाबः -> अभावः

I am very much in favor of this change. I think @Shalu411 is experienced enough in Sanskrit to make a reliable judgment in such cases.

There are 3 situations that might have led to 'aBAbaH' in the skd digitization:

1. The scanned image clearly shows 'b' and the typist who did the digitization accurately entered 'b'.
2. The scanned image clearly shows 'v' and the typist erroneously entered 'b'.
3. The scanned image is unclear, and the typist entered 'b'.

In the present case, I would say case 2 applies.

@shalu411 If you go to the trouble of examining the scanned image, and happen to notice a case of type '1' (i.e. a case where your change definitely disagrees with the scan), then you should make a comment in skd_error.txt of the form '; scan error'.

However, I am not saying that you should examine the scanned image in every change, as this extra scanned image examination may be more time-consuming than it is worth.

funderburkjim commented 3 years ago

@Shalu411

Did you clone the SKD repository? Are you using git or Github desktop?

gasyoun commented 3 years ago

2:अ:अभाबः अभावः

Oh, these visargas that look like :.

Did you clone the SKD repository?

Not yet, she will need my help.

Github desktop

She will. Let the whole converted file come?

In the present case, I would say case 2 applies

Agree.

funderburkjim commented 3 years ago

Oh, these visargas that look like :

Good point. Maybe use '#' instead?

Let the whole converted file come?

Let's work a while with the sample file that is there. Once the procedural steps are ironed out, we can go to a full skd_deva.txt.

she will need my help.

Thanks!

gasyoun commented 3 years ago

Good point. Maybe use '#' instead?

Let us give '#' a try?

Once the procedural steps are ironed out, we can go to a full skd_deva.txt.

Sure, so be it.

gasyoun commented 3 years ago

Usha pulled an update from Github Desktop. Is it as it should be @drdhaval2785 @funderburkjim ?

https://github.com/sanskrit-lexicon/SKD/commit/e8686f3d84c6a26d47b8ed7c0b0ac217ce71224a

Shalu411 commented 3 years ago

Hariom. Hearty thanks, Mark, for the support and guidance. Jim, once you confirm, I am ready to carry on with the corrections.

funderburkjim commented 3 years ago

Usha: I confirm that you pushed properly; I can see the 1 change you made.

BUT please wait for making further changes.

I am having problems with inverting the Devanagari back to slp1, and need to get that problem ironed out. Will aim to solve this problem tomorrow. The problem relates to the candrabindu when it comes after an 'o' but is not the Om character ॐ. In slp1, o~ is supposed to represent ॐ.

But there are several instances in the skd digitization, such as ठोँट under headword aDaraH, that also have 'o~' in slp1. These are what are causing the problems at the moment.
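To make the ambiguity concrete: in Unicode, ॐ is a single code point (U+0950), while the sequence in ठोँट is the vowel sign o (U+094B) followed by candrabindu (U+0901). The sketch below only locates such "o plus candrabindu" sequences in Devanagari text. It is not the fix that was eventually adopted, and the function name is hypothetical.

```python
OM = "\u0950"           # ॐ, a single code point
CANDRABINDU = "\u0901"  # ँ
O_LETTER = "\u0913"     # ओ, independent vowel o
O_SIGN = "\u094B"       # ो, dependent vowel sign o

def find_o_candrabindu(text):
    """Return the indexes where an 'o' (letter or vowel sign) is
    immediately followed by candrabindu. The Om character never
    matches, because it is one code point of its own."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i] in (O_LETTER, O_SIGN) and text[i + 1] == CANDRABINDU
    ]
```

Running a check like this over the Devanagari file gives a list of places where a reverse transcoding that maps both forms to 'o~' cannot be inverted without extra rules.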

Shalu411 commented 3 years ago

Namaste

BUT please wait for making further changes.

Sure Jim! @drdhaval2785 Can you help with the o~ issue?

gasyoun commented 3 years ago

But there are several instances in the skd digitization, such as ठोँट under headword aDaraH, that also have 'o~' in slp1. These are what are causing the problems at the moment.

As rare as it can get. Jim, you are our fortress.

funderburkjim commented 3 years ago

As rare as it can get

This was discovered by applying a principle of invertibility. Here,

our base digitization is in SLP1 spelling : skd.txt
A conversion was made to use Devanagari spelling: skd_deva.txt
To incorporate the changes Shalu makes to skd_deva.txt, we need to convert skd_deva.txt back to skd_slp1.txt.
If NO changes were made to skd_deva.txt, the round trip skd.txt -> skd_deva.txt -> skd_slp1.txt should yield an skd_slp1.txt identical to skd.txt. The problem was noticed while investigating WHY, in an earlier version of the transcoding, skd.txt was NOT the same as skd_slp1.txt.
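The invertibility principle can be sketched as a generic line-by-line check. The `forward` and `backward` arguments stand in for the SLP1-to-Devanagari and Devanagari-to-SLP1 transcoders, which are not shown here; the function name `roundtrip_mismatches` is hypothetical.

```python
def roundtrip_mismatches(lines, forward, backward):
    """Return (line number, original, round-tripped) for every line
    where backward(forward(line)) differs from the original line."""
    mismatches = []
    for n, line in enumerate(lines, start=1):
        back = backward(forward(line))
        if back != line:
            mismatches.append((n, line, back))
    return mismatches

# Toy demonstration with a deliberately lossy pair (NOT the real
# transcoders): upper-casing discards the original case, so a line
# containing a capital letter fails the round trip.
print(roundtrip_mismatches(["abc", "aBc"], str.upper, str.lower))
```

An empty result means the conversion is invertible for that input, so any later difference between the two SLP1 files can be attributed to deliberate corrections.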

This problem now has a satisfactory solution.

You can see that skd_deva_sample was changed in two lines from what Usha had, by looking at this commit difference.

funderburkjim commented 3 years ago

@Shalu411

Ready for you to pull this repository and continue with changes to skd_deva_sample.txt.

ALSO, I made a file 'skd_error.txt' where you should document simple changes, such as the first 'aBAbaH' one you made.

By 'simple change', I mean spelling errors like 'aBAbaH'.

More complex errors (like missing a headword which you mentioned elsewhere) will need to have special handling -- meaning that probably I need to do the actual change to skd_deva_sample.txt rather than you for the complex errors.
You can describe such complex cases as comments in the skd_error.txt file.

gasyoun commented 3 years ago

Ready for you to pull this repository and continue with changes to skd_deva_sample.txt.

Good news for India.

SKD digitization (Devanagari version) #11
