sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

Some way to handle sanhw2.txt #331

Closed: drdhaval2785 closed this issue 3 years ago

drdhaval2785 commented 7 years ago

Today I updated my local copy of this repository from the online copy. sanhw1.txt had only 22 changes, whereas sanhw2.txt had 105102 changes.

With every commit, so many changes in sanhw2.txt will inflate the size of the git repository considerably, because git keeps track of all lines changed between commits. @funderburkjim may like to pay attention to this detail. Earlier (about 10 months ago or before), L-numbers were treated as almost fixed, so the changes in sanhw2.txt and sanhw1.txt were almost coterminous.

Now L-numbers change more than the headwords.

Any reason behind it, Jim?

Or maybe it has something to do with the addition of alternate headwords? I am not sure.

funderburkjim commented 7 years ago

Interesting observation.

My suspicion is that a very small number of headwords have been inserted. For instance, I know that GRA had (a few months ago) several headword insertions, due to inputs via the correction form from a particular (anonymous) user.

GRA has about 10,000 headwords.

Let's suppose that tomorrow we find that there is a headword missing after the current 1,000th headword. We submit the correction and recompute the headwords INCLUDING L-numbers for GRA; now there are 10,001 headwords and the L-numbers of 9000 headwords have changed (been incremented by 1).

That would flow through to sanhw2 as 9000 lines changing.

At the moment, I cannot think of another dictionary with recent headword insertions or deletions.

Do this 10 or 15 times for Grassmann, and that would come close to accounting for the 100,000+ changes you notice in sanhw2.

So, it's not that L-numbers are wildly changing all over the place. Rather, it is that sanhw2 amplifies the changes.
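The amplification described above can be checked with a small simulation. This is only a sketch under stated assumptions: the headword names are placeholders, every headword is treated as unique, and L-numbers are assumed to be plain sequential positions that are recomputed after each insertion. It does not read the real GRA or sanhw2 data.

```python
def number_headwords(headwords):
    """Assign sequential L-numbers (1-based) to headwords in list order."""
    return {hw: n for n, hw in enumerate(headwords, start=1)}


def count_changed_lnums(before, after):
    """Count headwords present in both lists whose L-number differs."""
    old = number_headwords(before)
    new = number_headwords(after)
    return sum(1 for hw, n in old.items() if hw in new and new[hw] != n)


# Hypothetical dictionary of 10,000 headwords (placeholder names).
original = [f"hw{i:05d}" for i in range(1, 10_001)]

# Insert one missing headword right after the current 1,000th headword.
corrected = original[:1000] + ["missing_hw"] + original[1000:]
changed = count_changed_lnums(original, corrected)
print(changed)  # 9000

# Repeat a similar insertion 12 times, renumbering after each one,
# and add up the lines that change at every step.
total = 0
current = original
for k in range(12):
    updated = current[:1000] + [f"inserted{k}"] + current[1000:]
    total += count_changed_lnums(current, updated)
    current = updated
print(total)  # 108066
```

Under these assumptions, a dozen insertions near the front of a 10,000-entry dictionary already produce more than 100,000 changed lines, even though only 12 headwords were actually added.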

sanhw2.txt has about 430,000 lines and is about 16MB in size. So the 100,000 changes are roughly 25% of the lines, or about 4MB (uncompressed). This doesn't seem like enough bytes to be particularly worried about.
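The estimate above can be reproduced as follows. The figures are the approximate ones quoted in the comment, and the calculation assumes that changed lines have the same average length as the rest of the file.

```python
total_lines = 430_000    # approximate line count of sanhw2.txt (from the comment)
file_size_mb = 16        # approximate file size in MB (from the comment)
changed_lines = 100_000  # rounded number of changed lines

# Share of lines that changed, and the bytes they represent
# if every line has roughly the average length.
fraction_changed = changed_lines / total_lines
changed_mb = fraction_changed * file_size_mb

print(f"{fraction_changed:.1%}")  # 23.3%
print(f"{changed_mb:.1f} MB")     # 3.7 MB
```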

Incidentally, I have never made use of sanhw2.

funderburkjim commented 7 years ago

@drdhaval2785 I don't know enough about Git to be able to reproduce your 105102 changes. How did you discover that number?

gasyoun commented 7 years ago

> GRA had (a few months ago) several headword insertions

So it's not finalized yet; L-numbers can still change. A pity.

> Do this 10 or 15 times for Grassmann, and that would come close to accounting for the 100,000+ changes you notice in sanhw2.

Amazing numbers.

> Incidentally, I have never made use of sanhw2.

I'll be the first one, as it contains the accents. I use them, so let the file not die.

funderburkjim commented 7 years ago

> so let the file not die.

OK.

drdhaval2785 commented 7 years ago

> @drdhaval2785 I don't know enough about Git to be able to reproduce your 105102 changes. How did you discover that number?

When I ran git pull origin master from my terminal, it gave me a summary of the lines changed in each file. That summary had these numbers.
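For anyone who wants the same per-file counts without pulling, one possible approach is to parse the output of git diff --numstat, which prints added lines, deleted lines, and the path separated by tabs. The sketch below is an illustration only: the helper names and the sample file names are made up, and the sample numbers are not taken from this repository.

```python
import subprocess


def parse_numstat(text):
    """Parse `git diff --numstat` output into {path: (added, deleted)}."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            continue  # binary files are reported with "-" instead of counts
        stats[path] = (int(added), int(deleted))
    return stats


def changed_lines_between(old_rev, new_rev, repo_dir="."):
    """Run git in repo_dir and return per-file (added, deleted) counts."""
    result = subprocess.run(
        ["git", "diff", "--numstat", old_rev, new_rev],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_numstat(result.stdout)


# Made-up sample output, for illustration only.
sample = "11\t11\tfileA.txt\n120\t118\tfileB.txt\n"
stats = parse_numstat(sample)
print(stats["fileB.txt"])  # (120, 118)
```

Right after a pull, something like changed_lines_between("ORIG_HEAD", "HEAD") would typically compare the state before and after the update. Note that git counts a modified line as one deletion plus one insertion, so the per-file total in the pull summary can be about twice the number of distinct lines that were edited.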

gasyoun commented 7 years ago

@funderburkjim did it help?

funderburkjim commented 7 years ago

@drdhaval2785 and @gasyoun

Yes, I 'git' it now :) Thanks!

gasyoun commented 7 years ago

Oh, ok. Any solution or resolution?

drdhaval2785 commented 3 years ago

Now lnums are constant thanks to meta lines. Issue closed.