tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Rename frk -> deu_latf (ISO 639-3, ISO 15924) #59

Closed stweil closed 7 months ago

stweil commented 7 months ago

See related discussion https://github.com/tesseract-ocr/tesseract/issues/4201.

I think we should rename frk to deu_latf not only here, but also in all other Tesseract repositories (langdata, tessdata, tessdata_best, tessdata_fast, tessdoc) because "frk" was never an ISO name.

stweil commented 7 months ago

@amitdo, @egorpugin, do you agree with the suggested renaming for all Tesseract repositories?

stweil commented 7 months ago

@AlexanderP, the planned renaming would also affect Debian and other distributions.

amitdo commented 7 months ago

Yes, I agree.

bertsky commented 7 months ago

I disagree with doing it this way (as a rename). Users have known frk for a long time, despite its unfortunate naming. Since there is no actual Frankish model it could conflict with, and in the interest of not breaking things for users, I suggest simply adding an alias instead of deleting the old name.

stweil commented 7 months ago

@bertsky, how would you add an alias?

I don't think that many users will experience breakage from the renaming. German Fraktur is not relevant for most users of Tesseract, and those who use it either depend on tagged models, which continue to provide frk, or can fix their workflow with a trivial update.
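For a workflow that hard-codes the model name, such an update might look like the following minimal sketch. It assumes a local tessdata directory containing either `deu_latf.traineddata` or the older `frk.traineddata`; the helper name `pick_fraktur_model` is hypothetical and not part of Tesseract.

```python
from pathlib import Path

# Preferred name first, legacy name as fallback (both names come from this thread).
CANDIDATES = ("deu_latf", "frk")


def pick_fraktur_model(tessdata_dir):
    """Return the first Fraktur model name whose .traineddata file exists.

    Hypothetical helper: Tesseract itself does not perform this fallback.
    """
    base = Path(tessdata_dir)
    for name in CANDIDATES:
        if (base / f"{name}.traineddata").is_file():
            return name
    raise FileNotFoundError(f"no Fraktur model found in {base}")
```

The returned name can then be passed wherever the workflow previously passed `frk`, for example as the value of `tesseract -l`.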

bertsky commented 7 months ago

> how would you add an alias?

I guess the simplest way would be to symlink deu_latf to frk in the repo.

egorpugin commented 7 months ago

I agree, but maybe set up a symlink first? Do we have version tags similar to those in the tesseract repo? For example, keep a link until the Tesseract 6 release, then remove the symlink and rename.

If this is too much of a burden, just rename as this PR does. That is fine.

stweil commented 7 months ago

> I guess the simplest way would be to symlink deu_latf to frk in the repo.

> I agree, but maybe to set up a symlink first?

I created the symlink from the old frk to the new deu_latf for tessdata_fast and added a note there in the README. That should be sufficient for distributions and typical users who were always encouraged to use tessdata_fast.

Advanced users who want to run training won't have much trouble replacing frk with deu_latf.

bertsky commented 7 months ago

> I created the symlink from the old frk to the new deu_latf for tessdata_fast and added a note there in the README. That should be sufficient for distributions and typical users who were always encouraged to use tessdata_fast.

I cannot see frk anymore. And the branch of this PR is already gone!

What you describe is the wrong direction for the link. I wrote "from deu_latf to frk" because that way the old URLs would still work. With the symlink, you can browse in the GitHub UI and reference the file in a checkout, but not download it directly.
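The two link directions can be tried out locally with a small sketch. It assumes a POSIX-like system where creating symlinks is permitted, uses a temporary directory as a stand-in for a checkout, and writes dummy bytes instead of a real model; the helper names `make_alias` and `describe` are hypothetical.

```python
import os
import tempfile


def make_alias(directory, real_name, alias_name, payload=b"dummy model bytes"):
    """Write real_name as a regular file and alias_name as a symlink to it."""
    real_path = os.path.join(directory, real_name)
    alias_path = os.path.join(directory, alias_name)
    with open(real_path, "wb") as handle:
        handle.write(payload)
    # Relative target, so the link stays valid wherever the checkout lives.
    os.symlink(real_name, alias_path)
    return real_path, alias_path


def describe(path):
    """Report whether path is a symlink, and what it points to if so."""
    if os.path.islink(path):
        return f"symlink -> {os.readlink(path)}"
    return "regular file"


with tempfile.TemporaryDirectory() as checkout:
    # Direction described for tessdata_fast: the new name is the file,
    # the old name is the link.
    real, alias = make_alias(checkout, "deu_latf.traineddata", "frk.traineddata")
    print("deu_latf.traineddata:", describe(real))
    print("frk.traineddata:", describe(alias))
    # Inside a checkout, opening the link follows it to the real file.
    with open(alias, "rb") as handle:
        print("read via frk:", handle.read())
```

Swapping the two file names in the `make_alias` call gives the opposite direction. In both cases the link itself stores only the target path, not the model data, so only the name that is a regular file carries the actual bytes.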

stweil commented 7 months ago

I know that you suggested a symbolic link in the different direction, but we want to promote the new name as the standard, not the old one.

bertsky commented 7 months ago

> I know that you suggested a symbolic link in the different direction, but we want to promote the new name as the standard, not the old one.

The whole point of having the symlink is to keep the old URLs working. It's not about promoting anything. In your direction, there is no point in having the symlink at all.

stweil commented 7 months ago

Downloading from the main branch is never a good idea unless you are prepared for the content or the URLs to change. Use a tagged release or a branch instead, for example 4.1.0: https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/4.1.0/frk.traineddata still works. If necessary, we can add more tags or branches.
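A script that follows this advice could build its download URL from an explicit tag or branch, as in this minimal sketch. The URL pattern is taken from the link above; the function name `raw_model_url` is hypothetical, and nothing is downloaded here.

```python
RAW_BASE = "https://raw.githubusercontent.com/tesseract-ocr"


def raw_model_url(repo, ref, lang):
    """Build a raw download URL for lang.traineddata at a given tag or branch."""
    return f"{RAW_BASE}/{repo}/{ref}/{lang}.traineddata"


# Pinned to the 4.1.0 ref mentioned above, so the old name keeps resolving.
print(raw_model_url("tessdata_fast", "4.1.0", "frk"))
```

Keeping `ref` as an explicit argument makes the pin visible in the calling code, whereas passing `main` would bring back the instability described above.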

stweil commented 7 months ago

Information on different distributions:

bertsky commented 4 months ago

@stweil what made you think that all model names in Tesseract must conform to ISO 639-3 in the first place?

What about ita_old, spa_old, kat_old, chi_tra, chi_sim, chi_sim_vert, jpn_vert, deu_frak, dan_frak and so forth?

IMO all the old names should at least be kept for backwards compatibility.