retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

[Bug/Feature]: Arabic script letters in citation keys #2403

Closed Serena-UB closed 1 year ago

Serena-UB commented 1 year ago

Debug log ID

LFZ25M2F-refs-euc

What happened?

Hi! I originally posted this on Zotero forums, and was asked to open an issue for it here.

I hope someone can help me with this:

Something strange happens to the citation keys of my Zotero entries in Arabic script. Some of the Arabic letters are automatically converted to Latin characters, but some are converted to question marks.

My citation key formula is auth+year.

An example:

The article author is هناء زايد عباس

The citation key becomes this:

hn��z�ydaab�s2020

Why are some of the Arabic characters converted and others not? If I could access the conversion table for this and modify it, that would be great.

The characters not being read are ا and ء .

github-actions[bot] commented 1 year ago

It looks like you did not upload a debug log. The debug log is important; it gives @retorquere your current BBT settings and a copy of the items under consideration as a test case so he can best replicate your issue, or build towards the desired behavior. Without it, @retorquere is effectively blind. Debug logs are useful for both analysis and for enhancement requests; in the case of export enhancements, I need the copy of the references you have in mind.

If you did try to submit a debug log, but the ID looked like D<number>, that is a Zotero debug report, which I cannot access. Please re-submit a BBT debug log by one of the methods below.

This request is much more likely than not to apply to you too @Serena-UB, even if you think it unlikely. Please trust @retorquere when he says he will usually need one; he will more often than not just end up saying "please send a debug log". Let's just skip over the unnecessary delay this entails. Sending a debug log is very easy:

  1. If your issue relates to how BBT behaves around a specific reference(s), such as citekey generation or export, select at least one of the item(s) under consideration, right-click it, and submit a BBT debug log from that popup menu. If the problem is with export, please do include a sample of what you see exported, and what you expected to see exported for these references.

  2. If the issue does not relate to references and is of a more general nature, generate a debug log by restarting Zotero with debugging enabled (Help -> Debug Output Logging -> Restart with logging enabled), reproducing your problem, and selecting "Send Better BibTeX debug report..." from the help menu.

Once done, you will see a debug ID in red. Please post that debug id in the issue here.

Thank you!

Serena-UB commented 1 year ago

LFZ25M2F-refs-euc

retorquere commented 1 year ago

Would hnAzAydEbAs be a reasonable romanization for هناء زايد عباس?

Serena-UB commented 1 year ago

In this case, the alif "ا" is rendered "A" and the ayn "ع" is rendered "E", which works well (I didn't know the citation keys were case sensitive!), but the hamza "ء" is left out.

The letter "ء" is often rendered either with Unicode character 02BE ( ʾ ), or some kind of accent or backtick. Popularly, it is also romanized with the number 2. If it should be a letter, I suggest "x", because on the Arabic keyboard, the "ء" is in the same spot as the "x" on a qwerty keyboard.

retorquere commented 1 year ago

I use jslingua for the romanization. Do you think you can discuss with the author of that library what the best choices are? I don't speak/read any Arabic.

Serena-UB commented 1 year ago

Do you mean with the author of jslingua?

retorquere commented 1 year ago

Yes. I am not in a position to judge whether the changes you want make broader sense. @kariminf will get you a much better dialogue.

Serena-UB commented 1 year ago

Ok, I will ask him!

retorquere commented 1 year ago

> The letter "ء" is often rendered either with Unicode character 02BE ( ʾ ), or some kind of accent or backtick. Popularly, it is also romanized with the number 2. If it should be a letter, I suggest "x", because on the Arabic keyboard, the "ء" is in the same spot as the "x" on a qwerty keyboard.

The issue is that ' is not a legal character in citation keys, so it gets discarded. If 2 or x is an equally valid romanization, wouldn't it be better to make that a feature of jslingua? I would prefer to keep discussions about romanization between native speakers/experts, I cannot judge what is and isn't suitable.

retorquere commented 1 year ago

I could make it hnA2zAydEbAs2020 if you think it's better, but I will have to take your word for it that this is a global preference qua romanization rather than your personal preference. I have no way to distinguish the two.

kariminf commented 1 year ago

Hi,

Using "Buckwalter" transliteration هناء زايد عباس will be transliterated as hnA' zAyd EbAs

But names are a bit different to transliterate; in this case it is transliterated as: hanaa zayed abbas (both alef and hamza are transliterated to "a")

The problem with Arabic names transliteration is there are 2 different systems: east (English-pronunciation) and west (French-pronunciation). For example, ابو بكر can be transliterated into abu bakr (east) or abou bakr (west)

I will introduce names transliteration in jslingua in the coming version

retorquere commented 1 year ago

And replacing ' with 2?

kariminf commented 1 year ago

' is replaced by another a

instead of hana' we write hanaa

Serena-UB commented 1 year ago

For most purposes transliteration is phonetically oriented, i.e. we try to represent how the words are to be read. But in this case, we need an orthographically oriented transliteration, where each character is represented by one or two Latin keyboard characters.

Serena-UB commented 1 year ago

There are many different transliteration systems for Arabic, not just one or two. زايد can be rendered Zayed or Zayyed or Zayid etc., but in this case it doesn't matter. What matters is that it is possible to anticipate what the citation key for a reference will be. You cannot program ي to convert to "ye" because that will not apply to other words. "zAyd" is therefore a more appropriate orthographical romanization in this case.
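
To illustrate what "orthographically oriented" means here, a minimal Python sketch: each Arabic letter maps to one fixed Latin character, so the result is predictable from the spelling alone. The table holds only the Buckwalter values for the four letters of زايد; `transliterate` is a hypothetical helper for illustration, not jslingua or BBT code.

```python
# Buckwalter values for the four letters of the example word only.
# Assumption: a real table would cover the whole Arabic alphabet.
TABLE = {
    "\u0632": "z",  # zay
    "\u0627": "A",  # alef
    "\u064a": "y",  # ya
    "\u062f": "d",  # dal
}

def transliterate(word: str) -> str:
    """Map each character through the table; unknown characters pass through."""
    return "".join(TABLE.get(ch, ch) for ch in word)

example = "\u0632\u0627\u064a\u062f"  # the name discussed above
print(transliterate(example))  # zAyd
```

Because no vowels are guessed, the same letter always produces the same output, whichever way the name is pronounced.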

Serena-UB commented 1 year ago

> I could make it hnA2zAydEbAs2020 if you think it's better, but I will have to take your word for it that this is a global preference qua romanization rather than your personal preference. I have no way to distinguish the two.

Most academic transliteration systems will use symbols for this letter that are not legal characters for citation keys. And using "a" for two different Arabic characters seems to be confusing and imprecise. So I'm just trying to suggest other options that make sense. I don't think there is a right or wrong answer here. I suggested "2" and "x" for the reasons I mentioned, and I think either of them would work fine.

retorquere commented 1 year ago

This is exactly why I prefer that this kind of discussion is held between native speakers/experts.

Serena-UB commented 1 year ago

Ok, so what happens now?

retorquere commented 1 year ago

You tell me -- I see these comments about different romanizations for names, and I don't know whether or how to act on these. If the current romanization I showed above suffices for you, I'm willing to add that to the next release. But they may change depending on the follow-up from @kariminf.

kariminf commented 1 year ago

I think, since the objective is just to avoid special characters such as ' by replacing them with another such as 2, a simple mapping from Buckwalter's transliteration is enough in this case. So, no need to change jslingua (even if I add names transliteration, Buckwalter will stay as it is).

There are other characters which can be considered as special : https://en.wikipedia.org/wiki/Buckwalter_transliteration

retorquere commented 1 year ago

I don't see the ' character in the Buckwalter row of the table? The only non-alphanumeric characters I see are * and $. I see ' in the Qalam row.

kariminf commented 1 year ago

Under the table, there are hamza variants in addition to alef and harakat :

retorquere commented 1 year ago

{, } and ~ are also disallowed in citation keys.
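
A rough Python illustration of why the hamza vanished from the key, assuming that only the four characters named in this thread (', {, } and ~) are disallowed; the real list may well be longer. `strip_disallowed` is a hypothetical helper, not BBT's actual code.

```python
# Characters named in this thread as not allowed in citation keys.
# Assumption: this is not necessarily the complete list.
DISALLOWED = set("'{}~")

def strip_disallowed(text: str) -> str:
    """Drop disallowed characters and whitespace from a candidate key."""
    return "".join(
        ch for ch in text if ch not in DISALLOWED and not ch.isspace()
    )

buckwalter = "hnA' zAyd EbAs"  # Buckwalter form quoted earlier in the thread
print(strip_disallowed(buckwalter) + "2020")  # hnAzAydEbAs2020
```

The apostrophe that Buckwalter uses for the lone hamza is simply discarded, which is why that letter leaves no trace in the key.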

Serena-UB commented 1 year ago

BBTCitKeysArabic.pdf

Serena-UB commented 1 year ago

I mapped the conversion table in the current version. What is it based on?

retorquere commented 1 year ago

https://www.npmjs.com/package/transliteration

retorquere commented 1 year ago

� is the Unicode replacement character (U+FFFD) and can be loosely understood as "unknown character".
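
A small Python sketch for spotting such characters in a key. The sample string reproduces the key reported at the top of this issue with the replacement characters written as escapes; `count_unknown` is a hypothetical helper for illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def count_unknown(key: str) -> int:
    """Count how many characters the transliteration could not map."""
    return key.count(REPLACEMENT)

# The citation key from the original report.
reported_key = "hn\ufffd\ufffdz\ufffdydaab\ufffds2020"
print(count_unknown(reported_key))  # 4
```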

Serena-UB commented 1 year ago

It doesn't seem easy to find a package with only allowed transliteration characters.

I don't know whether this is useful, but it shows a use of number two for versions of the letter hamza: https://github.com/amasad/arabish

u'ء': '2', u'أ': '2', u'ؤ': '2', u'إ': '2', u'ئ': '2', u'آ': '2',

retorquere commented 1 year ago

From https://github.com/amasad/arabish:

And it's not that hard! ... With a better training corpus and some simple tweaking to the rules we can get at least up to 80% accuracy of Yamli or similar services.

These two sentences seem to contradict one another. Surely if it's "not that hard", accuracy would be 100%.

Serena-UB commented 1 year ago

Ok, then what about simply using "a" for "ا" and ignoring all versions of "ء" so that they are converted to nothing? The only problem for researchers working with Arabic texts is the replacement characters.

retorquere commented 1 year ago

I don't know what about it -- as I said, I have no idea what is reasonable/desirable here.

Serena-UB commented 1 year ago

All these suggestions are reasonable and desirable, but there will always be different opinions. Use "a" for both as @kariminf suggested. That is what you meant, right @kariminf ?

retorquere commented 1 year ago

> All these suggestions are reasonable and desirable, but there will always be different opinions.

If there will always be different opinions, I'm stuck negotiating between these opinions about a matter of which I understand nothing. But if the two of you can reach a consensus, I'll take that for now, with the proviso that if a different opinion swings by, I may change the implementation.

retorquere commented 1 year ago

> ignoring all versions of "ء"

What does "all versions" mean here? Is it just "all occurrences" or are there indeed different versions of this character?

> Under the table, there are hamza variants in addition to alef and harakat :
>
> * lone hamza: '

I have no idea what hamza, alef and harakat mean.

If you guys can get me a simple mapping table to pre-apply before passing the content through jslingua and/or one that I can apply after the string is passed through jslingua, that is fine, but it can't depend on me understanding anything of what is going on in the romanization.

kariminf commented 1 year ago

I still don't understand what is the problem here. Do you want a transliteration which is based on : Phonetics or Orthography ?

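As I understood, you want just an orthography-oriented transliteration. In this case, I suggest a simple solution:

* Use jslingua (Buckwalter) to transliterate Arabic letters into Latin; then

* Use the following python dictionary to map into a semi-phonetically oriented transliteration

{
    "'": "a",
    ">": "a",
    "<": "i",
    "&": "u",
    "}": "i",
    "|": "a",
    "{": "a",
    "`": "a",
}

retorquere commented 1 year ago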

> I still don't understand what is the problem here. Do you want a transliteration which is based on : Phonetics or Orthography ?

I don't even know what this means. I'm trying to negotiate a solution that makes sense to people who speak the language (which is not me).

> As I understood, you want just an orthography-oriented transliteration. In this case, I suggest a simple solution :
>
> * Use jslingua (Buckwalter) to transliterate Arabic letters into Latin; then
>
> * Use the following python dictionary to map into a semi-phonetically oriented transliteration

"you" would be Arabic-speaking users of Zotero (which is not me), but this I can do. @Serena-UB, a new build will drop here in 10 minutes or so.

retorquere commented 1 year ago

@Serena-UB the citekey would come out as hnAazAydEbAs2020.
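
For anyone following along, the two steps can be sketched in Python as below. This is only an illustration under assumptions, not the code that went into the build: the Buckwalter string is the one quoted earlier in this thread (jslingua itself is not called here), the dictionary is the one suggested above, and `post_process` and `make_key` are hypothetical names.

```python
# Mapping suggested earlier in this thread: Buckwalter symbols for
# hamza/alef variants are replaced by plain Latin letters.
POST_MAP = {
    "'": "a",
    ">": "a",
    "<": "i",
    "&": "u",
    "}": "i",
    "|": "a",
    "{": "a",
    "`": "a",
}

def post_process(buckwalter: str) -> str:
    """Replace each mapped symbol; leave every other character untouched."""
    return "".join(POST_MAP.get(ch, ch) for ch in buckwalter)

def make_key(buckwalter_name: str, year: int) -> str:
    """Post-process the name, drop spaces, and append the year."""
    return post_process(buckwalter_name).replace(" ", "") + str(year)

# Buckwalter output for the author name, as quoted in this thread.
print(make_key("hnA' zAyd EbAs", 2020))  # hnAazAydEbAs2020
```

Doing the replacement before any disallowed characters are stripped is what keeps the hamza visible in the key.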

Serena-UB commented 1 year ago

Perfect, thank you both!

Serena-UB commented 1 year ago

But what about * and $ in Buckwalter - will they work in citation keys?

github-actions[bot] commented 1 year ago

:robot: this is your friendly neighborhood build bot announcing test build 6.7.54.3783 ("post-processing of jslingua")

Install in Zotero by downloading test build 6.7.54.3783, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

retorquere commented 1 year ago

> But what about * and $ in Buckwalter - will they work in citation keys?

Yes, those will work.

Serena-UB commented 1 year ago

I downloaded it, but strangely it works only for the one reference, not for other references. Am I doing something wrong?

Serena-UB commented 1 year ago

I've been trying in various ways. The new transliteration now works only for references I download to Zotero through chrome, but not if I create a new reference and type in the author name myself.

Serena-UB commented 1 year ago

What could be the reason for that?

retorquere commented 1 year ago

> I've been trying in various ways. The new transliteration now works only for references I download to Zotero through chrome, but not if I create a new reference and type in the author name myself.

That is strange. Please turn on debug logging in the Help menu, create such a new reference again, and then right-click that item and send a debug log.

Serena-UB commented 1 year ago

Yes, I tried refreshing the keys.

If I download a reference, it works, even if I change the author name and retype it. But if I create a new reference, I get the old transliteration system.

I already submitted a new debug log for it: STR2X8IX-refs-euc

retorquere commented 1 year ago

The item doesn't have a language set. For implicit romanization, you need to set the language. You can also explicitly do auth.transliterate(ar) + year.

Serena-UB commented 1 year ago

Ah, I see, it works now. Thanks!