nlplab / brat

brat rapid annotation tool (brat) - for all your textual annotation needs
http://brat.nlplab.org

Support to Annotate Arabic #774

Open ghost opened 12 years ago

ghost commented 12 years ago

This issue is to discuss the current shortcomings regarding Arabic script and how/if they can be resolved given our current architecture.

Emad Mohamed mentioned on the corpora mailing list that they can use the ASCII Buckwalter transliteration for Arabic but that it is sub-optimal. We really need a native speaker to help out with this, but at least from what I could read on CPAN it looks like a dreadful hack to get Arabic into ASCII.

According to @amadanmath, the following should be an issue:

<head>
    <meta charset="utf-8"/>
</head>

<body dir="rtl">
    <p>
        <span dir="rtl">
            لغة إنجليزية <span dir="ltr">English</span> لغة إنجليزية
        </span>
    </p>

    <p>
        <span dir="rtl">
            لغة إنجليزية English لغة إنجليزية
        </span>
    </p>
</body>

But it appears that at least Firefox renders both the same and handles the English portion correctly.

From talking to one of the attendees at EACL 2012, tokenisation may also become an issue. For this we could use an approach similar to the one we already use for Japanese and incorporate a morphological analyser to find the start and end of the "tokens".

Here is one I found after some minor Googling:

https://github.com/mosta/raramorph

fsalotaibi commented 12 years ago

Hi there,

I'm working in Arabic NLP and am very interested in helping to get this tool to support Arabic.

I believe that if this happens, the tool will get many citations, as research on Arabic NLP is flourishing these days and has become important.

Regarding supporting the ASCII transliteration (Buckwalter) instead of the actual Arabic glyphs, I believe this is not a good choice. As you know, the transliteration is hard to read, especially for someone working on an annotation task.

The optimal choice is to support RTL with Arabic glyphs.

Please feel free to contact me, as I'm happy to be involved.

Fahd

spyysalo commented 12 years ago

Hi @fsalotaibi,

Thanks for your interest in brat! We're happy to welcome any contribution to Arabic support in brat, and would much appreciate your help on this feature.

For rendering the actual glyphs in brat, as a first step, we would need to know how to create an SVG document with Arabic that renders correctly in at least some major browser. If you can look into this, it would be very helpful if you could try exporting an SVG with Arabic from brat (from Data->Visualization->SVG) and see if you can edit it to render correctly.

amadanmath commented 12 years ago

"But it appears that at least Firefox renders both the same and handles the English portion correctly."

Yeah, Firefox might do the right thing when rendering HTML. However, note that we're laying out each word separately by drawing it onto the SVG canvas; so we do not have access to Firefox's heuristics. We need to know the order of the spans. So in the case of English language لغة إنجليزية, the linear order (in the text file) is

1) لغة (language) 2) إنجليزية (English) 3) English 4) language

and it should also be the order in which the elements are set in SVG (for copy/paste purposes); but the coordinates on the screen (and ultimately the visual effect) need to be (seen from left to right):

3) English 4) language 2) إنجليزية (English) 1) لغة (language)

spyysalo commented 12 years ago

@amadanmath : do I understand correctly that this last issue you mention is that it would be necessary to reverse the RTL order for parts of the document that do not use Arabic glyphs? If we were to assume that there are no such strings (i.e. everything is RTL) or that the text input has already reversed these appropriately, would this substantially ease the task?

amadanmath commented 12 years ago

Yes, I suppose that's what I'm saying. Note that for the copy-paste to work properly you'd need to make sure that only the coordinates are reshuffled, but the order in which they're put into SVG is not. I believe a good algorithm might be: lay all chunks out as they appear (showing them RTL); then find sequences of LTR chunks in the same row, and recalculate their coordinates so that they appear in the reverse order, without changing anything else.
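
The algorithm proposed above can be sketched roughly as follows. This is only an illustration in Python, not brat's client code (which is JavaScript); `Chunk`, `layout_rtl_row`, the widths and the gap are hypothetical names and values, and the sketch assumes each chunk is already known to be LTR or RTL and that all chunks fit in one row.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    width: float
    is_ltr: bool
    x: float = 0.0  # left edge on screen, filled in by the layout


def layout_rtl_row(chunks, right_edge, gap=5.0):
    """Lay out one row right to left, then flip each run of LTR chunks.

    Only the x coordinates change; the list order (the order in which
    elements would be written to the SVG) stays the same.
    """
    # Step 1: place every chunk right to left, in logical order.
    cursor = right_edge
    for chunk in chunks:
        chunk.x = cursor - chunk.width
        cursor = chunk.x - gap

    # Step 2: find maximal runs of consecutive LTR chunks and
    # redistribute them left to right inside the span they occupy.
    i = 0
    while i < len(chunks):
        if not chunks[i].is_ltr:
            i += 1
            continue
        j = i
        while j < len(chunks) and chunks[j].is_ltr:
            j += 1
        run = chunks[i:j]
        left = run[-1].x  # leftmost point of the span taken by the run
        for chunk in run:
            chunk.x = left
            left += chunk.width + gap
        i = j
    return chunks


row = layout_rtl_row(
    [
        Chunk("لغة", 30.0, False),
        Chunk("إنجليزية", 60.0, False),
        Chunk("English", 50.0, True),
        Chunk("language", 70.0, True),
    ],
    right_edge=300.0,
)
for chunk in sorted(row, key=lambda c: c.x):
    print(chunk.x, chunk.text)
```

Sorting by `x` gives the visual order seen from left to right, while `row` itself keeps the linear order of the text file.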

Obviously if there's no RTL text, the task is easier. Still not easy, since we have a bunch of places where the assumption is LTR. Also, I'm still not 100% convinced I'd know how to tell LTR chunks from RTL ones.

It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid.
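
One possible way to tell LTR chunks from RTL ones is to look at the Unicode bidirectional class of each character. The sketch below uses Python's standard `unicodedata` module; `chunk_direction` is a hypothetical helper, and treating digits as left-to-right is a simplification of the Unicode Bidirectional Algorithm, not what a browser does in full.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # strong right-to-left (Hebrew, Arabic letters)
LTR_CLASSES = {"L", "EN", "AN"}  # strong left-to-right, plus digits (simplification)


def chunk_direction(chunk):
    """Return 'rtl', 'ltr', 'mixed' or 'neutral' for a chunk of text."""
    classes = {unicodedata.bidirectional(ch) for ch in chunk}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = bool(classes & LTR_CLASSES)
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"  # only punctuation, whitespace, etc.


for sample in ["لغة", "English", "كربون-12"]:
    print(sample, chunk_direction(sample))
```

Under these assumptions the "carbon-12" chunk comes out as `mixed`, which is the hybrid case described above.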

spyysalo commented 12 years ago

Even though I don't really know about the client, the example you give sounds like it would take a lot of work to do right. If we want to get all that for the first iteration of Arabic support, I'm guessing it might be a while.

Could the tool still be useful for annotating Arabic if we were to assume that everything is RTL? This would get cases like English language لغة إنجليزية and "كربون-12" wrong, but perhaps it would still be better to have "mostly OK" support now rather than perfect support much later? (Comments from someone with an understanding of the frequency of these types of cases would be much appreciated!)

fsalotaibi commented 12 years ago

"It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid."

What researchers do when they want to annotate a piece of Arabic text is tokenize it first as a preprocessing step, so it is not brat's duty to take care of proper tokenization. I believe no one will try to annotate something like "بون-1" ("bon-1") inside "كربون-12" ("carbon-12").

The word order of mixed Arabic and English is handled perfectly by Microsoft software such as Word. We could take inspiration from the same algorithm. But I'll give you a simple statistic that may convince you. Taking the Arabic Wikipedia as a case study, I found that 85.3% of the tokens are Arabic words, 1.2% of the tokens are English words, 2.71% of the tokens are numbers, and 10.83% of the tokens are symbols.

So the mix of RTL and LTR would not be that serious for now (although supporting it would be very valuable), as the total number of English words is very small. I'm not sure about the numbers and symbols.
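
For anyone wanting to check similar proportions on their own corpus, a rough sketch follows. It is not how the figures above were computed; `token_category` and `category_shares` are hypothetical helpers, tokens are split on whitespace only, and any token containing an Arabic letter counts as Arabic.

```python
import unicodedata
from collections import Counter


def token_category(token):
    """Classify a token as 'number', 'arabic', 'ltr_word' or 'symbol'."""
    if token.isdecimal():
        return "number"
    classes = {unicodedata.bidirectional(ch) for ch in token}
    if "AL" in classes:  # Arabic letters
        return "arabic"
    if "L" in classes:   # strong left-to-right letters, e.g. Latin
        return "ltr_word"
    return "symbol"


def category_shares(text):
    """Return the percentage of whitespace-separated tokens per category."""
    counts = Counter(token_category(tok) for tok in text.split())
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}


print(category_shares("لغة إنجليزية English 12 ."))
```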

If you go with the option of assuming everything is RTL, I'm happy to test it and give you feedback on the pros and cons.

Fahd

spyysalo commented 12 years ago

@fsalotaibi : thank you for the information and statistics! I believe it should make the initial implementation much easier if we can make the assumptions that 1) the text is pre-tokenized and 2) everything is RTL. @amadanmath : what would implementing this require on the server side?

fsalotaibi commented 12 years ago

I hope I'm not being annoying, but is there any news about supporting Arabic? I'm involved with others in building an Arabic NE corpus, and we are planning to start annotating in two weeks' time. I really support using this nice tool because of its functionality, and team members are still waiting for it as well. I'm afraid time will be the issue.

I believe supporting an RTL language like this would be very good for the tool's reputation.

I tried to modify the code, but I got stuck trying to understand how the glyph positions are calculated so that I could switch them to RTL. Can anyone point me to the right piece of code so I can try?

spyysalo commented 12 years ago

@fsalotaibi : not annoying at all, thanks for reminding us! We have a few other features prioritized right now, but if you're willing to have a look at the code, we'd be happy to help.

@amadanmath : could you provide some pointers on what would need to be changed to make this happen?

fsalotaibi commented 12 years ago

@spyysalo: Thank you, I'm trying my best to understand how this could be done. It seems brat is a big project to understand in a short time. I only have two weeks before the annotation project starts, and I still support this tool within my team.

@amadanmath : I worked on a prototype to illustrate what is needed to support Arabic:

  1. brat already supports UTF-8, so the characters are fully supported.
  2. The only problem is the direction in which the text and the annotation tags, including the arcs, are displayed. The current output when displaying Arabic sentences: http://i50.tinypic.com/1slf7s.png By the way, I tested this on Chrome, Safari and Firefox. The best display I got is with Firefox, even though it is not fully supported by brat. Look at how this looks on Chrome and Safari: http://i48.tinypic.com/2wbvww7.png It is completely overlapping.

The desired and proper way is shown in the following prototype: http://i48.tinypic.com/2r3kta8.png

As you can see:

  1. The box is all in RTL.
  2. Sentence numbers, i.e. row numbers, are on the right.
  3. The token, i.e. word, and the tag are aligned properly.
  4. The arc direction is from right to left.

This is what we need at this stage. I'm not sure how difficult this work is. As I said earlier, I'm very happy to evaluate this work while the support is being developed. I'm really excited about getting this tool to support Arabic. I believe this will open many doors for other researchers.

spyysalo commented 12 years ago

@fsalotaibi : thank you for your efforts on this! I'm afraid I can't help myself on the technical aspects as I don't know the relevant part of the client code, but hopefully @amadanmath can. I agree this would be a valuable feature to have.

For ease of reference, I'm placing your screenshots inline here (click on "GitHub Flavored Markdown" in the comment form for syntax):

The current output when displaying Arabic sentences:

how this looks on Chrome and Safari:

The desired and proper way is shown in the following prototype:

amadanmath commented 12 years ago

Okay, some quick pointers:

If you look at client/src/visualizer.js, you will find the function renderDataReal. It is rather huge, and does the layout.

In it you will find the variable currentX. It starts a little past the left edge (leaving room for the sentence number), and is used to position the next chunk. There is a check for whether the current chunk has overflowed the right margin and needs to be put into a new row; if so, currentX is reset to the start of the next row.

As the first step, these procedures would need to be reversed: if an RTL language is rendered, start at the right edge (leaving space for the sentence number), decrease currentX, and check if it falls below the left margin.
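
To make the mirrored bookkeeping concrete, here is a small sketch written in Python for brevity. It is not the code in visualizer.js: `layout_chunks`, `margin`, `gap` and the numbers are hypothetical, and only `current_x` corresponds loosely to the currentX variable mentioned above.

```python
def layout_chunks(widths, canvas_width, margin, gap, rtl=False):
    """Return a (row, left_x) pair for each chunk width.

    `margin` is the room reserved for the sentence number: on the left
    for LTR, on the right for RTL.
    """
    positions = []
    row = 0
    start = canvas_width - margin if rtl else margin
    current_x = start
    for width in widths:
        if rtl:
            overflow = current_x - width < 0          # past the left edge
        else:
            overflow = current_x + width > canvas_width  # past the right edge
        # Start a new row on overflow, unless the row is still empty.
        if overflow and current_x != start:
            row += 1
            current_x = start
        if rtl:
            positions.append((row, current_x - width))
            current_x -= width + gap
        else:
            positions.append((row, current_x))
            current_x += width + gap
    return positions


print(layout_chunks([40, 40, 40], canvas_width=120, margin=20, gap=5))
print(layout_chunks([40, 40, 40], canvas_width=120, margin=20, gap=5, rtl=True))
```

With these toy numbers the RTL positions are the mirror image of the LTR ones, and the third chunk wraps to a second row in both modes.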

I don't know what getStartPositionOfChar and similar functions return for RTL languages (positive or negative numbers? where is the origin?) but you will likely need to also mess with the function getTextAndSpanTextMeasurements, which calculates at which point spans start and finish inside their chunks. Also depending on where the origin is, you might need to change the calculation of the position of the span boxes... And places where it says fragment.right or similar, they would actually need to point at the left side...

There's a bunch of things I am skipping over here, as the visualisation part is quite complex.

spyysalo commented 11 years ago

Hi @fsalotaibi : I chanced on https://www.odesk.com/o/jobs/job/Modifying-Javascript-canvas-GUI_~~fb065ce0129fa79c/, which suggests that you found a way to implement Arabic support. Great! Would you be prepared to consider contributing the implementation of this feature back to brat, so that others in the user community could also benefit from it?

ghost commented 11 years ago

I had no idea that Unicode had LTR and RTL features, so I will leave this link here for future reference even though using it is discouraged: http://www.w3.org/International/questions/qa-bidi-controls
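
For reference, a short sketch of what these control characters look like in practice. `embed_rtl` and `strip_bidi_controls` are hypothetical helpers; the code points themselves are standard Unicode. Note that adding such characters to a text changes its character offsets, which matters for standoff annotation.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Marks, embeddings, overrides and isolates, plus ARABIC LETTER MARK.
BIDI_CONTROLS = (
    {"\u200e", "\u200f", "\u061c"}
    | {chr(cp) for cp in range(0x202A, 0x202F)}
    | {chr(cp) for cp in range(0x2066, 0x206A)}
)


def embed_rtl(text):
    """Wrap text in an explicit right-to-left embedding."""
    return RLE + text + PDF


def strip_bidi_controls(text):
    """Remove bidi control characters, restoring the original offsets."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


wrapped = embed_rtl("كربون-12")
print(len("كربون-12"), len(wrapped))
print([unicodedata.name(ch) for ch in (LRM, RLM, RLE, PDF)])
```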

FatimahNLP commented 9 years ago

Hello all. I need urgent help. Does the brat tool support Arabic labeling? My project needs an Arabic annotation tool. If yes, please tell me the steps to support Arabic language labeling in brat.

spyysalo commented 9 years ago

No explicit support has been implemented, but from some recent discussion on the mailing list it appears that it is possible to use brat to annotate Arabic using recent versions of Firefox.

FatimahNLP commented 9 years ago

Thanks, @spyysalo. Who can help me add labeling and annotation in Arabic?

ghost commented 9 years ago

As relevant as this is, I don't see it happening before v1.4.

spyysalo commented 9 years ago

As discussed on the list recently, there has been some success annotating Arabic on recent versions of Firefox. We might wish to document the conditions for making this work.

fsalotaibi commented 9 years ago

That was a very long time ago. We successfully managed to apply right-to-left (RTL) rendering to brat. Please see this example of Arabic (RTL) text: http://www.ebsar.com/brat/#/FGANER/109-out

The modification is part of our project and it is still not released to the public. Meanwhile, anyone who wants to use brat on our server, please don't hesitate to contact me on fahd_alotaibi(AT)hotmail.com, we may be able to give you such access to use it online to tag Arabic text.

Please use either Google Chrome or Firefox to get the correct rendering result. (Internet Explorer is not supported.)

icycandy commented 9 years ago

@fsalotaibi Do you have any plans to release it to the public? I have some Arabic text to annotate, and currently Excel is used.

FatimahNLP commented 9 years ago

Thanks very much, fsalotaibi and icycandy, I appreciate your help. I need the steps to let brat accept text from right to left, i.e. the steps to annotate Arabic text using brat. Thanks in advance.

reckart commented 9 years ago

We have added experimental support for right-to-left text to WebAnno now. To this end, I have patched the brat JavaScript files that we use in WebAnno to support an LTR and an RTL mode. The changes are all conspicuously marked and should be reasonably easy to transfer back into brat.

In particular, the changes do

Some functionalities may not have been fixed for RTL because we don't use them in WebAnno.

Also, there are some known issues, e.g.:

Anybody interested in integrating this back into brat?

https://github.com/webanno/webanno/blob/2.2.x/webanno-brat/src/main/java/de/tudarmstadt/ukp/clarin/webanno/brat/resource/visualizer.js

ghost commented 9 years ago

@reckart: Cool! We are certainly interested. @amadanmath: When you have the time, could you have a look at putting this into a branch?

lcrist commented 9 years ago

Hi, just wondered if there's been any activity or a timeline for inclusion of RTL-enabled brat?

spyysalo commented 9 years ago

@amadanmath : could you please have a look at https://github.com/nlplab/brat/issues/774#issuecomment-116701423 and #1150?

amadanmath commented 8 years ago

Sorry it took me forever to address this; WebAnno changes backported to brat. Thank you, @reckart.

It is committed to the branch feature-rtl; if anyone wants to test it, please do (I can't test it properly as I can't read any RTL languages).

You will need to include the following in the visual.conf:

[options]
Text direction:rtl

amadanmath commented 8 years ago

Seems it bugs a bit on mixed directionality text -- try selecting half of the abbreviation and half of the neighbouring Arabic word:

والحلقة الناقصة كانت دمج برنامج CYPNET الذي ينقل الملفات، ببرنامج SNDMSG الذي يكتب الرسالة، وكان نتاج هذا الاندماج البريد الإلكتروني.

reckart commented 8 years ago

Fabulous!

Well, yes, mixed tokens are a known issue in our code. I hope that sharing the code between brat and WebAnno increases the chance that somebody picks up the baton and addresses the remaining issues and that both projects can profit from this.

See also: https://github.com/webanno/webanno/issues/49

reckart commented 8 years ago

I finally found some non-trivially annotated RTL data (in Hebrew) which shows that the RTL layout doesn't push out the labels sufficiently. This needs some improvement. Cf. https://github.com/webanno/webanno/issues/273

reckart commented 8 years ago

@amadanmath if you have any hot pointers where to look regarding fixing the "pushing", would be great!

reckart commented 8 years ago

Looks like a general layout problem with wide labels, not limited to the RTL layout or RTL glyphs.

reckart commented 8 years ago

Managed to fix the layout issue ;) webanno/webanno#273

reckart commented 8 years ago

You might find this also interesting: https://github.com/webanno/webanno/issues/265#issuecomment-220464561

amadanmath commented 8 years ago

Merged into master branch now.

reckart commented 8 years ago

I should mention that there have been more improvements to RTL mode in WebAnno, also some issues still open to be resolved:

https://github.com/webanno/webanno/issues?utf8=✓&q=is%3Aissue%20label%3ARTL