saulalbert / unixclan

Utility scripts for TalkBank's CLAN

Convert CHAT overlap markers to CAlite overlap markers #3

Closed saulalbert closed 6 years ago

saulalbert commented 6 years ago

CHAT uses angle brackets < > around the overlapped terms on each line, then a square-bracket marker to indicate that the overlap is 'to follow' [>] or, on the next line, that it has just occurred [<].

Here is an example: the words 'and' and 'yes' are overlapped

20 PS006: er saw Mary and Andrew <and> [>] .
21 PS002: <yes> [<] lovely .

We want this to be converted into CAlite format:

20 PS006: er saw Mary and Andrew ⌈and⌉ .
21 PS002: ⌊yes⌋ lovely .
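For illustration, the conversion described above could be sketched in Python roughly as follows. This is only a sketch, not the script discussed in this issue: it assumes the overlapped words are wrapped in angle brackets and followed by [>] or [<] (for example `<and> [>]`), and the function name `chat_to_calite` is made up for the example. A bare [>] or [<] with no angle-bracketed span in front of it is left untouched.

```python
import re

# Assumed CHAT form: <overlapped words> [>] on the first line,
# <overlapped words> [<] on the following line.
TOP = re.compile(r"<([^<>]+)>\s*\[>\]")
BOTTOM = re.compile(r"<([^<>]+)>\s*\[<\]")


def chat_to_calite(line: str) -> str:
    """Replace CHAT overlap markers with CAlite ceiling/floor brackets."""
    # <words> [>]  ->  ⌈words⌉  (overlap that is about to be matched below)
    line = TOP.sub(lambda m: "⌈" + m.group(1).strip() + "⌉", line)
    # <words> [<]  ->  ⌊words⌋  (overlap matching the line above)
    line = BOTTOM.sub(lambda m: "⌊" + m.group(1).strip() + "⌋", line)
    return line


if __name__ == "__main__":
    for chat_line in (
        "20 PS006: er saw Mary and Andrew <and> [>] .",
        "21 PS002: <yes> [<] lovely .",
    ):
        print(chat_to_calite(chat_line))
```

Markers that are not part of an overlap pair would need extra handling that this sketch does not attempt.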

There is already an INDENT script in the lib folder that will do the indentation to make the two overlapping terms line up.
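As a rough picture of what lining up the two overlapping terms means, here is a small Python sketch. It is not how the INDENT script works (that is not described here); it simply assumes a monospaced display, one overlap per line, and a hypothetical helper name `align_overlap`, and pads the second line so that ⌊ sits in the same column as ⌈ on the line above.

```python
def align_overlap(top: str, bottom: str) -> str:
    """Pad `bottom` with spaces so its ⌊ lines up under the ⌈ in `top`."""
    top_col = top.find("⌈")
    bottom_col = bottom.find("⌊")
    # Nothing to do if either marker is missing or the line is already far enough right.
    if top_col == -1 or bottom_col == -1 or bottom_col >= top_col:
        return bottom
    padding = " " * (top_col - bottom_col)
    return bottom[:bottom_col] + padding + bottom[bottom_col:]


if __name__ == "__main__":
    top = "20 PS006: er saw Mary and Andrew ⌈and⌉ ."
    bottom = "21 PS002: ⌊yes⌋ lovely ."
    print(top)
    print(align_overlap(top, bottom))
```

Because the padding is counted from the start of the line, speaker IDs of different lengths shift the columns, which is where alignment can go wrong.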

mumair01 commented 6 years ago

I've written a script that accurately converts and indents, overcoming the edge case we discussed. However, some rare, similar-looking markers may be accidentally converted along with comments. These markers are not part of an overlap pair and are found mostly where a third person joins the speech. We should discuss this one remaining issue.

saulalbert commented 6 years ago

Ok - this sounds good. We can discuss the outstanding issue next week and make sure the solution is robust enough to deal with most cases.

mumair01 commented 6 years ago

After reviewing the script I wrote, I decided to re-evaluate my approach. The new script is much more accurate and can deal with many more cases. There are still two very specific edge cases that I can't solve. Also, what is the length of a typical speaker ID? I ask because differences in speaker ID lengths are what is causing the issues.

saulalbert commented 6 years ago

Sounds good! Re: speaker ID lengths, they're between 5 and 7 characters in the CABNC. Typically a CHAT transcript has three-letter speaker names, but the speaker IDs are longer in the CABNC.

Would it help to shorten / standardize them first when converting to CA from CHAT?

We could talk about making that work.
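One possible way to standardize the IDs, sketched in Python under some assumptions: lines look like `20 PS006: text` as in the example above, the IDs are padded to a fixed width rather than shortened, and the default width of 7 is taken from the CABNC range mentioned here. The name `pad_speaker_id` is hypothetical.

```python
import re

# Assumed line shape: "<line number> <speaker ID>: <utterance>"
LINE = re.compile(r"^(\d+)\s+([^\s:]+):\s*(.*)$")


def pad_speaker_id(line: str, width: int = 7) -> str:
    """Pad the speaker ID (plus its colon) so utterances start in the same column."""
    match = LINE.match(line)
    if match is None:
        return line  # leave lines that do not fit the assumed shape unchanged
    number, speaker, text = match.groups()
    return f"{number} {(speaker + ':').ljust(width + 1)} {text}"


if __name__ == "__main__":
    print(pad_speaker_id("20 PS006: er saw Mary"))
    print(pad_speaker_id("21 PS0002X: lovely ."))
```

This only equalizes the ID column; line numbers with different digit counts would still shift the text and would need the same treatment.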

mumair01 commented 6 years ago

Besides some slight indentation issues caused by different speaker ID lengths, the script works. We can talk about standardizing ID length to completely avoid the issue.

saulalbert commented 6 years ago

Amazing work Umair! Very very nice. I'm looking forward to hearing you present it in the meeting tomorrow. Exciting stuff!