Open alecristia opened 6 years ago
I'm adding some errors that I noticed previously:
And some thoughts to improve the way we're managing all the different formats.
This is terrific! Could you please do a few online searches before you start writing all of that? Just to check if someone has already done this. If the search turns up empty, I can drop a line to two people who might know before we put in any more effort.
Will we enforce the 10th "SLAT" column of RTTM, or make it optional? I'm not sure our tools treat it consistently, but they should probably accept both 9 and 10 fields per line.
Yep, indeed. It was not consistent, and that led to some errors during the evaluation process. I just changed those problematic lines: I now access the RTTM fields in a different way (as long as we can get the t_beg, duration and filename, it works).
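For illustration, here is a minimal sketch of that kind of tolerant field access. It is not the code that was actually committed; the names `Segment` and `parse_rttm_line` are hypothetical, and it assumes the line is a timed record (a record whose t_beg or duration is `<NA>` would need separate handling).

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """The three values the evaluation needs (hypothetical container)."""
    filename: str
    tbeg: float
    tdur: float


def parse_rttm_line(line: str) -> Segment:
    """Extract file name, onset and duration from one timed RTTM line."""
    # str.split() with no argument splits on any run of spaces or tabs.
    fields = line.split()
    # Accept lines with or without the optional 10th column.
    if len(fields) not in (9, 10):
        raise ValueError(f"expected 9 or 10 fields, got {len(fields)}: {line!r}")
    # Positions 1, 3 and 4 hold the file name, onset and duration.
    return Segment(filename=fields[1], tbeg=float(fields[3]), tdur=float(fields[4]))
```

Because only positions 1, 3 and 4 are read, the presence or absence of the 10th column does not change the result.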
Update: After digging into the RTTM documentation, we discovered that RTTM files are indeed whitespace-separated. Here is a useful example that is easier to understand than the doc:
SPEAKER YOUR_AUDIO_FILENAME 1 5.87 0.370 <NA> <NA> spkr1 <NA>
LEXEME YOUR_AUDIO_FILENAME 1 5.87 0.370 ஹலோ lex spkr1 0.5
SPEAKER YOUR_AUDIO_FILENAME 1 8.78 2.380 <NA> <NA> spkr1 <NA>
LEXEME YOUR_AUDIO_FILENAME 1 8.78 0.300 உம்ம் lex spkr1 0.5
LEXEME YOUR_AUDIO_FILENAME 1 9.08 0.480 அதான் lex spkr1 0.5
LEXEME YOUR_AUDIO_FILENAME 1 9.56 0.510 சரியான lex spkr1 0.5
LEXEME YOUR_AUDIO_FILENAME 1 10.07 0.560 மெசேஜ்டா lex spkr1 0.5
LEXEME YOUR_AUDIO_FILENAME 1 10.63 0.350 சான்ஸே lex spkr1 0.5
LEXEME YOUR_AUDIO_FILENAME 1 10.98 0.180 இல்லயே lex spkr1 0.5
This shows that the transcription field causes no problem. Our (wrong) hypothesis was that, since there is a string field, the fields had to be separated by tabs, because otherwise a whitespace separator could be confused with an ordinary space inside the string. Evidently, the format specification ensures that there is no extra whitespace inside a field: every whitespace character has to be a separator.
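As a quick check of that reading, the sketch below splits one of the LEXEME lines above on whitespace and labels the pieces. The field names in `RTTM_FIELDS` follow my understanding of the NIST definition (type, file, chnl, tbeg, tdur, ortho, stype, name, conf, slat); the helper `split_rttm_fields` is hypothetical and not part of any of our tools.

```python
# Field names in RTTM column order; "slat" is the optional 10th column.
RTTM_FIELDS = ("type", "file", "chnl", "tbeg", "tdur",
               "ortho", "stype", "name", "conf", "slat")


def split_rttm_fields(line: str) -> dict:
    """Map each whitespace-separated value to its RTTM field name."""
    values = line.split()
    # zip() stops at the shorter sequence, so a 9-field line
    # simply yields a record without a "slat" key.
    return dict(zip(RTTM_FIELDS, values))


record = split_rttm_fields("LEXEME YOUR_AUDIO_FILENAME 1 5.87 0.370 ஹலோ lex spkr1 0.5")
```

Here `record["ortho"]` holds the Tamil word intact, since non-ASCII text is not whitespace and therefore never splits a field.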
I updated all of the affected repositories to standardize our use of the RTTM format. The affected repos are:
Now the output of the models should all be whitespace-separated, and the pipeline is designed to receive whitespace-separated reference RTTM files.
It's great to have RTTM formats better standardized. I know of one tool that may still produce improperly formatted RTTM: http://github.com/srvk/TALNet, which I believe uses double quotes around strings that contain spaces. Once we actually start using TALNet, we'll have to fix it, probably by stripping quotation marks from the output, and, if not in an automated fashion, then by manually modifying the .csv file containing the class names, likely replacing spaces with underscores.
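If we go the automated route, the clean-up could look roughly like the sketch below. The function name `sanitize_label` and the example labels are hypothetical (I have not checked TALNet's actual class names or output code); it only illustrates stripping the quotes and replacing spaces with underscores.

```python
def sanitize_label(label: str) -> str:
    """Make a class name safe to use as a single whitespace-free RTTM field."""
    # Drop surrounding whitespace, then surrounding double quotes,
    # then any whitespace that was inside the quotes.
    cleaned = label.strip().strip('"').strip()
    # Collapse each internal run of whitespace into one underscore.
    return "_".join(cleaned.split())


# Hypothetical labels, for illustration only.
labels = [sanitize_label(raw) for raw in ['"Male speech"', "Speech"]]
```

Labels that are already free of quotes and spaces pass through unchanged, so the function could be applied to every label without special-casing.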
I'd like to add: any RTTM our tools produce will use the Space character to separate fields, but just to err on the side of caution, maybe our tools should all accept Tab or Space, as "whitespace". I'd rather not make a hard requirement/assumption that externally obtained RTTM will conform to space as a field separator (though we could always advertise this as a requirement of the VM, and/or do simple conversion)
Yep, I agree with that. I'm just afraid that some errors may occur while converting tab-separated RTTM to space-separated RTTM (if a field itself contains a space). If we accept both tab and space, we should maybe add a sanity check: for instance, check that the number of fields is equal to 9 (or 10), no more, no less.
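A rough sketch of what such a conversion plus sanity check might look like is below. The function name `normalize_rttm_line` is hypothetical, and the policy (reject rather than repair a tab-separated line whose field contains a space) is an assumption, not something we have agreed on.

```python
def normalize_rttm_line(line: str) -> str:
    """Return the line with fields joined by single spaces, or raise ValueError."""
    stripped = line.strip()
    if "\t" in stripped:
        # Tab-separated input: split on tabs only, so that a space inside
        # a field is detected instead of silently creating an extra field.
        fields = stripped.split("\t")
        bad = [f for f in fields if f == "" or " " in f]
        if bad:
            raise ValueError(f"empty field or field containing a space: {bad!r}")
    else:
        # Space-separated input: any run of spaces is one separator.
        fields = stripped.split()
    # Sanity check: exactly 9 or 10 fields, no more, no less.
    if len(fields) not in (9, 10):
        raise ValueError(f"expected 9 or 10 fields, got {len(fields)}: {line!r}")
    return " ".join(fields)
```

Raising on a suspicious line keeps a bad conversion from reaching the evaluation step unnoticed; a tool could catch the `ValueError` and report the offending line number.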
Upon further discussion, we decided that getting the RTTM format perfect is not a priority.
When we do get to it, we should read the 2005 NIST eval plan, Appendix C, to get it right. It is reproduced next:
The Rich Transcription Time Mark (RTTM) file format (with ".rttm" filename extension) will be used for both the system output and
reference for the SPKR and SAD tasks. A separate RTTM file should be generated for each meeting in the test set. See Appendix A for a
detailed definition of the RTTM format. This appendix explains the application of the RTTM format for the SPKR and SAD tasks.
The RTTM format supports markup of a variety of metadata tasks. However, for RT-05S, only the information required for the SPKR and
SAD tasks should be provided. The RTTM file format provides two types of records related to the speakers: SPKR-INFO records and
SPEAKER records. The SPKR-INFO record for a speaker is associated with all the SPKR records for that speaker by means of matching
values in their name fields. RT-05S participants in the SPKR task will therefore need to output the following: one SPKR-INFO entry per
unique speaker in each source file followed by a SPEAKER entry for each occurrence of a given speaker in the source file.
Participants in the RT-05S evaluation can run their systems on two distinct microphone conditions: multiple distant mics (mdm) and single
distant mic (sdm). The source file name (file field in the RTTM records) to be used is the name of the microphone recording file for the sdm
condition. However, for the distant microphone conditions, the meeting ID (e.g., NIST_20020214-1148 from Section 12.2.2) is to be
used instead of the audio filename.
The SPKR-INFO records (one per speaker) associate the speaker type (adult_male, adult_female, child, or unknown) in the stype
field, with the speaker's name in the name field. There is only one SPKR-INFO record per speaker. The SPKR-INFO records are typically
all put at the beginning of the RTTM file since they have no associated timestamp. For RT-05S, the speaker type will not be evaluated so
there is no need for participants to provide a value other than “unknown” for the stype field.
The SPEAKER records give information about when a speaker is speaking. Each time the speaker (identified by its name) starts speaking,
there is a SPEAKER record that states the time when the speaker began speaking (tbeg) and how long the speaker spoke (tdur).
SPKR example for the sdm condition on a recording named NIST_20020214-1148_d05_NONE.sph:
SPKR-INFO NIST_20020214-1148_d05_NONE.sph 1
SPKR Example for the mdm condition:
SPKR-INFO NIST_20020214-1148 1
For the SAD task, only one SPKR-INFO line is required per source file regardless of how many speakers exist in a recording. The stype should be “unknown” and will not be taken into account for scoring
SAD Example for the mdm condition:
SPKR-INFO NIST_20020214-1148 1
rttm is explained in pp. 18+ here: https://web.archive.org/web/20170119114252/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf
Some errors I noticed in the instructions:
These need to be fixed in the input/output assumptions of the different tools as well