Closed yrf1 closed 1 year ago
The reference data directory needs to be structured identically to the LDC reference data releases. For a full working example, see https://github.com/usnistgov/ccu_validation_scoring/tree/master/test/reference/AlignFile_tests. This information comes from the 'doc/file_info.tab' file. Yes, the type is video/audio/text and the length is the number of seconds (for audio/video) or characters (for text). The eval code presently uses the length to create no-score regions for source material after the annotated segments in the file. Thus, we used 10000 as a placeholder when we did not have the source data to set the value. You can do the same to be expedient.
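For anyone who has to build this file by hand, here is a minimal sketch of writing a 'doc/file_info.tab' with the placeholder length described above. It makes several assumptions: that the file is tab-separated with a header row, that only the three columns the scorer selects (file_id, type, length) are needed, and that the helper name write_file_info and the example file IDs are hypothetical. Real LDC releases may carry additional columns, so check the linked AlignFile_tests example for the exact layout.

```python
import csv
from pathlib import Path

# Placeholder length, used when the source data is not available
# to compute the real duration (seconds) or size (characters).
PLACEHOLDER_LENGTH = 10000
VALID_TYPES = {"video", "audio", "text"}


def write_file_info(ref_dir, entries):
    """Write <ref_dir>/doc/file_info.tab (hypothetical helper).

    entries: iterable of (file_id, file_type, length) tuples, where
    length is seconds for audio/video, characters for text, or None
    when unknown (the placeholder is written instead).
    """
    doc_dir = Path(ref_dir) / "doc"
    doc_dir.mkdir(parents=True, exist_ok=True)
    out_path = doc_dir / "file_info.tab"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        # Assumption: tab-separated values with a header row.
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["file_id", "type", "length"])
        for file_id, file_type, length in entries:
            if file_type not in VALID_TYPES:
                raise ValueError(f"unexpected type {file_type!r} for {file_id}")
            value = PLACEHOLDER_LENGTH if length is None else length
            writer.writerow([file_id, file_type, value])
    return out_path
```

Calling write_file_info("my_reference", [("file_A", "video", None), ("file_B", "text", 1532)]) would write a length of 10000 for file_A and 1532 for file_B.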
In the latest code update, we see that the preprocess_reference_dir() function in the CCU_validation_scoring/score_submission.py file looks for data type and length in addition to the file ID, at line 685: index_df = index_df[["file_id", "type", "length"]]
What are the type and length? I can guess that type is video/audio/text, but I'm really not sure about length. At some point, I saw an LDC download in which all data files had a length of 10000 or something similar (but is that how it should be?).
I tried looking at the index files in test/reference/ to get a sense, but those reference files, which should show what the latest index files look like, have not been updated. Hence, help appreciated, thank you!