usnistgov / ccu_validation_scoring

KeyError: "['length'] not in index" in the preprocess_reference_dir() call of scorer #3

Closed: yrf1 closed this issue 1 year ago

yrf1 commented 1 year ago

In the latest code update, the preprocess_reference_dir() function in CCU_validation_scoring/score_submission.py now expects the columns type and length in addition to file_id (line 685): index_df = index_df[["file_id", "type", "length"]]
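For readers hitting the same error, here is a minimal sketch of how this kind of KeyError arises in pandas when a selected column is absent. The table contents and file IDs are made up for illustration, and the missing-column check is a hypothetical diagnostic, not part of the scorer.

```python
import pandas as pd

# Hypothetical index table in an older layout that lacks the "length" column.
index_df = pd.DataFrame(
    {"file_id": ["M01000AJ9", "M01000AJA"], "type": ["video", "text"]}
)

required = ["file_id", "type", "length"]

# Report which required columns are absent before selecting them.
missing = [col for col in required if col not in index_df.columns]
print("missing columns:", missing)

# Selecting a list of columns that includes an absent one raises KeyError.
try:
    index_df = index_df[required]
    error_message = None
except KeyError as exc:
    error_message = str(exc)
    print("KeyError:", error_message)
```

Checking for missing columns first gives a clearer message about which index file needs updating than the bare KeyError does.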

What are type and length? I can guess that type is video/audio/text, but I'm really not sure about length. At some point I saw an LDC download in which all data files had a length of 10000 or something like that (but is that how it should be?).

I tried looking at the index files in test/reference/ to get a sense, but those reference files have not been updated to show what the latest index files should look like.

Hence, help appreciated, thank you!

jfiscus commented 1 year ago

The reference data directory needs to be structured identically to the reference data LDC releases. For a full working example, see https://github.com/usnistgov/ccu_validation_scoring/tree/master/test/reference/AlignFile_tests. This information comes from the 'doc/file_info.tab' file. Yes, the type is video/audio/text and the length is the number of seconds (for audio/video) or characters (for text). The eval code presently uses the length to make no-score regions for source material after the annotated segments in the file. Thus, we used 10000 as a placeholder when we did not have the source data to set the value. You can do the same to be expedient.
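As a rough sketch of the placeholder approach described above, the following adds a length column filled with 10000 to a tab-separated table that lacks one. It assumes the table has a header row with file_id and type columns; the helper name add_placeholder_length and the sample rows are hypothetical, and the real doc/file_info.tab may contain additional columns.

```python
import csv
import io

PLACEHOLDER_LENGTH = "10000"  # placeholder value mentioned in the thread


def add_placeholder_length(tab_text, placeholder=PLACEHOLDER_LENGTH):
    """Return tab-separated text with a 'length' column, filling empty values."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or [])
    if "length" not in fieldnames:
        fieldnames.append("length")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Keep an existing length; only fill rows where it is absent or empty.
        if not row.get("length"):
            row["length"] = placeholder
        writer.writerow(row)
    return out.getvalue()


# Made-up sample rows in the assumed layout.
sample = "file_id\ttype\nM01000AJ9\tvideo\nM01000AJA\ttext\n"
patched = add_placeholder_length(sample)
print(patched)
```

Rows that already carry a length are left untouched, so running the helper on an updated file should not overwrite real values.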