nii-yamagishilab / PartialSpoof


Dimensions of segment-level labels do not match the dimensions of model output. #1

Closed ZhongJiafeng-16 closed 1 year ago

ZhongJiafeng-16 commented 1 year ago

Hello, thanks to your team for the great work and for open-sourcing this code. It has helped me a lot. I found that the segment-level labels from the PartialSpoof dataset have more dimensions than the predicted results.

For example, the sample CON_E_0000000.wav in the eval set has a segment-level label of 105 dimensions, but I got a prediction output of 103 dimensions at 20 ms temporal resolution using the multi-resolution pretrained model offered in /03mutireso/01_download_database.sh.

Could you please answer the following questions? Thank you very much!

  1. Does this mean that part of the speech data are discarded during model forward?
  2. How did the EER results in the paper deal with the differences in the dimensions of labels and model output?
zlin0 commented 1 year ago

@ZhongJiafeng-16 Hi, thank you for your interest in our work. I see: you mean that the time dimensions of the model output and the segment-level labels are mismatched.

  1. Does this mean that part of the speech data are discarded during model forward?

Yes, I discarded such mismatched parts during model training. You can find the alignment function (_prepare_segcon_target_ali) at lines 531 and 458 of model.py.
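
As a minimal sketch of this truncation-style alignment (illustrative names, not the repository's actual code), both sequences are simply cut to the length they share:

    # Illustrative sketch, not the repo's code: align prediction and label
    # by truncating both to their common length; surplus frames at the end
    # of the longer sequence are discarded.
    def align_by_truncation(pred_frames, target_frames):
        min_len = min(len(pred_frames), len(target_frames))
        return pred_frames[:min_len], target_frames[:min_len]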

  2. How did the EER results in the paper deal with the differences in the dimensions of labels and model output?

I removed these mismatched segments during EER calculation too. I aligned the predicted score and target using the above-mentioned _prepare_segcon_target_ali function. For ease of calculation, I saved the aligned [predict_score, groundtruth_target] to a pkl file and performed the measurements afterward. This process is implemented in lines 565 to 568 of models.py.
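
A hedged sketch of this save-then-measure workflow (the helper names, label convention, and EER routine below are my assumptions for illustration, not the repository's implementation):

    # Illustrative only: cache aligned scores/targets with pickle, then
    # compute a segment-level EER offline. Assumes bona fide = 1, spoof = 0,
    # and that higher scores mean "more bona fide".
    import pickle
    import numpy as np

    def save_aligned(path, predict_score, groundtruth_target):
        # both arrays are already truncated to a common length
        with open(path, "wb") as f:
            pickle.dump([predict_score, groundtruth_target], f)

    def eer_from_pkl(path):
        with open(path, "rb") as f:
            scores, targets = pickle.load(f)
        scores = np.asarray(scores, dtype=float)
        targets = np.asarray(targets, dtype=int)
        # sweep thresholds; the EER lies where the false-acceptance and
        # false-rejection rates cross
        eer, gap = 0.5, np.inf
        for t in np.unique(scores):
            pred = scores >= t
            far = pred[targets == 0].mean() if (targets == 0).any() else 0.0
            frr = (~pred)[targets == 1].mean() if (targets == 1).any() else 0.0
            if abs(far - frr) < gap:
                gap, eer = abs(far - frr), (far + frr) / 2.0
        return eer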

Honestly, the measuring process for EER in this journal paper is quite tricky, as we ideally should not discard any segments. We recently proposed a new metric called range-based EER (which takes all misclassified/missed duration into account), and the code will be released soon. In this new metric, we retain all mismatched segments and calculate the EER accordingly.
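
As a rough sketch in the spirit of that duration-based idea (the published range-based EER definition may differ; the region format and helpers below are assumptions), errors are measured as durations over (start, end) timestamps rather than counts over fixed-length segments:

    # Illustrative only: duration-based miss / false-alarm rates from
    # (start, end) timestamps in seconds; assumes the regions within each
    # list do not overlap one another.
    def overlap(a, b):
        # overlapping duration of two (start, end) intervals
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    def duration_error_rates(gt_spoof, pred_spoof, total_dur):
        spoof_dur = sum(e - s for s, e in gt_spoof)
        pred_dur = sum(e - s for s, e in pred_spoof)
        hit = sum(overlap(g, p) for g in gt_spoof for p in pred_spoof)
        miss_rate = (spoof_dur - hit) / spoof_dur if spoof_dur else 0.0
        bona_dur = total_dur - spoof_dur
        fa_rate = (pred_dur - hit) / bona_dur if bona_dur else 0.0
        return miss_rate, fa_rate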

zlin0 commented 1 year ago

Please feel free to reach out if you have any other questions or uncertainties.

ZhongJiafeng-16 commented 1 year ago

@zlin0 Thanks for the reply. It is really clear and helpful. I will continue to follow the code updates. Thanks again.

ductuantruong commented 10 months ago

Hi,

Thank you for sharing your amazing work. I am trying to develop a new codebase based on yours. I have a few questions related to this issue, and hopefully you can help me with them. My questions are as follows:

  1. Does the alignment function (_prepare_segcon_target_ali) make the predicted output and the ground-truth label the same length by cutting the mismatch from one of them at the end of the list (i.e., using [:aligned_length])?
  2. You mentioned "I saved the aligned [predict_score, groundtruth_target] to a pkl file and performed measurements afterward". I am trying to find the code that computes the EER from the pkl file in your codebase but still can't find it. Could you help me pinpoint where this code is located in your codebase?
  3. I am directly using the wav2vec2-xlsr embedding as the output for frame-level classification, which should have the 20 ms resolution mentioned in your paper. However, my predicted output (inference with batch_size = 1) is still mismatched with your 20 ms resolution ground-truth label (the mismatches are usually the label being 1 or 2 frames longer than the predicted output). Do you think this mismatch is normal?

Thank you for spending time to help us!

zlin0 commented 10 months ago

@ductuantruong thank you for your interest! Sorry for the delayed response; I was quite busy and spent a few days cleaning up my code. I've updated the code for the metrics in the folder PartialSpoof/metric. Please refer to PartialSpoof/metric/README.md for more information. (Also, apologies to @ZhongJiafeng-16 for my late update!)

As I mentioned previously, we ideally should not discard any segments. In the latest code update in PartialSpoof/metric, I retain all mismatched regions and measure performance using timestamps accordingly. Also, please download the file PS_data.tar.gz, which contains the timestamps I used for more precise performance measurement. This is a temporary link; I plan to upload the data to Zenodo soon with further instructions.

I believe your questions can be answered by my updated code. Regarding your third question (which no longer needs to be considered with the current code, but I will answer it here for your information):

  3. predicted output is still mismatched with ground-truth label, ..., the mismatches are usually the label being 1 or 2 frames longer than the predicted output. ... Do you think this mismatch is normal?

Yes, it is normal. I provide as much annotation information as possible for users: even for segments with durations shorter than a given resolution (like 20 ms), I still provide labels. The length of the predicted results is determined by the model, so it might not always be able to produce predictions for segments shorter than the resolution.
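
As a concrete illustration of why this happens (assuming 16 kHz audio and the standard wav2vec 2.0 convolutional encoder, whose overall stride is 320 samples with a receptive field of roughly 400 samples; these values are an assumption about the model used here):

    # Sketch: why labels can be 1-2 frames longer than the model output.
    import math

    def wav2vec2_num_frames(n_samples: int) -> int:
        # frames produced by a conv encoder with stride 320 and
        # receptive field 400 (standard wav2vec 2.0 values)
        return (n_samples - 400) // 320 + 1

    def label_num_frames(n_samples: int) -> int:
        # labels cover every started 20 ms segment, even partial ones
        return math.ceil(n_samples / 320)

    n = int(16000 * 2.09)  # a 2.09 s utterance
    print(wav2vec2_num_frames(n), label_num_frames(n))  # prints: 104 105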

Please feel free to let me know if you have further questions.

ductuantruong commented 10 months ago

Thank you for your reply. I think you cleared up most of my concerns except the first question about the _prepare_segcon_target_ali function. Could you kindly answer that first question, since I am trying to develop a new model and I want the training process to be similar to yours?

Once again, thank you for spending your quality time to support me!

zlin0 commented 10 months ago

  Thank you for your reply. I think you cleared up most of my concerns except the first question about the _prepare_segcon_target_ali function. Could you kindly answer that first question, since I am trying to develop a new model and I want the training process to be similar to yours?

@ductuantruong Yes, it does. The answer can actually be found in the function _prepare_segcon_target_ali within model.py, specifically in lines 484 to 487:

    # keep only the first time_len frames of the target so that it
    # matches the length of the model's score sequence
    time_len = score.shape[0]
    target_vec = torch.tensor(
        target[:time_len], device=score.device, dtype=score.dtype)

[:time_len] here is what you wrote as [:aligned_length] in your question: the target is truncated at the end, keeping its first time_len frames.

Although I have answered, I am not sure why you are still concerned about this. This alignment only impacts the final measurement (calculating the EER); it does not affect the training process. I use this function solely for the final measurement. Moreover, the latest measurement code does not require this alignment and is more precise.

zlin0 commented 10 months ago

Maybe you want to compare EER values with my papers? I recommend comparing them to Tables 3 and 4 in reference [2] (note that Table 4 is in the appendix of the arXiv version). The models and predicted scores are the same for references [1] and [2]; the difference is that [2] uses the more precise way of measuring performance (as in the latest code in PartialSpoof/metric). Additionally, the results shown in Fig. 3 of [1] correspond to the diagonals of Tables 3 and 4 in [2], computed with the same 'rough' calculation. For a quick understanding of these tables, please refer to this poster.

[1] The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance
[2] Range-Based Equal Error Rate for Spoof Localization

ductuantruong commented 10 months ago

Hi @zlin0,

I deeply appreciate your detailed explanation. I asked because I still encountered a number of frame mismatches between the predicted output and the ground-truth label, and I was not certain whether to cut the mismatch at the beginning or the end of the sequences. Furthermore, I do want to compare against your results in [1], so I just wanted to make sure my mismatch cutting is identical to your implementation.

Lastly, I don't have any further concerns. Once again, thank you for spending your quality time to support me!