@ZhongJiafeng-16 Hi, thank you for your interest in our work. I see, you mean that the time dimensions of the model output and the labels are mismatched.
- Does this mean that part of the speech data is discarded during the model forward pass?
Yes, I discarded such mismatched parts during model training. You can find the alignment function (_prepare_segcon_target_ali) at Line 531 and Line 458 of model.py.
- How did the EER results in the paper deal with the differences in the dimensions of labels and model output?
I removed these mismatched segments when calculating EER too. I aligned the predicted scores and targets using the above-mentioned _prepare_segcon_target_ali function. For ease of calculation, I saved the aligned [predict_score, groundtruth_target] pairs to a pkl file and performed the measurements afterward. This process is implemented in lines 565 to 568 of model.py.
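For illustration, the truncation-based alignment plus the pkl dump could look roughly like the sketch below (my own minimal version with made-up variable and file names, not the exact code in model.py):

```python
import pickle
import torch

def align_score_and_target(score, target):
    """Truncate both sequences to the shorter time length
    (a sketch of the idea behind _prepare_segcon_target_ali,
    not the repository's exact code)."""
    time_len = min(score.shape[0], len(target))
    target_vec = torch.tensor(
        target[:time_len], device=score.device, dtype=score.dtype)
    return score[:time_len], target_vec

# Toy data: 103 predicted frame scores vs. a 105-frame segment label.
score = torch.randn(103)
target = [0, 1] * 52 + [1]  # 105 labels

aligned_score, aligned_target = align_score_and_target(score, target)

# Save the aligned pair so EER can be measured offline, as described above.
with open("predict_target_aligned.pkl", "wb") as f:
    pickle.dump([aligned_score.numpy(), aligned_target.numpy()], f)
```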
Honestly, the EER measuring process in this journal paper is quite tricky, as we ideally should not discard any segments. We recently proposed a new metric called range-based EER (which takes all misclassified/missed duration into account), and the code will be released soon. In this new metric, we retain all mismatched segments and calculate EER accordingly.
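As a rough illustration of the duration-based idea (my own toy sketch at a single fixed decision threshold; the real range-based EER sweeps thresholds, and the released code will be the authoritative implementation):

```python
def overlap(a, b):
    """Total overlap in seconds between two lists of (start, end) ranges.
    Assumes the ranges within each list do not overlap each other."""
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)

def duration_errors(gt_spoof, pred_spoof, utt_dur):
    """Duration-based miss rate and false-alarm rate for one utterance."""
    spoof_dur = sum(e - s for s, e in gt_spoof)
    pred_dur = sum(e - s for s, e in pred_spoof)
    hit = overlap(gt_spoof, pred_spoof)
    miss_rate = (spoof_dur - hit) / spoof_dur           # spoof judged bona fide
    fa_rate = (pred_dur - hit) / (utt_dur - spoof_dur)  # bona fide judged spoof
    return miss_rate, fa_rate

# 10 s utterance, true spoof region 2.0-4.0 s, predicted region 2.5-4.5 s.
print(duration_errors([(2.0, 4.0)], [(2.5, 4.5)], 10.0))  # (0.25, 0.0625)
```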
Please feel free to reach out if you have any other questions or uncertainties.
@zlin0 Thanks for the reply. It is really clear and helpful. I will continue to follow the code updates. Thanks again.
Hi,
Thank you for sharing your amazing work. I am trying to develop a new codebase based on yours. I have a few questions related to this issue topic, and hopefully you can help me with them. My questions are as follows:
Thank you for spending time to help us!
@ductuantruong thank you for your interest! Sorry for the delayed response; I was quite busy and spent a few days cleaning up my code.
I've updated the code for metrics in the folder PartialSpoof/metric. Please refer to PartialSpoof/metric/README.md for more information. (Also apologies to @ZhongJiafeng-16 for my late update!)
- metric/UtteranceEER.py: measures utterance-level spoof detection.
- metric/SegmentEER.py: measures localization performance using the point-based segment-level EER.
- metric/RangeEER.py: measures localization performance using the range-based EER.

As I mentioned previously, we ideally should not discard any segments. In the latest code update in PartialSpoof/metric, I retain all mismatched regions and measure performance using timestamps accordingly. Also, please download the file PS_data.tar.gz, which contains the timestamps I used for more precise performance measurement. This is a temporary link; I plan to upload the data to Zenodo soon with further instructions.
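For a generic picture of what a point-based segment-level EER computes over aligned per-segment scores and labels, here is a minimal sketch using scikit-learn (my own illustration; SegmentEER.py in the repo is the authoritative implementation):

```python
import numpy as np
from sklearn.metrics import roc_curve

def segment_eer(scores, labels):
    """Point-based segment-level EER: every segment is scored
    independently. labels: 1 = spoof, 0 = bona fide; a higher
    score means more likely spoof."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example: six aligned 20 ms segments.
scores = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   0])
print(segment_eer(scores, labels))  # 0.0 for this separable toy case
```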
I believe your questions can be answered by my updated code. Regarding your third question (which no longer needs to be considered with the current code, but I will answer here for your information):
- predicted output is still mismatched with ground-truth label, ..., the mismatches are usually the label length having 1 or 2 frames longer than the predicted output. ... Do you think this mismatch is normal?
Yes, it is normal. I provide as much annotation information as possible for users: even for segments with durations shorter than a given resolution (like 20 ms), I still provide labels. The length of the predicted results is determined by the model, so it might not always be able to produce predictions for segments shorter than the resolution.
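A toy calculation of how a 1-2 frame gap can arise (assuming, purely for illustration, that labels round the utterance length up to the resolution while the model's output length rounds down):

```python
import math

resolution = 0.02  # 20 ms segments
utt_dur = 2.068    # hypothetical utterance length in seconds

# Labels annotate every segment, including a final partial one
# shorter than 20 ms, so the label length rounds up ...
label_len = math.ceil(utt_dur / resolution)   # 104 labels

# ... while the model's output length is fixed by its frame
# rate/downsampling and typically rounds down.
pred_len = math.floor(utt_dur / resolution)   # 103 predictions

print(label_len - pred_len)  # 1 frame of mismatch
```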
Please feel free to let me know if you have further questions.
Thank you for your reply. I think you cleared most of my concerns except the first question about the _prepare_segcon_target_ali function. Could you kindly answer the above first question, since I am trying to develop a new model and I want the training process to be similar to yours?
Once again, thank you for spending your quality time to support me!
@ductuantruong Yes, I did. The answer can actually be found in the function `_prepare_segcon_target_ali` within model.py, specifically in lines 484 to 487:

```python
time_len = score.shape[0]
target_vec = torch.tensor(
    target[:time_len], device=score.device, dtype=score.dtype)
```

The `[:time_len]` here is what you wrote as `[:aligned_lenght]` in your question.
Although I replied to you, I am not sure why you are still concerned about this. This alignment only impacts the final measurement (calculating the EER) but does not affect the training process; I use this function solely for the final measurement. Moreover, the latest measurement code does not require alignment and is more precise.
Maybe you want to compare EER values with my paper? I recommend comparing them to Table 3 and Table 4 in reference [2] (note that Table 4 is in the appendix of the arXiv version). The models and predicted scores are the same for references [1] and [2]; the difference is that [2] uses the more precise way to measure performance (as in the latest code in PartialSpoof/metric). Additionally, the results shown in Fig. 3 of [1] represent an identical 'rough' calculation of the diagonals from Tables 3 & 4 in [2]. For a quick understanding of these tables, please refer to this poster.
[1] The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance
[2] Range-Based Equal Error Rate for Spoof Localization
Hi @zlin0,
I deeply appreciate your detailed explanation. I was asking because I still encountered a number of frame mismatches between the predicted output and the ground-truth labels, and I am not certain whether to cut the mismatch at the beginning or the end of the sequences. Furthermore, I do want to compare with your results in [1], so I just want to make sure my mismatch cutting is identical to your implementation.
Lastly, I don't have any further concerns. Once again, thank you for spending your quality time to support me!
Hello, thanks to your team for the great work and for open-sourcing this code. It has helped me a lot. I found that the segment-level labels from the PartialSpoof dataset have more dimensions than the predicted results. For example, the sample CON_E_0000000.wav in the eval set has a segment-level label of 105 dimensions, while I got a prediction output of 103 dimensions at 20 ms temporal resolution using the multireso version of the pretrained model offered in 03multireso/01_download_database.sh. Could you please answer the following questions? Thank you very much!