modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

I can't find the data preparation recipe code for the SOND diarization model #1959

Open shanguanma opened 1 month ago

shanguanma commented 1 month ago

Notice: In order to resolve issues more efficiently, please raise your issue following the template.

❓ Questions and Help

This issue is the same as https://github.com/modelscope/FunASR/issues/1916. I have followed your suggestion, @LauraGPT, and referred to branch v0.8.8; the details are as follows:

Following https://github.com/modelscope/FunASR/blob/v0.8.8/egs/alimeeting/diarization/sond/run.sh, I can use the pretrained SOND model and the preprocessed AliMeeting test set to reproduce the 4.12% DER on the AliMeeting test set claimed in the paper (https://arxiv.org/pdf/2211.10243). However, how is the preprocessed test data obtained? The relevant code is missing from the current folder.

I also referred to https://github.com/modelscope/FunASR/blob/v0.8.8/egs/alimeeting/modular_sa_asr/run_diar.sh. To obtain the speaker profile, it uses VBx for a first-pass diarization, then performs rttm2segment and overlap removal, extracts x-vectors on that basis, and finally runs resegment_data. The segments and wav.scp contained in data_source_dir at this step are not clearly specified, so I tried two assumptions (the assumed Kaldi file formats are sketched below):

  1. Assumption 1: they come directly from the segments and wav.scp produced by the pretrained VAD; this gives a DER of 10.69% on the AliMeeting Eval set.
  2. Assumption 2: they come directly from the oracle segments (I used https://github.com/modelscope/FunASR/blob/v0.8.8/egs/alimeeting/sa_asr/run.sh --stage 1 --stop-stage 1 to obtain data/org/Eval_Ali_far/{segments,wav.scp} as data_source_dir); this gives a DER of 10.14% on the AliMeeting Eval set.

In both cases the speaker profile is still obtained the way the paper describes, via BLSTM-based spectral clustering, yet the resulting DER is far from the number reported in the paper.
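For reference, the two Kaldi files in question have the following standard layout (the recording id, times, and path below are made up for illustration):

# wav.scp: <recording-id> <path-to-wav>
R8002_M8002 /path/to/Eval_Ali_far/audio_dir/R8002_M8002.wav
# segments: <utt-id> <recording-id> <start-sec> <end-sec>
R8002_M8002-0000050-0000320 R8002_M8002 0.50 3.20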

Could you help me track this down and improve the result?


LauraGPT commented 1 month ago

@ZhihaoDU Could you please help check this?

shanguanma commented 1 month ago

@ZhihaoDU, any comments?

ZhihaoDU commented 1 month ago

You can obtain the standard speaker diarization files, such as wav.scp and rttm, with the official recipe of the AliMeeting competition. Then you can refer to the TOLD/SOAP recipe at https://github.com/modelscope/FunASR/blob/v0.8.8/egs/callhome/TOLD/soap/run.sh. Although that recipe is for CallHome, the data preparation can be shared between SOAP and SOND once you have the standard speaker diarization files. Note that for AliMeeting the oracle VAD information is used at inference time, while the CallHome results are based on VAD model outputs. (The standard RTTM format is sketched below.)
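For reference, a standard RTTM SPEAKER line has ten fields: type, file id, channel, onset and duration in seconds, then the speaker name (unused fields are <NA>). The file id and speaker label below are made up for illustration:

SPEAKER R8002_M8002 1 15.32 2.48 <NA> <NA> SPK0001 <NA> <NA>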

shanguanma commented 1 month ago

@ZhihaoDU, thanks for your reply. I followed your suggestion and prepared the script below for the AliMeeting Eval set, using the released SOND model and x-vector model. However, the final DER is very bad. I don't know where I went wrong; please point it out. I'll paste the shell code below.

step 1: prepare wav.scp and rttm files

#!/bin/bash

. ./path.sh || exit 1; # it sets up the kaldi and funasr environments.
stage=0
stop_stage=1000

. utils/parse_options.sh || exit 1;

## alimeeting far-field audio has 8 channels; we focus on a single-channel diarization system,
## so we extract mono-channel audio waveforms here.
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
    input_dir=/mntcephfs/lab_data/maduo/datasets/alimeeting/Eval_Ali/Eval_Ali_far/audio_dir
    output_dir=data/alimeeting_mono/Eval/audio_dir
    python local/get_alimeeting_mono_audio.py \
        $input_dir  $output_dir
fi

# refer from https://github.com/yufan-aslp/AliMeeting/blob/main/speaker/run.sh
if [ $stage -le 1  ] && [ ${stop_stage} -ge 1 ];then
   echo "prepare alimeeting eval set ref rttm file"
   textgrid_dir=/mntcephfs/lab_data/maduo/datasets/alimeeting/Eval_Ali/Eval_Ali_far/textgrid_dir
   audio_dir=data/alimeeting_mono/Eval/audio_dir/
   dest_dir=data/alimeeting_mono/Eval/v9
   work_dir=$dest_dir/.work
   mkdir -p $work_dir

   # collect and sort the wav list, then derive utterance ids from the file name stems
   find -L $audio_dir -name "*.wav" | sort > $work_dir/wavlist
   awk -F '/' '{print $NF}' $work_dir/wavlist | awk -F '.' '{print $1}' > $work_dir/uttid

   # collect and sort the TextGrid list, then pair it with the utterance ids
   find -L $textgrid_dir -iname "*.TextGrid" | sort > $work_dir/textgrid.flist
   paste $work_dir/uttid $work_dir/textgrid.flist > $work_dir/uttid_textgrid.flist
   paste $work_dir/uttid $work_dir/wavlist > $dest_dir/wav.scp

   paste $work_dir/uttid $work_dir/uttid > $work_dir/utt2spk
   cp $work_dir/utt2spk $work_dir/spk2utt
   cp $work_dir/uttid $work_dir/text

   # convert each TextGrid annotation into a per-meeting rttm file
   while read line; do
       text_grid=`echo $line | awk '{print $1}'`
       text_grid_path=`echo $line | awk '{print $2}'`
       echo "text_grid: $text_grid"
       echo "text_grid_path: ${text_grid_path}"
       python3 local/make_textgrid_rttm.py \
           --input_textgrid_file $text_grid_path \
           --uttid $text_grid \
           --output_rttm_file $work_dir/${text_grid}.rttm
   done < $work_dir/uttid_textgrid.flist

   # merge the per-meeting rttms into the reference rttm
   cat $work_dir/*.rttm > $dest_dir/alimeeting_eval.rttm
   cat $dest_dir/alimeeting_eval.rttm > $dest_dir/ref.rttm
   mv $work_dir/{spk2utt,utt2spk,text} $dest_dir/
fi

step 2: get non-overlap segments via the ref.rttm file

datadir=data/alimeeting_mono
version=v9
dumpdir=dump
expdir=exp
train_cmd=utils/run.pl
sr=16000
nj=8
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
  echo "Stage 2: Extract non-overlap segments from alimeeting eval dataset"
  for dset in Eval; do
    echo "Stage 2: Extracting non-overlap segments for ${dset}"
    mkdir -p ${dumpdir}/${dset}/${version}/nonoverlap_0s
    python3 -Wignore script/extract_nonoverlap_segments.py \
      ${datadir}/${dset}/${version}/wav.scp ${datadir}/${dset}/${version}/ref.rttm ${dumpdir}/${dset}/${version}/nonoverlap_0s \
      --min_dur 0.1 --max_spk_num 4 --sr ${sr} --no_pbar --nj ${nj}

    # build wav.scp and utt2spk from the dumped segment paths,
    # assuming segments are written as .../nonoverlap_0s/<speaker>/<utt>.wav
    mkdir -p ${datadir}/${dset}/${version}/nonoverlap_0s
    find ${dumpdir}/${dset}/${version}/nonoverlap_0s/ -iname "*.wav" | sort | awk -F'[/.]' '{print $(NF-1),$0}' > ${datadir}/${dset}/${version}/nonoverlap_0s/wav.scp
    awk -F'[/.]' '{print $(NF-1),$(NF-2)}' ${datadir}/${dset}/${version}/nonoverlap_0s/wav.scp > ${datadir}/${dset}/${version}/nonoverlap_0s/utt2spk
    echo "Done."
  done
fi
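At this point it may be worth sanity-checking the generated Kaldi directory; a minimal check, assuming the standard Kaldi utils/ symlink used elsewhere in this script:

# fix sorting/consistency, then validate (no feats or text exist yet at this stage)
utils/fix_data_dir.sh ${datadir}/Eval/${version}/nonoverlap_0s
utils/validate_data_dir.sh --no-feats --no-text ${datadir}/Eval/${version}/nonoverlap_0s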

step 3: extract 80-dimensional fbank features

if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
  echo "Stage 5: Generate fbank features"
  . ./path.sh

  for dset in Eval; do
    steps/make_fbank.sh --write-utt2num-frames true --fbank-config conf/fbank_16k.conf --nj ${nj} --cmd "$train_cmd" \
        ${datadir}/${dset}/${version} ${expdir}/make_fbank/${dset}/${version} ${dumpdir}/${dset}/${version}/fbank
    utils/fix_data_dir.sh ${datadir}/${dset}/${version}
  done

  for dset in Eval/${version}/nonoverlap_0s; do
    steps/make_fbank.sh --write-utt2num-frames true --fbank-config conf/fbank_16k.conf --nj ${nj} --cmd "$train_cmd" \
        ${datadir}/${dset} ${expdir}/make_fbank/${dset} ${dumpdir}/${dset}/fbank
    utils/fix_data_dir.sh ${datadir}/${dset}
  done
fi
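To match the x-vector model (whose input_size is set to 80 in stage 7 below), conf/fbank_16k.conf should produce 80-dimensional fbank at 16 kHz. A minimal sketch of such a Kaldi fbank config; the exact file shipped with the recipe may differ:

# conf/fbank_16k.conf (assumed contents)
--sample-frequency=16000
--num-mel-bins=80
--frame-length=25
--frame-shift=10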

step 4: extract x-vector speaker embeddings

if [ $stage -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    echo "download the pretrained x-vector speaker model"
    git lfs install
    git clone https://www.modelscope.cn/iic/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
    mv speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch ${expdir}/
fi
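After cloning, the model directory should contain at least the weights and config that stage 7 reads (sv.pth and sv.yaml, as referenced below); a quick check:

ls ${expdir}/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth \
   ${expdir}/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml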
infer_cmd=utils/run.pl
# number of jobs for inference:
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=4
ngpu=1
inference_nj=$((ngpu * njob))
_ngpu=1
gpuid_list="0"

if [ $stage -le 7 ] && [ ${stop_stage} -ge 7 ]; then
  sv_exp_dir=$expdir/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch
  # inference consumes precomputed 80-dim fbank features, so fix input_size in the config
  sed "s/input_size: null/input_size: 80/g" ${sv_exp_dir}/sv.yaml > ${sv_exp_dir}/sv_fbank.yaml
  for dset in  Eval/${version}/nonoverlap_0s; do
    key_file=${datadir}/${dset}/feats.scp
    num_scp_file="$(<${key_file} wc -l)"
    _nj=$([ $inference_nj -le $num_scp_file ] && echo "$inference_nj" || echo "$num_scp_file")
    _logdir=${dumpdir}/${dset}/xvecs
    mkdir -p ${_logdir}
    split_scps=
    for n in $(seq "${_nj}"); do
        split_scps+=" ${_logdir}/keys.${n}.scp"
    done
    # shellcheck disable=SC2086
    utils/split_scp.pl "${key_file}" ${split_scps}

    ${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/sv_inference.JOB.log \
      python3 -m funasr.bin.sv_inference_launch \
        --batch_size 1 \
        --njob ${njob} \
        --ngpu "${_ngpu}" \
        --gpuid_list ${gpuid_list} \
        --data_path_and_name_and_type "${key_file},speech,kaldi_ark" \
        --key_file "${_logdir}"/keys.JOB.scp \
        --sv_train_config ${sv_exp_dir}/sv_fbank.yaml \
        --sv_model_file ${sv_exp_dir}/sv.pth \
        --output_dir "${_logdir}"/output.JOB
    cat ${_logdir}/output.*/xvector.scp | sort > ${datadir}/${dset}/utt2xvec
  done
fi
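The concatenated utt2xvec is a Kaldi-style scp mapping each non-overlap segment id to the ark offset of its x-vector; one line looks roughly like this (segment id and path are illustrative):

R8002_M8002-SPK0001-0001 dump/Eval/v9/nonoverlap_0s/xvecs/output.1/xvector.ark:42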
if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
  echo "Stage 8: Generate label files."

  for dset in Eval/${version}; do
    echo "Stage 8: Generate labels for ${dset}."
    python3 -Wignore script/calc_real_meeting_frame_labels.py \
          ${datadir}/${dset} ${dumpdir}/${dset}/labels \
          --n_spk 4 --frame_shift 0.01 --nj $nj --sr $sr
    find `pwd`/${dumpdir}/${dset}/labels/ -iname "*.lbl.mat" | awk -F'[/.]' '{print $(NF-2),$0}' | sort > ${datadir}/${dset}/labels.scp
  done
fi
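The resulting labels.scp maps each meeting id to its frame-level label matrix, one line per meeting (the id and path are illustrative):

R8002_M8002 /abs/path/dump/Eval/v9/labels/R8002_M8002.lbl.mat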

if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
  # dump alimeeting eval data in test mode.
  data_dir=${datadir}/Eval/${version}/files_for_dump
  mkdir -p ${data_dir}
  # filter out zero-duration segments
  LC_ALL=C awk '{if ($5 > 0){print $0}}' ${datadir}/Eval/${version}/ref.rttm > ${data_dir}/ref.rttm
  cp ${datadir}/Eval/${version}/{feats.scp,labels.scp} ${data_dir}/
  cp ${datadir}/Eval/${version}/nonoverlap_0s/{utt2spk,utt2xvec,utt2num_frames} ${data_dir}/

  echo "Stage 9: start to dump for alimeeting."
  # chunk_size/chunk_shift are in frames: with a 10 ms frame shift,
  # 1600 frames = 16 s chunks with a 4 s (400-frame) shift
  python3 -Wignore script/dump_meeting_chunks.py --dir ${data_dir} \
    --out ${dumpdir}/Eval/${version}/dumped_files/data --n_spk 16 --no_pbar --sr $sr --mode test \
    --chunk_size 1600 --chunk_shift 400 --add_mid_to_speaker true

  mkdir -p ${datadir}/Eval/${version}/dumped_files
  cat ${dumpdir}/Eval/${version}/dumped_files/data_parts*_feat.scp | sort > ${datadir}/Eval/${version}/dumped_files/feats.scp
  cat ${dumpdir}/Eval/${version}/dumped_files/data_parts*_xvec.scp | sort > ${datadir}/Eval/${version}/dumped_files/profile.scp
  cat ${dumpdir}/Eval/${version}/dumped_files/data_parts*_label.scp | sort > ${datadir}/Eval/${version}/dumped_files/label.scp
  mkdir -p ${expdir}/alimeeting_eval_states
  awk '{print $1,"1600"}' ${datadir}/Eval/${version}/dumped_files/feats.scp | shuf > ${expdir}/alimeeting_eval_states/speech_shape
  python3 -Wignore script/convert_rttm_to_seg_file.py --rttm_scp ${data_dir}/ref.rttm --seg_file ${data_dir}/org_vad.txt

fi
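After this stage, the three scp files under dumped_files should be aligned chunk by chunk; a quick consistency check:

# the three files should have identical keys and line counts
wc -l ${datadir}/Eval/${version}/dumped_files/{feats.scp,profile.scp,label.scp}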
# evaluate with the pretrained model
if [ ${stage} -le 11 ] && [ ${stop_stage} -ge 11 ]; then
    echo "stage 11: evaluation for phase-1 model."
    test_sets=Eval/${version}
    # inference related
    inference_model=sond.pb # officially released sond model
    model_dir=speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch
    for dset in ${test_sets}; do
        echo "Processing for $dset"
        exp_model_dir=${expdir}/${model_dir}
        _dir=${exp_model_dir}/${dset}
        _logdir="${_dir}/logdir"
        if [ -d ${_dir} ]; then
            echo "WARNING: ${_dir} already exists."
        fi
        mkdir -p "${_logdir}"
        _data="${datadir}/${dset}/dumped_files"
        key_file=${_data}/feats.scp
        num_scp_file="$(<${key_file} wc -l)"
        _nj=$([ $inference_nj -le $num_scp_file ] && echo "$inference_nj" || echo "$num_scp_file")
        split_scps=
        for n in $(seq "${_nj}"); do
            split_scps+=" ${_logdir}/keys.${n}.scp"
        done
        # shellcheck disable=SC2086
        utils/split_scp.pl "${key_file}" ${split_scps}
        echo "Inference log can be found at ${_logdir}/inference.*.log"
        ${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/inference.JOB.log \
            python3 -m funasr.bin.diar_inference_launch \
                --batch_size 1 \
                --ngpu "${_ngpu}" \
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/feats.scp,speech,kaldi_ark" \
                --data_path_and_name_and_type "${_data}/profile.scp,profile,kaldi_ark" \
                --key_file "${_logdir}"/keys.JOB.scp \
                --diar_train_config "${exp_model_dir}"/sond_fbank.yaml \
                --diar_model_file "${exp_model_dir}"/"${inference_model}" \
                --output_dir "${_logdir}"/output.JOB \
                --mode sond
    done
fi

if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
  echo "stage 12: Scoring phase-1 models"
  if [ ! -e dscore ]; then
    git clone https://github.com/nryant/dscore.git
    # add intervaltree to setup.py
  fi
fi
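dscore needs a few Python packages at runtime; a minimal setup sketch (these are the usual dscore requirements, your environment may already provide them):

pip install numpy scipy intervaltree tabulate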

if [ ${stage} -le 13 ] && [ ${stop_stage} -ge 13 ]; then
  test_sets=Eval/${version}
  model_dir=speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch
  for dset in ${test_sets}; do
    echo "stage 13: Scoring for ${dset}"
    diar_exp=${expdir}/${model_dir}
    _data="${datadir}/${dset}"
    _dir=${diar_exp}/${dset}
    _logdir="${_dir}/logdir"
    cat ${_logdir}/*/labels.txt | sort > ${_dir}/labels.txt

    python3  script/convert_label_to_rttm.py \
        ${_dir}/labels.txt \
        ${datadir}/${dset}/files_for_dump/org_vad.txt \
        ${_dir}/sys.rttm \
        --ignore_len 10 \
        --no_pbar \
        --smooth_size 83 \
        --vote_prob 0.5 \
        --n_spk 16
    ref=${datadir}/${dset}/files_for_dump/ref.rttm
    sys=${_dir}/sys.rttm.ref_vad
    python3 -Wignore dscore/score.py -r $ref -s $sys --collar 0.25

    ref=${datadir}/${dset}/files_for_dump/ref.rttm
    sys=${_dir}/sys.rttm.sys_vad
    python3 -Wignore dscore/score.py -r $ref -s $sys --collar 0.25
  done
fi
shanguanma commented 1 month ago

@ZhihaoDU, any comments?

shanguanma commented 1 month ago

@ZhihaoDU, no further hints?