andyllegrand opened this issue 6 months ago
We very much welcome your reproduction attempt!
I start by looking at your E4
and comparing it to our E4s (which I assume is what you ran):
test_sign_segment_IoU: 0.6758992075920105
test_sentence_segment_IoU: 0.7973689436912537
The sentence IoU is bang on at 79%, but the sign IoU is higher than expected (67%, compared to the 63% reported in the paper).
Looking at E1:
│ test_sentence_segment_IoU │ 0.6340624690055847 │
│ test_sign_segment_IoU │ 0.5634394884109497 │
You get 63% on sentences, which we never do (we always get more, except for E5), and 56% on sign, which is lower than our 66%.
Your F1, however, matches the one in our paper in both cases.
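Since most of the discrepancy is in the IoU columns, here is a minimal sketch of how a frame-level segment IoU can be computed over binary in/out masks. This is an illustration of the metric family, not necessarily the exact computation in the repository:

```python
def frame_iou(pred, gold):
    """Frame-level IoU between two binary masks (1 = frame is inside a segment).

    Illustrative only; the repository may compute IoU per matched segment
    pair rather than over the flattened frame masks.
    """
    assert len(pred) == len(gold)
    intersection = sum(1 for p, g in zip(pred, gold) if p and g)
    union = sum(1 for p, g in zip(pred, gold) if p or g)
    return intersection / union if union else 1.0

# Example: the prediction misses the first frame of the gold segment.
pred = [0, 1, 1, 1, 0, 0, 1, 1]
gold = [1, 1, 1, 1, 0, 0, 1, 1]
print(frame_iou(pred, gold))  # 5/6 ≈ 0.833
```

Small boundary errors barely move F1 (the segment is still detected) but directly eat into IoU, which is one reason F1 can match while IoU drifts.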
Now that all the facts are there, some things to check:

* IoU decoding thresholds: in our experiments, we used 50 and 50 (as specified here); different values can lead to different IoUs.

Finally, I share with you the code for our table. If you fill in your numbers, we might be able to spot some stark/consistent difference.
\begin{tabular}{llccc|ccc|cc}
\toprule
& & \multicolumn{3}{c}{\textbf{Sign}} & \multicolumn{3}{c}{\textbf{Phrase}} & \multicolumn{2}{c}{\textbf{Efficiency}} \\
\cmidrule(lr){3-5} \cmidrule(lr){6-8} \cmidrule(lr){9-10}
\multicolumn{2}{l}{\textbf{Experiment}} & \textbf{F1} & \textbf{IoU} & \textbf{\%} & \textbf{F1} & \textbf{IoU} & \textbf{\%} & \textbf{\#Params} & \textbf{Time} \\
\midrule
\textbf{E0} & \textbf{\citet{detection:moryossef2020real}} & --- & $0.46$ & $1.09$ & --- & $0.70$ & $\textbf{1.00}$ & \textbf{102K} & \textbf{0:50:17}\\
\midrule
\textbf{E1} & \textbf{Baseline} & $0.56$ & $0.66$ & $0.91$ & $0.59$ & $0.80$ & $2.50$ & 454K & 1:01:50\\
\textbf{E2} & \textbf{E1 + Face} & $0.53$ & $0.58$ & $0.64$ & $0.57$ & $0.76$ & $1.87$ & 552K & 1:50:31\\
\textbf{E3} & \textbf{E1 + Optical Flow} & $0.58$ & $0.62$ & $1.12$ & $0.60$ & $0.82$ & $3.19$ & 473K & 1:20:17\\
\textbf{E4} & \textbf{E3 + Hand Norm} & $0.56$ & $0.61$ & $1.07$ & $0.60$ & $0.80$ & $3.24$ & 516K & 1:30:59\\
\midrule
\textbf{E1s} & \textbf{E1 + Depth=4} & $\textbf{0.63}$ & $\textbf{0.69}$ & $1.11$ & $\textbf{0.65}$ & $0.82$ & $1.63$ & 1.6M & 4:08:48\\
\textbf{E2s} & \textbf{E2 + Depth=4} & $0.62$ & $\textbf{0.69}$ & $1.07$ & $0.63$ & $0.84$ & $2.68$ & 1.7M & 3:14:03\\
\textbf{E3s} & \textbf{E3 + Depth=4} & $0.60$ & $0.63$ & $1.13$ & $0.64$ & $0.80$ & $1.53$ & 1.7M & 4:08:30\\
\textbf{E4s} & \textbf{E4 + Depth=4} & $0.59$ & $0.63$ & $1.13$ & $0.62$ & $0.79$ & $1.43$ & 1.7M & 4:35:29\\
\midrule
\textbf{E1s*} & \textbf{E1s + Tuned Decoding} & --- & $\textbf{0.69}$ & $\textbf{1.03}$ & --- & $\textbf{0.85}$ & $1.02$ & --- & ---\\
\textbf{E4s*} & \textbf{E4s + Tuned Decoding} & --- & $0.63$ & $1.06$ & --- & $0.79$ & $1.12$ & --- & ---\\
\midrule
\textbf{E5} & \textbf{E4s + Autoregressive} & $0.45$ & $0.47$ & $0.88$ & $0.52$ & $0.63$ & $2.72$ & 1.3M & \textasciitilde3 days\\
\bottomrule
\end{tabular}
We reran E1 and E2 and created the following table, which contains our values along with the deltas to the reported values. E1 trained for 20 epochs, E2 for 42 epochs, E3 for 80 epochs, and E4 for 69 epochs.
\resizebox{\linewidth}{!}{
\begin{tabular}{llccc|ccc|c}
\toprule
& & \multicolumn{3}{c}{\textbf{Sign}} & \multicolumn{3}{c}{\textbf{Phrase}} & \textbf{Efficiency} \\
\cmidrule(lr){3-5} \cmidrule(lr){6-8} \cmidrule(lr){9-9}
\multicolumn{2}{l}{\textbf{Experiment}} & \textbf{F1} & \textbf{IoU} & \textbf{\%} & \textbf{F1} & \textbf{IoU} & \textbf{\%} & \textbf{Time} \\
\midrule
\textbf{E0} & \textbf{\citet{10.1007/978-3-030-66096-3_17}} & --- & $0.48$ ($+0.02$) & $1.24$ ($+0.15$) & --- & $0.70$ ($0$) & $1.07$ ($+0.07$) & \textbf{4:29:01} \\
\midrule
\textbf{E1} & \textbf{Baseline} & $0.52$ ($-0.04$) & $0.58$ ($-0.08$) & $0.83$ ($-0.04$) & $0.53$ ($-0.06$) & $0.71$ ($-0.09$) & $2.93$ ($+0.43$) & 2:48:04 \\
\textbf{E2} & \textbf{E1 + Face} & $0.50$ ($-0.03$) & $0.30$ ($-0.28$) & $0.40$ ($-0.24$) & $0.57$ & $0.47$ ($-0.29$) & $0.30$ ($-1.57$) & 6:06:00 \\
\textbf{E3} & \textbf{E1 + Optical Flow} & $0.61$ ($+0.03$) & $0.68$ ($+0.06$) & $1.16$ ($+0.04$) & $0.60$ & $0.79$ ($-0.03$) & $3.39$ ($+0.20$) & 17:26:02 \\
\textbf{E4} & \textbf{E3 + Hand Norm} & $0.56$ & $0.68$ ($+0.07$) & $1.28$ ($+0.21$) & $0.60$ & $0.80$ & $3.82$ ($+0.58$) & 15:25:10 \\
\bottomrule
\end{tabular}
}
Our E1 values seem closer to the reported values than the first time we ran it; E2, however, is still quite different.
Our time values are also quite off, but this can be explained by the use of DiskList instead of in-memory storage.
We double-checked our code and, as far as we can tell, nothing was modified apart from the addition of DiskList. Our IoU thresholds should be 50 and 50.
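For context on why the thresholds matter, here is a hedged sketch of threshold-based decoding. The 50/50 values are assumed to be percentages applied to per-frame probabilities, which may differ from the repository's actual decoding:

```python
def decode_segments(o_probs, o_threshold=50):
    """Sketch of greedy decoding: a frame counts as 'inside' a segment when
    its in/out probability exceeds o_threshold percent; maximal runs of
    inside frames become (start, end) spans. Illustrative only; assumes the
    threshold is a percentage, which is a guess at the repo's convention.
    """
    inside = [p * 100 > o_threshold for p in o_probs]
    segments, start = [], None
    for i, flag in enumerate(inside):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(inside)))
    return segments

probs = [0.1, 0.8, 0.9, 0.4, 0.2, 0.7, 0.6]
print(decode_segments(probs, o_threshold=50))  # [(1, 3), (5, 7)]
print(decode_segments(probs, o_threshold=30))  # [(1, 4), (5, 7)]
```

The same probabilities yield different segment boundaries under different thresholds, which changes IoU and % while leaving the detected segments (and hence F1) largely intact.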
So we are comparing our results to yours.
I think I am personally more surprised by your higher IoU for E4 than by the lower scores. Both are suspicious.
In any case, after comparing the commands (E1, E2, E3, and E4), I find that E2 has one thing that is unique:
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --pose_components POSE_LANDMARKS LEFT_HAND_LANDMARKS RIGHT_HAND_LANDMARKS FACE_LANDMARKS --pose_reduce_face=true
Specifically, the pose_components and pose_reduce_face arguments. I suspect there might be something strange about the parsing of the arguments.
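One quick way to rule out a parsing difference is to reconstruct the two flags in isolation. A minimal sketch, assuming pose_components uses nargs='+' and pose_reduce_face goes through a string-to-bool conversion (my guess at the repository's setup, not a copy of it):

```python
import argparse

def str2bool(v):
    # Common pattern for boolean flags passed as --flag=true / --flag=false.
    return str(v).lower() in ("yes", "true", "t", "1")

parser = argparse.ArgumentParser()
parser.add_argument("--pose_components", nargs="+", type=str,
                    default=["POSE_LANDMARKS", "LEFT_HAND_LANDMARKS",
                             "RIGHT_HAND_LANDMARKS"])
parser.add_argument("--pose_reduce_face", type=str2bool, default=False)

# Parse the same flags as the E2 command line above.
args = parser.parse_args(
    "--pose_components POSE_LANDMARKS LEFT_HAND_LANDMARKS "
    "RIGHT_HAND_LANDMARKS FACE_LANDMARKS --pose_reduce_face=true".split()
)
print(vars(args))
```

If the repository's parser behaves like this sketch, the printed dict should contain all four components and pose_reduce_face=True.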
Can you please make sure that when you run this command on your computer, if you print the arguments (args) you see:
{
"pose_components": ["POSE_LANDMARKS", "LEFT_HAND_LANDMARKS", "RIGHT_HAND_LANDMARKS", "FACE_LANDMARKS"],
"pose_reduce_face": True
}
Tagging @J22Melody to see if there is something else to check.
Here are the arguments being printed for each experiment. I've also attached the complete log files. It looks like these arguments are set correctly for E2.
E0:
Agruments: Namespace(no_wandb=True, run_name=None, wandb_dir='.', seed=42, device='gpu', gpus=1, epochs=100, patience=20, batch_size=8, batch_size_devtest=20, learning_rate=0.001, lr_scheduler='none', dataset='dgs_corpus', data_dir='.', data_dev=False, fps=25, pose='holistic', pose_components=['POSE_LANDMARKS', 'LEFT_HAND_LANDMARKS', 'RIGHT_HAND_LANDMARKS'], pose_reduce_face=False, hand_normalization=False, optical_flow=True, only_optical_flow=True, classes='io', pose_projection_dim=256, hidden_dim=64, encoder_depth=1, encoder_bidirectional=False, encoder_autoregressive=False, weighted_loss=False, b_threshold=50, o_threshold=50, threshold_likeliest=False, train=True, test=True, save_jit=False, checkpoint=None, pred_output=None, ffmpeg_path=None)
E1:
Agruments: Namespace(no_wandb=True, run_name=None, wandb_dir='.', seed=42, device='gpu', gpus=1, epochs=100, patience=20, batch_size=8, batch_size_devtest=20, learning_rate=0.001, lr_scheduler='none', dataset='dgs_corpus', data_dir='.', data_dev=False, fps=25, pose='holistic', pose_components=['POSE_LANDMARKS', 'LEFT_HAND_LANDMARKS', 'RIGHT_HAND_LANDMARKS'], pose_reduce_face=False, hand_normalization=False, optical_flow=False, only_optical_flow=False, classes='bio', pose_projection_dim=256, hidden_dim=256, encoder_depth=1, encoder_bidirectional=True, encoder_autoregressive=False, weighted_loss=True, b_threshold=50, o_threshold=50, threshold_likeliest=False, train=True, test=True, save_jit=False, checkpoint=None, pred_output=None, ffmpeg_path=None)
E2:
Agruments: Namespace(no_wandb=True, run_name=None, wandb_dir='.', seed=42, device='gpu', gpus=1, epochs=100, patience=20, batch_size=8, batch_size_devtest=20, learning_rate=0.001, lr_scheduler='none', dataset='dgs_corpus', data_dir='.', data_dev=False, fps=25, pose='holistic', pose_components=['POSE_LANDMARKS', 'LEFT_HAND_LANDMARKS', 'RIGHT_HAND_LANDMARKS', 'FACE_LANDMARKS'], pose_reduce_face=True, hand_normalization=False, optical_flow=False, only_optical_flow=False, classes='bio', pose_projection_dim=256, hidden_dim=256, encoder_depth=1, encoder_bidirectional=True, encoder_autoregressive=False, weighted_loss=True, b_threshold=50, o_threshold=50, threshold_likeliest=False, train=True, test=True, save_jit=False, checkpoint=None, pred_output=None, ffmpeg_path=None)
E3:
Agruments: Namespace(no_wandb=True, run_name=None, wandb_dir='.', seed=42, device='gpu', gpus=1, epochs=100, patience=20, batch_size=8, batch_size_devtest=20, learning_rate=0.001, lr_scheduler='none', dataset='dgs_corpus', data_dir='.', data_dev=False, fps=25, pose='holistic', pose_components=['POSE_LANDMARKS', 'LEFT_HAND_LANDMARKS', 'RIGHT_HAND_LANDMARKS'], pose_reduce_face=False, hand_normalization=False, optical_flow=True, only_optical_flow=False, classes='bio', pose_projection_dim=256, hidden_dim=256, encoder_depth=1, encoder_bidirectional=True, encoder_autoregressive=False, weighted_loss=True, b_threshold=50, o_threshold=50, threshold_likeliest=False, train=True, test=True, save_jit=False, checkpoint=None, pred_output=None, ffmpeg_path=None)
E4:
Agruments: Namespace(no_wandb=True, run_name=None, wandb_dir='.', seed=42, device='gpu', gpus=1, epochs=100, patience=20, batch_size=8, batch_size_devtest=20, learning_rate=0.001, lr_scheduler='none', dataset='dgs_corpus', data_dir='.', data_dev=False, fps=25, pose='holistic', pose_components=['POSE_LANDMARKS', 'LEFT_HAND_LANDMARKS', 'RIGHT_HAND_LANDMARKS'], pose_reduce_face=False, hand_normalization=True, optical_flow=True, only_optical_flow=False, classes='bio', pose_projection_dim=256, hidden_dim=256, encoder_depth=1, encoder_bidirectional=True, encoder_autoregressive=False, weighted_loss=True, b_threshold=50, o_threshold=50, threshold_likeliest=False, train=True, test=True, save_jit=False, checkpoint=None, pred_output=None, ffmpeg_path=None)
We ran the experiments reported in the paper three times with seeds 1, 2, and 3, and you are running with seed 42. Looking at the standard deviation values in Table 4 in the appendix, I'd say your F1 numbers are within a reasonable range, but your IoU and % numbers are more suspicious. As @AmitMY pointed out, the latter are tied to the decoding process, which might bring more variance.

You can see in Table 4 that the IoU and % numbers are more variable (especially for E2) than F1, which is also reflected in your results. Another tendency we see is that the better the model, the less variable the results are, so perhaps focus more on experiments without Face (E2 is in general bad) and consider reproducing E1s and so on with encoder_depth=4, if you are interested in getting better segmentation quality.

I will check your logs in detail and compare them to ours to see whether I can spot something that might lead to the different results in IoU and %.
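Since the paper's runs use seeds 1, 2, and 3 while the reproduction uses 42, the simplest sanity check is whether a reproduced number falls within a couple of standard deviations of the per-seed mean. A small sketch; the numbers below are hypothetical placeholders, not values from Table 4:

```python
from statistics import mean, stdev

def within_range(reproduced, seed_runs, k=2):
    """True if `reproduced` lies within k sample standard deviations of the
    mean over the per-seed runs. seed_runs here are placeholder values."""
    mu, sigma = mean(seed_runs), stdev(seed_runs)
    return abs(reproduced - mu) <= k * sigma

# Hypothetical example: three seed runs of an IoU metric vs. a reproduction.
print(within_range(0.58, [0.64, 0.66, 0.68]))  # False: likely a real gap
print(within_range(0.63, [0.64, 0.66, 0.68]))  # True: within run-to-run noise
```

With only three seed runs the standard deviation estimate is rough, so this is a coarse filter rather than a proper significance test.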
We were able to get E0 through E4 running on our hardware setup. To get around the memory issue we stored the dataset in DiskList, a drop-in replacement for the standard Python list that stores the data on disk instead of in memory. We are running the training processes with the commands supplied on GitHub. The results we got are in the attached files.
We noticed for E1 and E2 that our scores are lower than what was reported in the paper. We were wondering if our results are around what you would expect, or if there could be an issue with our setup. Our end goal is to reproduce the results in the paper, so any advice on how we could modify the code to be more in line with the experiments in the paper would be much appreciated.
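For anyone else hitting the same memory issue: the idea behind DiskList can be illustrated with a tiny stdlib-only stand-in that pickles each item to a temporary file and keeps only byte offsets in memory. This is a sketch of the concept, not the DiskList package itself:

```python
import pickle
import tempfile

class TinyDiskList:
    """Minimal disk-backed, append-only list: items are pickled to a temp
    file; only (offset, length) pairs stay in RAM. An illustration of the
    DiskList idea, not the actual package's implementation."""

    def __init__(self):
        self._file = tempfile.TemporaryFile()
        self._index = []  # (offset, length) per stored item

    def append(self, item):
        data = pickle.dumps(item)
        self._file.seek(0, 2)  # seek to end of file
        self._index.append((self._file.tell(), len(data)))
        self._file.write(data)

    def __getitem__(self, i):
        offset, length = self._index[i]
        self._file.seek(offset)
        return pickle.loads(self._file.read(length))

    def __len__(self):
        return len(self._index)

items = TinyDiskList()
items.append({"pose": [1, 2, 3]})
items.append({"pose": [4, 5, 6]})
print(len(items), items[1])  # 2 {'pose': [4, 5, 6]}
```

Because each access is a disk seek plus an unpickle, iteration is much slower than an in-memory list, which is consistent with the longer training times reported above.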
e1.txt e4.txt e3.txt e2.txt e0.txt