Closed: swapniljadhav1921 closed this issue 3 years ago.
Google, Facebook, and HuggingFace do benchmarking on Tatoeba .. it's a small sentence dataset and it won't give drastically different translations.
At the time of writing this comment, Tatoeba has:
291 pairs for English - Telugu (https://tatoeba.org/eng/sentences/search?query=&from=eng&to=tel)
11,006 pairs for English - Hindi (https://tatoeba.org/eng/sentences/search?query=&from=eng&to=hin)
234 pairs for English - Kannada (https://tatoeba.org/eng/sentences/search?query=&from=eng&to=kan)
390 pairs for English - Tamil (https://tatoeba.org/eng/sentences/search?query=&from=eng&to=tam)
42,694 pairs for English - Marathi (https://tatoeba.org/eng/sentences/search?query=&from=eng&to=mar)
889 pairs for English - Malayalam (https://tatoeba.org/eng/sentences/search?query=&from=eng&to=mal)
Except for English - Hindi and English - Marathi, the data for the other language pairs consists mostly of very trivial sentences of 2-3 words. Accuracy results on this data will not be useful to anyone. For some of the translation pairs supported by Anuvaad there is no (or very little) parallel data available publicly.
BLEU scores on these pairs will never correspond to real-world translation quality (which will be significantly lower).
It will build more trust with users who are going to use it and invest time in it.
I don't see how that would help anyone. Anyone looking to use a pre-trained model should evaluate it on sample data relevant to their domain. I don't understand how adding scores on some test dataset helps anyone evaluate whether the pre-trained model works for them.
In fact, I don't provide accuracy numbers for any of my open-source projects (https://github.com/notAI-tech/NudeNet, https://github.com/notAI-tech/deepsegment, https://github.com/notAI-tech/DeepTranslit, https://github.com/notAI-tech/LogoDet) because I believe evaluating pre-trained models on data relevant to one's use-case is important. I don't care if someone doesn't want to use one of my repos because there are no accuracy numbers available. I get literally nothing from people using my repos. I built them for my own use-cases and open-sourced them because someone might benefit from them.
Having said that, since the Tatoeba dataset and the Anuvaad models are open-source, anyone is welcome to run the tests themselves and report the accuracy scores, and I will add them to the README. I just refuse to run them myself because I don't see how they are helpful.
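For anyone who wants to run such a check, it could look roughly like the following. This is only a minimal sketch: `translate` is a hypothetical placeholder for whichever model is being evaluated, and it assumes sacrebleu is installed and the Tatoeba pairs have been exported as a tab-separated file of source and reference sentences.

```python
# Minimal sketch: score a translation model on Tatoeba-style sentence pairs.
# Assumes a tab-separated file with the source sentence in the first column
# and the reference translation in the second.
import csv

import sacrebleu


def load_pairs(tsv_path):
    """Read (source, reference) sentence pairs from a tab-separated file."""
    sources, references = [], []
    with open(tsv_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                sources.append(row[0])
                references.append(row[1])
    return sources, references


def translate(sentences):
    """Hypothetical placeholder -- replace with a call to the model under test."""
    raise NotImplementedError


if __name__ == "__main__":
    sources, references = load_pairs("eng-hin.tsv")
    hypotheses = translate(sources)
    # sacrebleu expects a list of hypothesis strings and a list of reference lists
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU on {len(sources)} sentence pairs: {bleu.score:.2f}")
```

Numbers produced this way are, of course, subject to the caveats above about sentence length and domain.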
It seems like you took my comments negatively; I hope that is not the case. I wrote from a general model open-sourcing point of view. From a user's point of view: there are X libraries, and with some benchmark I will shortlist the top 2-3 and then evaluate those instead of checking all of them. But again, your repo, your rules. I will check out of my own curiosity. Closing the issue for now. Thanks.
@swapniljadhav1921 I apologize for the tone I used in my reply. I completely understand where you are coming from. I am currently running the benchmarks on Tatoeba (I still feel the numbers are not a real indication of translation quality, but I realise that a lot of users do want to see them) and will update the repo in a few hours with the scores and predictions.
I will keep this issue open till then.
Again, my sincere apologies for the tone of my reply.
@swapniljadhav1921 I have updated the README with the scores; the data and scripts are available at https://github.com/notAI-tech/Anuvaad-testing-scripts .
I am closing this issue for now. Feel free to re-open if needed.
Hi,
I read your point of view here, but in-depth visual inspection and evaluation is still not possible, as with language there are many possibilities and variations. Hence, benchmarking should be done regardless.
Google, Facebook, and HuggingFace do benchmarking on Tatoeba .. it's a small sentence dataset and it won't give drastically different translations. I suggest you post these BLEU scores with your models. It will build more trust with users who are going to use them and invest time in them.
Thanks.