Closed PonteIneptique closed 1 year ago
Thanks, I wanted to write some integration tests for the ptl modules but stopped after ending up with way too many cases. It's a good start and I'll probably add some more in the future.
Thanks for the merge :)
Just a quick note: I am still quite unsatisfied with the naming scheme of codec merging. Specifically, `add` and `both` are near-synonyms to me (`add` adds the new characters, `both` suggests using both sets of characters [while it actually only uses the new dataset vocabulary]). While it might be a breaking change, I would very much recommend moving to `add` and `replace` or something like this.
To make the change from `both` to something such as `replace` clean, we could keep accepting the `both` argument, emit a DeprecationWarning, and map it to `replace`.
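A minimal sketch of that deprecation path, for illustration only. The function name `normalize_merge_mode` and the `VALID_MODES` set are hypothetical and not part of kraken; only the `warnings` calls are standard library.

```python
import warnings

# Hypothetical set of accepted mode names after the rename (assumption).
VALID_MODES = {"add", "replace"}


def normalize_merge_mode(mode: str) -> str:
    """Return the canonical mode name, mapping the legacy 'both' to 'replace'."""
    if mode == "both":
        # Legacy value: still accepted, but the caller is told to migrate.
        warnings.warn(
            "'both' is deprecated, use 'replace' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return "replace"
    if mode not in VALID_MODES:
        raise ValueError(f"unknown codec merging mode: {mode!r}")
    return mode
```

Existing scripts passing `both` would keep working this way, and the warning gives users a release cycle or two to switch.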
I support this, and I would also support the possibility of training while excluding new characters. Something that would be clear and not break previous behavior could be:
- `old_only`
- `new_only` (corresponds to today's `both`)
- `old_and_new` (corresponds to today's `add`)
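To make the three proposed names concrete, here is a rough sketch on plain character sets. It is not how kraken merges codecs (that also involves label mappings and resizing the output layer). The function `merge_alphabet` is a hypothetical helper that only illustrates which vocabulary each mode would keep.

```python
from typing import Set


def merge_alphabet(old: Set[str], new: Set[str], mode: str) -> Set[str]:
    """Illustrate the vocabulary kept by each proposed mode (hypothetical helper)."""
    if mode == "old_only":
        # Keep the existing model vocabulary, ignoring characters only in the new data.
        return set(old)
    if mode == "new_only":
        # Use only the new dataset vocabulary (today's 'both').
        return set(new)
    if mode == "old_and_new":
        # Union of the two vocabularies (today's 'add').
        return old | new
    raise ValueError(f"unknown mode: {mode!r}")


old_chars = {"a", "b", "c"}
new_chars = {"b", "c", "d"}
for mode in ("old_only", "new_only", "old_and_new"):
    print(mode, sorted(merge_alphabet(old_chars, new_chars, mode)))
```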
Maybe we should open a specific issue for this ;) See https://github.com/mittagessen/kraken/issues/478
Hi there :) We were tracking down a bug the other day and prepared a dummy dataset to reproduce it. I thought it would be good to propose integration tests which check a suite of combinations (Unicode normalization, fine-tuning, Arrow or XML).
I left the data used to generate the Arrow files in case the Arrow format changes one day and we need to regenerate them.
These check things a little differently from `test_train.py`, so I hope you will find them useful.
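For readers wondering what such a combination suite looks like, here is a rough sketch of enumerating the matrix. The axis values (`NFC`/`NFD`, the `arrow`/`xml` labels) and the case id format are assumptions for illustration, not the values used in the actual tests.

```python
from itertools import product

# Assumed axes of the test matrix; the real tests may use different values.
NORMALIZATIONS = ["NFC", "NFD"]
FINE_TUNE = [False, True]
FORMATS = ["arrow", "xml"]

# One tuple per combination: (normalization, fine_tune, data_format).
CASES = list(product(NORMALIZATIONS, FINE_TUNE, FORMATS))


def case_id(normalization: str, fine_tune: bool, data_format: str) -> str:
    """Build a readable identifier for one combination."""
    stage = "finetune" if fine_tune else "scratch"
    return f"{normalization}-{stage}-{data_format}"


for case in CASES:
    print(case_id(*case))
```

A list like `CASES` could then be passed to `pytest.mark.parametrize` so each combination shows up as its own test.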