jedgusse / project_lorenzo


Current splitting has very high variance in number of words... #8

Open emanjavacas opened 7 years ago

emanjavacas commented 7 years ago

... given the high variance in document length.

We should think of a less wasteful way of solving this than just cropping documents to a fixed max length.
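One less wasteful alternative (a minimal sketch, not code from this repo: `split_into_windows`, `max_len`, and `min_len` are hypothetical names/parameters) is to cut each document into consecutive fixed-size windows instead of cropping it, so every token is kept and sample lengths stay roughly uniform:

```python
# Hypothetical sketch: instead of cropping each document to `max_len` tokens
# and discarding the rest, cut it into consecutive windows of `max_len`
# so no tokens are wasted and sample lengths are (nearly) uniform.

def split_into_windows(tokens, max_len=500, min_len=50):
    """Yield consecutive chunks of `tokens`, each at most `max_len` long.

    Chunks shorter than `min_len` (usually only the final remainder)
    are dropped to keep sample lengths roughly uniform.
    """
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        if len(chunk) >= min_len:
            yield chunk


if __name__ == "__main__":
    doc = "word " * 1234
    chunks = list(split_into_windows(doc.split(), max_len=500))
    print([len(c) for c in chunks])  # [500, 500, 234] -- nothing cropped away
```

The trailing remainder could also be merged into the previous window if uniform length matters more than keeping chunks independent.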

mikekestemont commented 7 years ago

This is going to be tricky (getting a balanced dataset is really hard in AA), but I understand that you want all the data you can get for training the LM.

