jedgusse / project_lorenzo


Current splitting has very high variance in number of words... #8

Open emanjavacas opened 7 years ago

emanjavacas commented 7 years ago

... given the high variance in document length.

We should think of a less wasteful way of solving this than just cropping documents to a fixed max length.
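One less wasteful alternative (a minimal sketch, not code from this repo: `split_into_windows`, `max_len`, and `min_len` are hypothetical names/parameters) is to cut each document into consecutive fixed-size windows instead of cropping it, so every token is kept and sample lengths stay roughly uniform:

```python
# Hypothetical sketch: instead of cropping each document to `max_len` tokens
# and discarding the rest, cut it into consecutive windows of `max_len`
# so no tokens are wasted and sample lengths are (nearly) uniform.

def split_into_windows(tokens, max_len=500, min_len=50):
    """Yield consecutive chunks of `tokens`, each at most `max_len` long.

    Chunks shorter than `min_len` (usually only the final remainder)
    are dropped to keep sample lengths roughly uniform.
    """
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        if len(chunk) >= min_len:
            yield chunk


if __name__ == "__main__":
    doc = "word " * 1234
    chunks = list(split_into_windows(doc.split(), max_len=500))
    print([len(c) for c in chunks])  # [500, 500, 234] -- nothing cropped away
```

The trailing remainder could also be merged into the previous window if uniform length matters more than keeping chunks independent.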

mikekestemont commented 7 years ago

This is going to be tricky (getting a balanced dataset is really hard in AA), but I understand that you want all the data you can get for training the LM.

