Closed wwaites closed 5 years ago
Hi Amir,
This is the first version. The plan is to implement a series of different scoring solutions that can be selected in future versions. We looked at other approaches such as models on the source sample and the pool data, but we decided to go simple to start. This is because there is an issue of size of the data (i.e. 5.6 billion sentences of EN-DE Paracrawl) the user may want to use different sizes. If the second model was built of the full set of large pool data, it could cause real issues in terms of time, storage, memory etc. For example to tokenize 5.6 billion sentences and then train a second model on the pool data will likely blow both memory and disk.
This is the V1 approach, and we will look at adding others not far down the road. I have already discussed with Philipp training a second model from the pool data and looking at optimal ways to do that for version 2.
Regards,
Dion Wiggins Founder and CTO Omniscien Technologies
From: wwaites, Thursday, May 23, 2019. Subject: [paracrawl/Domain_Adaptation] Implement Moore-Lewis for selection (#26)
Amir writes: The selection is based on sentence-length-normalized language model scoring of the crawl data using a monolingual source in-domain corpus. Well, in my opinion that is not a good model for data selection. We can at least implement modified Moore-Lewis, which is kind of the baseline method for data selection.
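For context, modified Moore-Lewis ranks each pool sentence by the cross-entropy difference between an in-domain language model and a general-domain (pool) language model; lower scores mean more in-domain-like. A minimal sketch of the scoring, using toy add-one-smoothed unigram models as stand-ins for real language models (all sentences and names below are illustrative, not project code):

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Toy add-one-smoothed unigram LM; a real system would use e.g. KenLM."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    return counts, total, vocab

def cross_entropy(sentence, model):
    """Per-token cross-entropy of a sentence under the unigram model."""
    counts, total, vocab = model
    toks = sentence.split()
    if not toks:
        return 0.0
    logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
    return -logp / len(toks)

def moore_lewis_score(sentence, in_domain_lm, pool_lm):
    # Cross-entropy difference H_in(s) - H_pool(s): lower = more in-domain.
    return cross_entropy(sentence, in_domain_lm) - cross_entropy(sentence, pool_lm)

# Illustrative data only.
in_lm = train_unigram(["the patient received a dose", "clinical trial results"])
pool_lm = train_unigram(["buy cheap tickets online",
                         "the weather is nice today",
                         "the patient felt better"])
scored = sorted(["the patient received treatment", "cheap tickets for sale"],
                key=lambda s: moore_lewis_score(s, in_lm, pool_lm))
```

Selection then keeps the lowest-scoring fraction of the pool. Note that the per-token normalization already addresses the sentence-length bias that plain LM scoring suffers from.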
"This is because there is an issue of size of the data (i.e. 5.6 billion sentences of EN-DE Paracrawl) the user may want to use different sizes. If the second model was built of the full set of large pool data, it could cause real issues in terms of time, storage, memory etc. For example to tokenize 5.6 billion sentences and then train a second model on the pool data will likely blow both memory and disk."
Bullshit. Run the tokenizer under GNU parallel and you don't need the whole pool of data in RAM. That you have it in RAM is a design defect.
Done
Amir writes: