paracrawl / Domain_Adaptation

InDomain detection is a tool designed to extract in-domain data from large collections of data.
GNU General Public License v3.0

Implement Moore-Lewis for selection #26

Closed wwaites closed 5 years ago

wwaites commented 5 years ago

Amir writes:

The selection is based on sentence length normalized language model
scoring of the crawl data using a monolingual source in-domain corpus. Well
in my opinion that is not a good model for data selection. We can at least
implement modified Moore-Lewis, which is kind of the baseline method for
data selection.
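For context, the (modified) Moore-Lewis method Amir mentions ranks pool sentences by the cross-entropy difference between an in-domain language model and a general/pool language model; lower scores mean more in-domain. The following is a minimal, illustrative sketch using a toy add-one-smoothed unigram LM as a stand-in for a real n-gram model (a real implementation would use a proper LM toolkit); all sentences and function names here are invented for illustration:

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Train an add-one-smoothed unigram LM (a toy stand-in for a real n-gram LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    def logprob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab))
    return logprob

def moore_lewis_score(sentence, lp_in, lp_out):
    """Per-word cross-entropy difference H_in(s) - H_out(s); lower = more in-domain."""
    words = sentence.split()
    h_in = -sum(lp_in(w) for w in words) / len(words)
    h_out = -sum(lp_out(w) for w in words) / len(words)
    return h_in - h_out

# Invented toy data: a medical "in-domain" sample and a mixed pool.
in_domain = ["the patient was given aspirin", "the doctor examined the patient"]
pool = ["the patient needs aspirin today",
        "stock prices fell sharply today",
        "the doctor saw the patient"]

lp_in = train_unigram(in_domain)
lp_out = train_unigram(pool)

# Rank the pool: most in-domain sentences first.
ranked = sorted(pool, key=lambda s: moore_lewis_score(s, lp_in, lp_out))
```

In this toy run the clearly out-of-domain sentence about stock prices ends up ranked last. Note that the per-word normalization already addresses the length bias that plain LM scoring suffers from.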
dionwiggins commented 5 years ago

Hi Amir,

This is the first version. The plan is to implement a series of different scoring methods that can be selected in future versions. We looked at other approaches, such as building models on both the in-domain sample and the pool data, but we decided to start simple. The reason is the size of the data: EN-DE Paracrawl alone is 5.6 billion sentences, and users may want to work with different sizes. If the second model were built from the full set of pool data, it could cause real issues in terms of time, storage, memory, etc. For example, tokenizing 5.6 billion sentences and then training a second model on the pool data would likely blow both memory and disk.

This is the V1 approach, and we will look at adding others not far down the road. I have already had a discussion with Philipp about training a second model from the pool data and about optimal ways to do that for version 2.

Regards,

Dion Wiggins Founder and CTO Omniscien Technologies



kpu commented 5 years ago

"This is because there is an issue of size of the data (i.e. 5.6 billion sentences of EN-DE Paracrawl) the user may want to use different sizes. If the second model was built of the full set of large pool data, it could cause real issues in terms of time, storage, memory etc. For example to tokenize 5.6 billion sentences and then train a second model on the pool data will likely blow both memory and disk."

Bullshit. GNU parallel the tokenizer and you don't need the whole pool data in RAM. That you have it in RAM is a design defect.
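kpu's streaming point can be sketched in Python: read the pool one line at a time and keep only the top-k candidates in a bounded heap, so memory stays constant regardless of pool size (the tokenizer itself can be parallelized across chunks with GNU parallel, as he suggests). The `score` function below is a deliberate placeholder, not the repository's actual scorer:

```python
import heapq

def score(sentence):
    # Placeholder scorer (higher = more in-domain); a real pipeline would
    # plug in an LM-based score such as the Moore-Lewis cross-entropy difference.
    return -len(sentence.split())

def top_k_sentences(lines, k):
    """Stream over `lines`, keeping only the k best-scoring sentences in memory."""
    heap = []  # min-heap of (score, sentence), size bounded by k
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue
        item = (score(sentence), sentence)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # New sentence beats the current worst of the top-k: swap it in.
            heapq.heapreplace(heap, item)
    # Return best-first.
    return [s for _, s in sorted(heap, reverse=True)]
```

Called as `top_k_sentences(open("pool.txt"), 1000)`, this holds at most k sentences in RAM at any time, however large the pool file is.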

dionwiggins commented 5 years ago

Done