liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0

Create simple evaluation framework and val dataset #22

Closed liyongsea closed 1 year ago

liyongsea commented 1 year ago

data format:

There's a distinction between line breaks that the author types explicitly and line breaks that a text editor inserts automatically to fit the text to the display width:

Hard line break (or hard return): This is when you manually press the "Enter" or "Return" key while typing in a text editor. This introduces a new line or paragraph. The term "hard" is used because the break is explicit and unchanging, regardless of the size or shape of the viewing window.

Soft line break (or soft return): This is a line break that's automatically inserted by the text editor when the text reaches the right margin of the viewing or page area. This line break is 'soft' because it automatically adjusts if you change the window size or the page layout. For example, in word processing software, if you resize the window, the software will automatically rewrap the lines to fit the new width.
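To illustrate the distinction, here is a minimal sketch using Python's standard `textwrap` module. It is not the project's code: the `render` helper and the `SAMPLE` string are hypothetical names made up for this example. It treats every `\n` in the input as a hard break and computes soft breaks on the fly for a given width.

```python
import textwrap

# Hypothetical sample: two paragraphs separated by one hard line break ("\n").
SAMPLE = "The quick brown fox jumps over the lazy dog.\nSecond paragraph."


def render(text: str, width: int) -> list[list[str]]:
    """Return one list of display lines per hard-broken paragraph.

    The outer list is fixed by the hard breaks in `text`.
    The inner lists depend on `width`, like soft breaks in an editor.
    """
    paragraphs = text.split("\n")  # hard breaks: explicit in the data
    # textwrap.wrap returns [] for an empty string, so keep blank lines as [""].
    return [textwrap.wrap(p, width=width) or [""] for p in paragraphs]


narrow = render(SAMPLE, width=20)
wide = render(SAMPLE, width=80)

# The number of paragraphs (hard breaks) is the same at both widths,
# while the number of display lines per paragraph (soft breaks) changes.
print(len(narrow), [len(p) for p in narrow])
print(len(wide), [len(p) for p in wide])
```

Once wrapped text is saved to a plain-text file, both kinds of break are stored as the same newline character, which is why telling them apart afterwards needs its own evaluation.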

liyongsea commented 1 year ago

https://github.com/liyongsea/parallel_corpus_mnbvc/pull/20/files#diff-9059cb17642979ca84961b98ea91db304811cdb567252d2c7a92c9ce6dd0bfb5

liyongsea commented 1 year ago

Solved by https://github.com/liyongsea/parallel_corpus_mnbvc/pull/20