jelmervdl opened 1 year ago
The Noise modifier generates entirely new lines (e.g. `line\nnoise`) - that doesn't seem to fit in the SentencePair structure. Is the plan for modifiers to return a list of SentencePairs, and add all of them to the batch?

What may be conceptually nicer is to have noise behave more like a dataset, since it is a source of 'data':
```yaml
sources:
  - dataset: "clean-alt"
    columns:
      - path: data.tsv.gz
        column: 0
        type: moses:en
      - path: data.tsv.gz
        column: 1
        type: moses:de
      - path: alignments.gz
        column: 0
        type: alignments
  - noise: "noise"
    ranges:
      - "Basic Latin"
      - "Emoji"

start:
  - clean-alt 0.99
  - noise 0.01
  - crawled 0.00
  - until noise X # until X 'epochs' of noise
```
This would require:
(It may also be useful to have modifiers emit multiple SentencePairs as well)
I proposed Noise as a dataset to Nick as well; I agree with you that this makes the most sense. The open question is how to define how many `noise` epochs there are.

Marking tokens as non-modifiable would also cover tokens like `__start__` from the Tags modifier. A SentencePair from the Noise dataset could then just be a SentencePair made entirely of non-modifiable tokens.

This would cover our current need for having modifiers emit multiple SentencePairs, but would not provide a solution for modifiers that remove SentencePairs (e.g. as the Tags modifier should do when it encounters bad alignment info). But that can also be solved by making modifiers be `modifier(pair: SentencePair) -> Optional[SentencePair]`.
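A minimal sketch of that shape, assuming the `SentencePair` type sketched at the end of this thread (the function name and tagging step are illustrative, not settled):

```python
from __future__ import annotations
from typing import Optional

def tags_modifier(pair: SentencePair) -> Optional[SentencePair]:
    """Sketch: a modifier that returns None removes the pair from the batch."""
    if not pair.alignment:
        # Bad or missing alignment info: drop the pair entirely
        return None
    # ... apply __source__/__target__ tagging here ...
    return pair
```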
I'm trying to come up with a scenario in which you'd want your modifier to behave like `modifier(pair: SentencePair) -> list[SentencePair]` that isn't better served by the fake dataset implementation. Something where you'd want to generate multiple sentences from a single one? Do you ever want to increase exposure to something based on a single sentence, to train a particular phenomenon?
Yes, keeping a mapping of non-modifiable tokens is more flexible. Perhaps just `SentencePair.is_modifiable() = any(self.modifiable)` as a helper, so that pairs that have exhausted their modifiable tokens can be skipped easily.
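For instance, assuming a batch is a list of SentencePairs, filtering could be as simple as this sketch:

```python
# Sketch: skip pairs that have run out of modifiable tokens
batch = [pair for pair in batch if pair.is_modifiable()]
```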
Obviously `Optional[SentencePair]` is a stronger hint, but implementation-wise I am not against an empty list fulfilling the optionality.
I found this hard too; maybe you're training a bi-directional model and you want e.g. `(en, it)` in the same batch as `(it, en)`. I think we probably need to keep the language along with the sentence pairs anyway to handle multi-lingual training. This would open up language-tagging modifiers: `(en, [it] it)` + `(it, [en] en)` and `([it] en, it)` + `([en] it, en)`.
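A rough sketch of such a modifier, using plain string pairs for brevity (the function name and signature are made up; the tag layout follows the examples above):

```python
def tag_bidirectional(src_lang: str, src: str, trg_lang: str, trg: str) -> list[tuple[str, str]]:
    """Emit both directions of a pair, tagging the language on either side."""
    return [
        (src, f"[{trg_lang}] {trg}"),  # (en, [it] it)
        (trg, f"[{src_lang}] {src}"),  # (it, [en] en)
        (f"[{trg_lang}] {src}", trg),  # ([it] en, it)
        (f"[{src_lang}] {trg}", src),  # ([en] it, en)
    ]
```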
Just to add to this, it may be useful to have a modifier ingest `List[SentencePair]`. An example usage may be joining two sentences into a single one.
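A sketch of what that could look like, again with plain string pairs and a hypothetical pairwise windowing scheme:

```python
def join_adjacent(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sketch: join each two adjacent sentence pairs into one longer pair."""
    out = []
    for i in range(0, len(pairs) - 1, 2):
        (src1, trg1), (src2, trg2) = pairs[i], pairs[i + 1]
        out.append((f"{src1} {src2}", f"{trg1} {trg2}"))
    if len(pairs) % 2:  # pass the odd one out through unchanged
        out.append(pairs[-1])
    return out
```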
This comes mostly from me working on re-alignment. I'm moving the grand design I had in #26 to here, and making that pull request more about just supporting alignment info passthrough.
Additional niceties from this:

- Tokens like `__source__` and `__target__` can be marked non-modifiable, so the tag and other modifiers can easily skip around those without having to keep a complete list of tokens they can't touch.

So here's the plan:

- Tokenisation defaults to just `split()` and `' '.join()`.
- A `tokens` strategy for the trainer. Same default, so when unspecified, tokens will be passed to the trainer as-is. However, you can use this to detokenize moses tokens, or retokenize them into marian's "I want plain text, but the alignment info needs to be on the spm token level".

Implementation-wise:

- A `SentencePair` type (see below) which holds the tokens and alignment info.
- Modifiers operate on a `SentencePair`, and it will be easy to make sure alignment info stays valid while manipulating such a pair.
- A `Retokenize` modifier that can be used to change between tokenisations. This can for example help with tokenising Chinese in case it wasn't tokenised into words, or with adding tokenisation to a dataset that didn't have it (but you should just update the file then, right?)

Yaml:
Current yaml:
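For instance (a sketch, assuming the current one-path-per-dataset format; the dataset name and path are illustrative):

```yaml
datasets:
  clean: path/to/clean.tsv.gz
```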
... will be interpreted as
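perhaps something like this sketch, using the `text` and `optional-alignments` defaults from the type list below:

```yaml
datasets:
  clean:
    - path: path/to/clean.tsv.gz
      column: 0
      type: text
    - path: path/to/clean.tsv.gz
      column: 1
      type: text
    - path: path/to/clean.tsv.gz
      column: 2
      type: optional-alignments
```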
Supported types so far:

- `text`, which uses the `SpaceTokenizer`, which just does `text.split()` and `' '.join(tokens)`.
- `moses:{lang}` uses sacremoses.
- `spm:{vocab}` uses sentencepiece. Right now the actual tokens that come out of this aren't used, except for adjusting all the alignment indices so they map from moses tokens to what they would be if the text were spm tokens (no sampling though).
- `alignments` just parses `{m}-{n}` pairs into `list[Pair]`.
- `optional-alignments` does the same, except it will return `None` if there is no third column. If there is an empty third column, it will return `[]`. (TODO: is this necessary? This is just here so we can still automatically deal with 2-col and 3-col data without having to specify it in the yaml explicitly.)

Possible alternative yaml (the above can be a shorthand for this):
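Perhaps something like this sketch, reusing the dataset from the snippet at the top; splitting `moses:{lang}` into explicit `tokenizer` and `language` keys is purely an assumption:

```yaml
datasets:
  clean-alt:
    - path: data.tsv.gz
      column: 0
      tokenizer: moses
      language: en
    - path: data.tsv.gz
      column: 1
      tokenizer: moses
      language: de
    - path: alignments.gz
      column: 0
      type: alignments
```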
Also something for specifying which tokens the trainer uses:
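A sketch of what that could look like; the `trainer`/`tokens` keys and the vocab filename are assumptions, while the values reuse the types listed above:

```yaml
trainer:
  # Default: tokens are passed to the trainer as-is
  tokens: text
  # Alternative (assumed syntax): detokenize moses tokens, but keep
  # the alignment info on the spm token level for the trainer
  # tokens: spm:vocab.ende.spm
```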
Sentence Pair structure:
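A minimal sketch of what this could look like (field and class names are assumptions pieced together from the discussion above, including the `is_modifiable()` helper and the `list[Pair]` alignment representation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pair:
    src: int  # index into the source tokens
    trg: int  # index into the target tokens

@dataclass
class Sentence:
    tokens: list[str]
    modifiable: list[bool]  # False for e.g. __source__/__target__ tags or noise

@dataclass
class SentencePair:
    src: Sentence
    trg: Sentence
    # None when the dataset had no alignment column ('optional-alignments')
    alignment: Optional[list[Pair]]

    def is_modifiable(self) -> bool:
        # Helper so pairs with no modifiable tokens left can be skipped
        return any(self.src.modifiable) or any(self.trg.modifiable)
```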