hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

Use `SentencePair` struct instead of `str` internally #29

Open jelmervdl opened 1 year ago

jelmervdl commented 1 year ago

This comes mostly from me working on re-alignment. I'm moving the grand design I had in #26 over here, and making that pull request about just supporting alignment info passthrough.

  1. For alignments you need a concept of what a token is, since alignments are relations between tokens. Often these are whole words, but Marian thinks in SPM tokens, so it needs alignment information at that level.
  2. OpusTrainer really likes its tokens to be words, so we can apply all the modifiers to what we deem to be words.
  3. In general there's a 1-to-N mapping between words (or "moses tokens") and SPM tokens, so converting alignments between moses tokens into alignments between SPM tokens should be easy (see the sketch after this list).
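
To make the 1-to-N remapping concrete, here is a minimal sketch in Python. It assumes the SPM pieces for each word are already known; the function names are illustrative and not part of OpusTrainer:

from typing import List, Tuple

def piece_spans(words_as_pieces: List[List[str]]) -> List[Tuple[int, int]]:
  """(start, end) offsets of each word's pieces in the flattened piece list."""
  spans, offset = [], 0
  for pieces in words_as_pieces:
    spans.append((offset, offset + len(pieces)))
    offset += len(pieces)
  return spans

def remap_alignments(word_pairs: List[Tuple[int, int]],
                     src_pieces: List[List[str]],
                     trg_pieces: List[List[str]]) -> List[Tuple[int, int]]:
  """Expand word-level (src, trg) pairs into SPM-piece-level pairs."""
  src_spans, trg_spans = piece_spans(src_pieces), piece_spans(trg_pieces)
  return [(i, j)
          for s, t in word_pairs
          for i in range(*src_spans[s])
          for j in range(*trg_spans[t])]

For example, remap_alignments([(0, 0)], [['▁hel', 'lo']], [['▁hal', 'lo']]) yields [(0, 0), (0, 1), (1, 0), (1, 1)].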

Additional niceties from this:

  1. Currently every modifier does its own line, column, token and alignment-pair parsing and stringifying; a shared structure would do this once.
  2. Having a semantic structure for tokens allows us to mark tokens as special, so the placeholder modifier can insert its __source__ and __target__ tags, and other modifiers can easily skip around those without needing a complete list of tokens they can't touch.

So here's the plan:

  1. You tokenize your training data (e.g. with moses) so that every word and punctuation mark is separated by a space, because this is the input a word aligner expects.
  2. You generate the word alignment info for these tokens (e.g. with fast_align or eflomal).
  3. In your OpusTrainer config, you specify per dataset which tokenisation is used for each of your columns. (Since we're splitting up the columns here, this might well be an opportunity to allow reading columns from their own files.) By default the strategy is splitting and joining by whitespace: split() and ' '.join(). (A sketch of this interface follows this list.)
  4. You also specify a tokens strategy for the trainer. Same default, so when unspecified, tokens will be passed to the trainer as-is. However, you can use this to detokenize moses tokens, or to retokenize them into Marian's "I want plain text, but the alignment info needs to be on the SPM token level" format.
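
As a hedged sketch of what such a pluggable tokenisation strategy could look like (the Protocol and class names are assumptions for illustration, not the actual OpusTrainer API):

from typing import List, Protocol

class Tokenizer(Protocol):
  def tokenize(self, text: str) -> List[str]: ...
  def detokenize(self, tokens: List[str]) -> str: ...

class WhitespaceTokenizer:
  """Default strategy: split() when reading a column, ' '.join() when writing."""
  def tokenize(self, text: str) -> List[str]:
    return text.split()

  def detokenize(self, tokens: List[str]) -> str:
    return ' '.join(tokens)

A moses or SPM strategy could implement the same interface to back the moses:<lang> column types and the trainer-tokens option mentioned below.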

Implementation-wise:

  1. DatasetReaders yield a SentencePair type (see below) which holds the tokens and alignment info (a small reader sketch follows this list).
  2. Modifiers modify a SentencePair, and it will be easy to make sure alignment info stays valid while manipulating such a pair.
  3. (Optional) There's a Retokenize modifier that can be used to change between tokenisations. This can for example help with tokenising Chinese in case it wasn't tokenised into words, or with adding tokenisation to a dataset that didn't have it (but you should just update the file then, right?)
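
For item 1, a minimal sketch of a reader turning one "src<TAB>trg<TAB>alignments" line into the SentencePair structure defined further down. It assumes SentencePair gets a plain (src, trg, alignments) constructor, which is an assumption of this sketch, not a decision:

def read_pair(line: str) -> 'SentencePair':
  # Columns follow the moses:en / moses:de / alignments layout shown below.
  src, trg, aln = line.rstrip('\n').split('\t')
  alignments = [Pair(int(s), int(t))
                for s, t in (pair.split('-') for pair in aln.split())]
  return SentencePair(src.split(), trg.split(), alignments)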

Proposed yaml:

datasets:
  clean:
    path: data.tsv.gz
    columns:
      - moses:en
      - moses:de
      - alignments

Current yaml:

datasets:
  clean: path/to/data.tsv.gz

... will be interpreted as

datasets:
  clean:
    path: path/to/data.tsv.gz
    columns:
      - text # space-separated tokens
      - text
      - optional-alignments # same as alignments but won't add a third column to the output if there's no third column in the data

Supported types so far: text, moses:<lang>, alignments, optional-alignments.

Possible alternative yaml (the above can be a shorthand for this):

datasets:
  clean-alt:
    columns:
      - path: data.tsv.gz
        column: 0
        type: moses:en
      - path: data.tsv.gz
        column: 1
        type: moses:de
      - path: alignments.gz
        column: 0
        type: alignments

Also something for specifying which tokens the trainer uses:

trainer-tokens: spm-alignments-only:path/to/vocab.spm

Sentence Pair structure:

from typing import NamedTuple

class Pair(NamedTuple):
  src: int
  trg: int

class SentencePair:
  src: list[str]
  trg: list[str]
  alignments: list[Pair]

  # Thinking about adding methods for adding and removing tokens that keep `alignments` correct.
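
For illustration, a minimal sketch of what such a method could do to keep alignments valid when a source token is removed. It is written as a free function over the SentencePair sketch above; the name and the in-place mutation are assumptions, not the plan:

def remove_src_token(pair: SentencePair, index: int) -> None:
  # Drop the token, drop alignment pairs that pointed at it, and shift the
  # source index of every pair that pointed past it.
  del pair.src[index]
  pair.alignments = [
    Pair(p.src - 1 if p.src > index else p.src, p.trg)
    for p in pair.alignments
    if p.src != index
  ]
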
graemenail commented 1 year ago

#21 has a bit of a unique implementation in that it hides two training examples in one (line\nnoise), which doesn't seem to fit the SentencePair structure. Is the plan for modifiers to return a list of SentencePairs and add all of them to the batch?

What may be conceptually nicer is to have noise behave more like a dataset, since it is a source of 'data':

sources:
  - dataset: "clean-alt"
    columns:
      - path: data.tsv.gz
        column: 0
        type: moses:en
      - path: data.tsv.gz
        column: 1
        type: moses:de
      - path: alignments.gz
        column: 0
        type: alignments
  - noise: "noise"
    ranges:
      - "Basic Latin"
      - "Emoji"

start:
  - clean-alt 0.99
  - noise 0.01
  - crawled 0.00
  - until noise X # until X 'epochs' of noise

This would require:

  1. defining what a noise epoch is, but this could just be 1 example/epoch.
  2. having the ability to mark a SentencePair as non-modifiable; we (very probably) do not want to augment the noise.

(It may also be useful to have modifiers emit multiple SentencePairs.)

jelmervdl commented 1 year ago

I proposed Noise as a dataset to Nick as well; I agree with you that this makes the most sense.

  1. 1 sentence per epoch for these fake data sources makes sense to me. I don't expect a stage to ever be conditioned on how many noise epochs there are.
  2. My idea was to have non-modifiable tokens to wrap e.g. __start__ from the Tags modifier. A SentencePair from the Noise dataset could then just be a SentencePair made entirely from non-modifiable tokens.

This would cover our current need for having modifiers emit multiple SentencePairs, but it would not provide a solution for modifiers that remove SentencePairs (e.g. as the Tags modifier should do when it encounters bad alignment info). But that can also be solved by making modifiers be modifier(pair: SentencePair) -> Optional[SentencePair] (see the sketch below).
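
A hedged sketch of that signature, reusing the SentencePair structure sketched earlier; the validity check is purely illustrative:

from typing import Optional

def drop_if_misaligned(pair: SentencePair) -> Optional[SentencePair]:
  # Returning None removes the pair from the batch, e.g. when an alignment
  # index points outside the token lists.
  for a in pair.alignments:
    if a.src >= len(pair.src) or a.trg >= len(pair.trg):
      return None
  return pair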

I'm trying to come up with a scenario in which you'd want your modifier to behave like modifier(pair:SentencePair) -> list[SentencePair] that isn't better served by the fake dataset implementation. Something where you'd want to generate multiple sentences from a single one? Do you ever want to increase exposure to something based on a single sentence to train a particular phenomenon?

graemenail commented 1 year ago

Yes, keeping a mapping of what is non-modifiable is more flexible. Perhaps just SentencePair.is_modifiable() = any(self.modifiable) as a helper, so that pairs that have exhausted their modifiable tokens can be skipped easily.

Obviously Optional[SentencePair] is a stronger hint, but implementation-wise I am not against an empty list fulfilling the optionality.

I found this hard too; maybe you're training a bi-directional model and you want e.g. (en, it) in the same batch as (it, en). I think we probably need to keep the language along with the sentence pairs anyway to handle multi-lingual training. This would open up language-tagging modifiers: (en, [it] it) + (it, [en] en) and ([it] en, it) + ([en] it, en).

graemenail commented 1 year ago

Just to add to this, it may be useful to have a modifier ingest List[SentencePair]. An example usage may be joining two sentences into a single one (see the sketch below).
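
A hedged sketch of that idea over the SentencePair structure from above, assuming it gets a plain (src, trg, alignments) constructor; joining consecutive pairs and the function name are illustrative choices:

from typing import List

def join_adjacent(pairs: List[SentencePair]) -> List[SentencePair]:
  # Concatenate consecutive pairs; alignment indices of the second pair are
  # shifted by the token counts of the first.
  out = []
  for i in range(0, len(pairs) - 1, 2):
    a, b = pairs[i], pairs[i + 1]
    shifted = [Pair(p.src + len(a.src), p.trg + len(a.trg)) for p in b.alignments]
    out.append(SentencePair(a.src + b.src, a.trg + b.trg, a.alignments + shifted))
  if len(pairs) % 2:
    out.append(pairs[-1])
  return out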