Allow use of multiple dictionary files, combined at run time

iskunk commented 4 years ago

It would be convenient to be able to add OOV words to a small, user-specific dictionary file, rather than have to modify the very large base dictionary. This would become possible if the aligner frontend allowed the user to specify multiple -d dictionary files, which are then combined and sorted together at run time.

An additional nicety would be to allow the files to remain unsorted on disk (it may be useful to store the words in some non-alphabetical order), as well as automatically filter out blank lines and # comment lines (which may be used to organize and document the dictionary in some manner).

kylebgorman commented 4 years ago

I’d accept a pull request to this effect if you want to implement it.

I see how it’s slightly easier for the user but to me the utility is really minor. (Also I don’t do that much forced alignment myself.)

On Mon, Jan 13, 2020 at 1:54 AM Daniel Richard G. notifications@github.com wrote:

It would be convenient to be able to add OOV words to a small, user-specific dictionary file, rather than have to modify the very large base dictionary. This would become possible if the aligner frontend allowed the user to specify multiple -d dictionary files, which are then combined and sorted together at run time.

An additional nicety would be to allow the files to remain unsorted on disk (it may be useful to store the words in some non-alphabetical order), as well as automatically filter out blank lines and # comment lines (which may be used to organize and document the dictionary in some manner).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/prosodylab/Prosodylab-Aligner/issues/76?email_source=notifications&email_token=AABG4OO5AS2H2OFFHT65J7DQ5QF2PA5CNFSM4KF5JL3KYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IFVC5SA, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OK2RUVKXFF3XBLB4UTQ5QF2PANCNFSM4KF5JL3A .

iskunk commented 4 years ago

I can put something together. Tell me if you agree with this approach:

Have all multiple-dictionary logic in aligner/__main__.py; Corpus still takes a single file for opts['dictionary']
If a single dictionary file is given, it is assumed to be sorted and devoid of blank/comment lines
If two or more dictionary files are given, they are combined, filtered, and sorted into a temporary file that is then passed in as opts['dictionary'] to Corpus.

Alternative: Add a new option (e.g. -D) that is equivalent to -d, only it can be specified multiple times (-d continues to be a singleton), and it explicitly requests the additional processing.

kylebgorman commented 4 years ago

Hmm I don't like the bifurcation in the design where they're sort of combined behind the scenes but only from __main__. Would prefer for it to happen everywhere. I would prefer to just have -d become a "nargs=+" type (no reason to add a new option -D, that I see) I think.

opts["dictionary"] should also be able to handle multiple dictionaries since YAML does lists just fine.

Obviously documentation will also have to reflect this change---just delete the suggestion to cat and re-sort and say that multiple ones can be specified.

K

On Mon, Jan 13, 2020 at 5:43 PM Daniel Richard G. notifications@github.com wrote:

I can put something together. Tell me if you agree with this approach:

Have all multiple-dictionary logic in aligner/main.py; Corpus still takes a single file for opts['dictionary']

If a single dictionary file is given, it is assumed to be sorted and devoid of blank/comment lines

If two or more dictionary files are given, they are combined, filtered, and sorted into a temporary file that is then passed in as opts['dictionary'] to Corpus.

Alternative: Add a new option (e.g. -D) that is equivalent to -d, only it can be specified multiple times (-d continues to be a singleton), and it explicitly requests the additional processing.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/prosodylab/Prosodylab-Aligner/issues/76?email_source=notifications&email_token=AABG4ONW3PRZZTK6S4AJ63LQ5TVC3A5CNFSM4KF5JL3KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI2SMJQ#issuecomment-573908518, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OIZAN2NV7S7JPDFJU3Q5TVC3ANCNFSM4KF5JL3A .

iskunk commented 4 years ago

The reason why I suggested this approach was to avoid pre-processing the dictionary in the (common?) case where the user is just using the stock dictionary. I don't think it would be onerous if this happened, however; the toolchain is already using temp files anyway.

opts['dictionary'] could be a list. I see that HDMan appears to accept multiple dictionary files:

USAGE: HDMan [options] newDict srcDict1 srcDict2 ...

A couple questions regarding this program:

If you give multiple dictionaries to this tool, is it equivalent to a single combined dictionary?
If both dictionaries contain a given word, which one wins?
I assume multiple dictionaries will need to be each in sorted order?

kylebgorman commented 4 years ago

Don't know the answer to any of those (good) questions; HDMan is even more of a mystery to me than the rest of HTK. Guess we'll jsut have to test?

On Mon, Jan 13, 2020 at 7:54 PM Daniel Richard G. notifications@github.com wrote:

The reason why I suggested this approach was to avoid pre-processing the dictionary in the (common?) case where the user is just using the stock dictionary. I don't think it would be onerous if this happened, however; the toolchain is already using temp files anyway.

opts['dictionary'] could be a list. I see that HDMan appears to accept multiple dictionary files:

USAGE: HDMan [options] newDict srcDict1 srcDict2 ...

A couple questions regarding this program:

If you give multiple dictionaries to this tool, is it equivalent to a single combined dictionary?

If both dictionaries contain a given word, which one wins?

I assume multiple dictionaries will need to be each in sorted order?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/prosodylab/Prosodylab-Aligner/issues/76?email_source=notifications&email_token=AABG4OKNLNKU7XN3SK7IR5LQ5UEM3A5CNFSM4KF5JL3KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI23U3A#issuecomment-573946476, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OJDL5ECBACKMOIWTMLQ5UEM3ANCNFSM4KF5JL3A .

iskunk commented 4 years ago

Okay, I've put together a pull request (#78). Here are my findings:

HDMan accepts multiple source dictionaries, and from my limited testing, this appears to be equivalent to a single source dictionary combined using external tools.
HDMan requires each source dictionary to be sorted.
HDMan appears to filter out blank lines as well as comments (although I don't see anything about it treating a semicolon as a comment marker; look at the two SkipComment() functions in the HTK source).

I presume you have much more test material than I do; it would be great if you could double-check that breaking apart one large dictionary into several will still yield the same results.

Because HDMan already seems to be relatively flexible in terms of processing source dictionaries, I punted on the "sort the dictionaries (together) and write out temp file(s)" part. It's at the point of diminishing returns, IMO.

Please look through my changes, and let me know your thoughts.

kylebgorman commented 4 years ago

Minimal workable example of "+" nargs... so long as you provide at least one argument after -d, args.dictionary is a list.

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dictionary", nargs="+")
    print(parser.parse_args())

kylebgorman commented 4 years ago

Marking closed since I believe #78 took care of things?

prosodylab / Prosodylab-Aligner

Allow use of multiple dictionary files, combined at run time #76