Accurate name prior

A good prior distribution on names (person names such as first and last names, but also many other kinds, including place names) seems important whenever we want to model the possibility of typos in names. If we model an observed name field with a typo model but lack an accurate name prior, the model can easily infer that a correctly spelled name is really a different name with typos introduced. I encountered this when writing a simple model of first names. Here is a minimal example:

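using PClean

# `all_given_names` (a collection of known given names) and `df` (the observed
# table, with a `given_name` column) are assumed to be defined elsewhere.
PClean.@model CustomerModel begin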

    @class FirstNames begin
        name ~ StringPrior(1, 60, all_given_names)
    end

    @class Person begin
        given_name ~ FirstNames
    end;

    @class Obs begin
        begin
            person ~ Person
            given_name ~ AddTypos(person.given_name.name)
        end
    end;

end;

query = @query CustomerModel.Obs [
    given_name person.given_name.name given_name
];

observations = [ObservedDataset(query, df)]
config = PClean.InferenceConfig(5, 2; use_mh_instead_of_pg=true)
@time begin 
    tr = initialize_trace(observations, config);
    run_inference!(tr, config)
end
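
To make the failure mode concrete, here is a small self-contained sketch that does not use PClean. It assumes a toy typo likelihood in which every edit multiplies the probability by a constant `typo_prob`; this is a stand-in for illustration, not the actual `AddTypos` distribution. The helper names (`edit_distance`, `typo_likelihood`, `posterior`), the two candidate names, and the prior weights are all hypothetical.

```julia
# Levenshtein edit distance between two strings (standard dynamic program).
function edit_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))              # distances from the empty prefix of `a`
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Toy typo likelihood (an assumption, not PClean's AddTypos):
# each edit multiplies the probability by `typo_prob`.
typo_likelihood(observed, clean; typo_prob=0.05) =
    typo_prob^edit_distance(observed, clean)

# Normalized posterior over candidate clean names, given one observed string.
function posterior(observed, prior::Dict{String,Float64}; typo_prob=0.05)
    scores = Dict(name => p * typo_likelihood(observed, name; typo_prob=typo_prob)
                  for (name, p) in prior)
    z = sum(values(scores))
    return Dict(name => s / z for (name, s) in scores)
end

# Hypothetical priors: one that badly underweights "Jon", one that is less skewed.
skewed_prior   = Dict("John" => 0.99, "Jon" => 0.01)
balanced_prior = Dict("John" => 0.80, "Jon" => 0.20)

println(posterior("Jon", skewed_prior))
println(posterior("Jon", balanced_prior))
```

Under these made-up numbers, the skewed prior makes "John" the more probable clean value even though "Jon" was observed exactly, while the less skewed prior keeps "Jon". The typo model is identical in both runs; only the prior changes the answer.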

Coming up with a good name prior seems like a highly nontrivial task. Intuitively, a human performing this task would rely on prior experience with names, including common spellings, translations and transliterations, and knowledge of the variety of closely related names with common phonetic origins. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely across populations and sub-populations. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".

Intermediate steps could be to:

Train a more accurate n-gram text model on a data set of names.
Train or find an existing deep generative model for names.
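
As a rough illustration of the first step, the sketch below trains a character bigram model with add-k smoothing on a list of names and scores new strings by log-probability. It is plain Julia and independent of PClean; the function names (`train_bigram`, `logprob`), the boundary markers, and the tiny training list are hypothetical, and a realistic model would use a higher order and a much larger name corpus.

```julia
# Hypothetical boundary markers, assumed not to occur inside names.
const START = '<'
const STOP  = '>'

# Count character bigrams over a list of names.
function train_bigram(names; k=1.0)
    counts = Dict{Tuple{Char,Char},Int}()   # (previous char, next char) => count
    totals = Dict{Char,Int}()               # previous char => number of transitions
    alphabet = Set{Char}()                  # every "next" character seen, incl. STOP
    for name in names
        chars = [START; collect(lowercase(name)); STOP]
        for (a, b) in zip(chars[1:end-1], chars[2:end])
            counts[(a, b)] = get(counts, (a, b), 0) + 1
            totals[a] = get(totals, a, 0) + 1
            push!(alphabet, b)
        end
    end
    return (counts=counts, totals=totals, alphabet=sort!(collect(alphabet)), k=k)
end

# Log-probability of a name under the smoothed bigram model.
# Note: characters never seen in training receive the same smoothed mass as
# unseen in-alphabet transitions, so the model is only properly normalized
# over the training alphabet.
function logprob(model, name)
    chars = [START; collect(lowercase(name)); STOP]
    V = length(model.alphabet)
    lp = 0.0
    for (a, b) in zip(chars[1:end-1], chars[2:end])
        c = get(model.counts, (a, b), 0)
        t = get(model.totals, a, 0)
        lp += log((c + model.k) / (t + model.k * V))
    end
    return lp
end

# Tiny made-up training set, for illustration only.
model = train_bigram(["anna", "anne", "hannah", "john", "joan"])
println(logprob(model, "anna"))   # a name-like string
println(logprob(model, "xqzv"))   # an implausible string scores lower
```

A model like this assigns nonzero probability to names it has never seen, which is the property a pure lookup table lacks. How such a model would be plugged into PClean as a string prior is not shown here.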

Other steps that don't involve coming up with a name prior, but that might mitigate the issue described above, include:

Come up with a more precise typo model, or an approximate typo model that somehow alleviates the issue (e.g. by upper-bounding the number of typos in a name). (This should be a separate issue.)
Use a large data set of names as a directly observed table in the model. This is equivalent to using a name prior that is a frequency-weighted distribution over these names. (A likely issue with that approach is that a name not observed at least once in this data set might be likely to be corrected to a name that is.)
Change the Pitman-Yor parameters for the underlying name table to better match the statistics of real names and, more generally, to admit more rare names.
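
The last two options can be compared with the standard Pitman-Yor predictive rule (the two-parameter Chinese restaurant process). The sketch below is generic and does not reflect how PClean parametrizes its tables internally; the function name `pitman_yor_predictive`, the parameter names `strength` and `discount`, and the counts are hypothetical. Given counts for the names seen so far, it returns the probability of reusing each existing name and the probability of drawing a previously unseen name from the base distribution.

```julia
# Predictive probabilities under a Pitman-Yor process.
# `counts` maps each distinct name to the number of times it has been seen.
# Requires 0 <= discount < 1 and strength > -discount.
function pitman_yor_predictive(counts::Dict{String,Int}; strength=1.0, discount=0.0)
    n = sum(values(counts))      # total observations so far
    K = length(counts)           # number of distinct names so far
    denom = n + strength
    existing = Dict(name => (c - discount) / denom for (name, c) in counts)
    new_name = (strength + discount * K) / denom
    return existing, new_name
end

# Made-up counts, for illustration only.
counts = Dict("john" => 8, "anna" => 1, "jon" => 1)

_, p_new_low  = pitman_yor_predictive(counts; strength=1.0, discount=0.0)
_, p_new_high = pitman_yor_predictive(counts; strength=1.0, discount=0.5)
println((p_new_low, p_new_high))
```

As `strength` and `discount` both approach zero, the probability of a new name vanishes and the rule reduces to the frequency-weighted distribution over observed names, which is the regime where unseen names are most likely to be "corrected" to seen ones. Raising the discount shifts mass from frequent names toward rare and new ones.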

A review of the potential consequences of a biased name prior, along with approaches to reduce that bias or mitigate its downstream consequences, could also be valuable.

probcomp / PClean

Accurate name prior #20