probcomp / PClean

A domain-specific probabilistic programming language for scalable Bayesian data cleaning

Accurate name prior #20

Open marcoct opened 3 years ago

marcoct commented 3 years ago

A good prior distribution on person names (first names, last names, etc.), and on many other types of names, including place names, seems important whenever it is useful to model the possibility of typos in names. If we model an observed name field using a typo model without an accurate name prior, it is easy for the model to infer that a correctly spelled name is actually a misspelled version of a different name. I encountered this when writing a simple model of first names. Here is a minimal example:

PClean.@model CustomerModel begin

    @class FirstNames begin
        name ~ StringPrior(1, 60, all_given_names)
    end

    @class Person begin
        given_name ~ FirstNames
    end;

    @class Obs begin
        begin
            person ~ Person
            given_name ~ AddTypos(person.given_name.name)
        end
    end;

end;

query = @query CustomerModel.Obs [
    given_name person.given_name.name given_name
];

observations = [ObservedDataset(query, df)]
config = PClean.InferenceConfig(5, 2; use_mh_instead_of_pg=true)
@time begin 
    tr = initialize_trace(observations, config);
    run_inference!(tr, config)
end
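
To make the failure mode concrete, here is a back-of-the-envelope posterior odds calculation. It is only a sketch under assumptions: it uses a toy typo model in which each character edit multiplies the likelihood by a small factor $\varepsilon$ (this is not PClean's actual `AddTypos` likelihood), and the names and prior probabilities are made up for illustration. Suppose the observed value is "Jon" and the candidate clean names are "John" (one edit away) and "Jon" (zero edits away):

$$
\frac{P(\text{name}=\text{John} \mid \text{obs}=\text{Jon})}{P(\text{name}=\text{Jon} \mid \text{obs}=\text{Jon})}
= \frac{P(\text{John})}{P(\text{Jon})} \cdot \frac{P(\text{obs}=\text{Jon} \mid \text{John})}{P(\text{obs}=\text{Jon} \mid \text{Jon})}
\approx \frac{P(\text{John})}{P(\text{Jon})} \cdot \varepsilon
$$

With $\varepsilon = 0.01$, a prior that makes "John" 999 times more probable than "Jon" gives posterior odds of about $999 \times 0.01 \approx 10$ in favor of "John", so the model concludes that the correctly spelled "Jon" is a typo. If the prior ratio were instead 9, the odds would be $9 \times 0.01 = 0.09$, and "Jon" would be kept. Under this toy model, the "correction" happens whenever the prior ratio exceeds $1/\varepsilon$, which is why the accuracy of the prior for rarer names matters so much.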

Coming up with a good name prior seems like a very nontrivial task. Intuitively, a human performing this task would rely on their prior experience with names, including common spellings, translations and transliterations, and knowledge of the variety of closely related names with common phonetic origins. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely by population or sub-population. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".

Intermediate steps could be to

Other steps that don't involve coming up with a name prior, but might mitigate the issue mentioned above, might be:

Also, a review of the potential consequences of a biased name prior, and of approaches to reduce that bias and/or mitigate its downstream consequences, could be valuable.