magnusdv / pedprobr

Pedigree probabilities in R
GNU General Public License v2.0
4 stars 0 forks source link

Predicting profiles #9

Open thoree opened 4 years ago

thoree commented 4 years ago
# A feature wish: a ranked list of possible profiles. A small example:

library(pedprobr)
#> Loading required package: pedtools
x = nuclearPed(2, father = "FA")
m1 = marker(x, "3" = 1, "4" = 1:2, alleles = 1:2)
m2 = marker(x, "3" = 1, "4" = 2, alleles = 1:2)
x = setMarkers(x, list(m1, m2))
p1 = oneMarkerDistribution(x, "FA",  partialmarker = 1, verbose = F)
p2 = oneMarkerDistribution(x, "FA",  partialmarker = 2, verbose = F)

# Apriori there are 9 possible genotype
# combinations (profiles) for 'FA' for these markers.
# The (posterior) probabilities are

p1%o%p2
#>     1/1       1/2 2/2
#> 1/1   0 0.3333333   0
#> 1/2   0 0.6666667   0
#> 2/2   0 0.0000000   0

# More generally, it would be interesting to
# have a list of the possible profiles for FA for n markers
# ranked according to how probable they are. This could then
# be used to search for "FA", say in a database of missing persons.

Created on 2019-11-09 by the reprex package (v0.3.0)

magnusdv commented 4 years ago

Thanks, that is an interesting idea, which shouldn't be too difficult to implement. I'm not sure exactly how you envision the output to look like, though. You say a "list of profiles", but could you give an explicit example?

Below is a quick attempt using your example, resulting in a sorted data frame. Is this in the right direction?

library(pedprobr)
#> Loading required package: pedtools

x = nuclearPed(2, father = "FA")
m1 = marker(x, "3" = 1, "4" = 1:2, alleles = 1:2)
m2 = marker(x, "3" = 1, "4" = 2, alleles = 1:2)

p1 = oneMarkerDistribution(x, "FA",  partialmarker = m1, verbose = F)
p2 = oneMarkerDistribution(x, "FA",  partialmarker = m2, verbose = F)

# Posterior probabilities
pp = p1 %o% p2

# Indices of positive entries
idx = which(pp > 0, arr.ind = T)

# Result as data.frame
res = data.frame(
  M1 = rownames(pp)[idx[, 'row']], 
  M2 = colnames(pp)[idx[, 'col']], 
  Prob = pp[pp > 0], 
  stringsAsFactors = F)

# Sort
res[order(res$Prob, decreasing = T), , drop = F]
#>    M1  M2      Prob
#> 2 1/2 1/2 0.6666667
#> 1 1/1 1/2 0.3333333

Created on 2019-11-11 by the reprex package (v0.3.0)

thoree commented 4 years ago

Very nice! I tried this on an MP example, F5. Then one could restrict attention to markers for which EP>0. However, in this example, and I fear in most MP examples, the list will be too long and no candidate profiles will stand out as as clearly more likely. In some cases, like the below one from your Argentina talk I would think the profile of MP could be shortlisted

image,

magnusdv commented 4 years ago

Ok, here is a sketch of a function, which (I think) does more or less what you want. I don't know how it will react when the dimensions blow up, but I included a parameter maxPedMarker which limits the number of genotypes considered for each marker.

library(pedprobr)
#> Loading required package: pedtools

rankProfiles = function(x, id, markers = NULL, maxPerMarker = Inf, verbose = TRUE) {

  if(!hasMarkers(x)) 
    stop("No markers attached to pedigree")

  if(is.null(markers))
    markers = seq_len(nMarkers(x))

  # Extract wanted markers
  x = selectMarkers(x, markers)
  nmark = nMarkers(x)

  # Marker names
  mnames = name(x, 1:nmark)
  mnames[is.na(mnames)] = as.character((1:nmark)[is.na(mnames)])

  # Posterior distributions for each marker
  omdList = lapply(1:nmark, function(i) {
    omd = oneMarkerDistribution(x, ids = id, partialmarker = i, verbose = FALSE)
    a = omd[omd > 0, drop = F]

    if(maxPerMarker < length(a))
      a = sort(a, decreasing = T)[1:maxPerMarker]

    # Trick to ensure proper naming in `expand.grid` below
    names(dimnames(a)) = mnames[i]

    if(verbose) {
      print(a)
      cat("\n")
    }

    a
  })

  # Array with total posterior probs
  pp = Reduce(`%o%`, omdList)

  # Transform dimnames into data frame with possible profiles
  profiles = expand.grid(dimnames(pp))

  # Add probabilities
  prob = as.numeric(pp)
  profiles$Posterior = prob

  # Sort
  profiles = profiles[order(prob, decreasing = T), , drop = F]
  row.names(profiles) = NULL

  profiles
}

And here's an expanded version of your original example:

x = nuclearPed(2, father = "FA")
m1 = marker(x, "3" = 1, "4" = 1:2, alleles = 1:2, name = "m1")
m2 = marker(x, "3" = 1, "4" = 2, alleles = 1:2, name = "m2")
m3 = marker(x, "3" = 2, "4" = 2, alleles = 1:2, name = "m3")
x = setMarkers(x, list(m1, m2, m3))

rankProfiles(x, "FA")
#> m1
#>       1/1       1/2 
#> 0.3333333 0.6666667 
#> 
#> m2
#> 1/2 
#>   1 
#> 
#> m3
#>       1/2       2/2 
#> 0.3333333 0.6666667
#>
#>    m1  m2  m3 Posterior
#> 1 1/2 1/2 2/2 0.4444444
#> 2 1/2 1/2 1/2 0.2222222
#> 3 1/1 1/2 2/2 0.2222222
#> 4 1/1 1/2 1/2 0.1111111

Created on 2019-11-11 by the reprex package (v0.3.0)

thoree commented 4 years ago

Thanks, I will try some examples and get back to you!