magnusdv / pedtools

Tools for working with pedigrees in R
GNU General Public License v3.0
23 stars 3 forks source link

`setFrequencyDatabase` and marker name #21

Closed thoree closed 4 years ago

thoree commented 5 years ago

I did not manage to get the variable names right (below files are in https://github.com/thoree/exclusion/tree/master/magnus and I used reprex_addin() for the below):

library(pedtools) x = readPed("pedfile.ped", sep ="/") x = setFrequencyDatabase(x, database = "dbfile.txt")

> M1 M2

> 1 0.5 NA

> 2 0.5 0.8

> 3 NA 0.2

> Error: Unknown marker name: M1, M2

Created on 2019-08-09 by the reprex package (v0.3.0)

magnusdv commented 5 years ago

Thanks for reporting this - it revealed a couple of intricate bugs in the handling of marker names when converting from data.frame into ped objects.

Your example works fine now, so I'm closing this issue. But there may be other surprises here, more testing is needed.

thoree commented 5 years ago

Great new functionality! Previous example is now fine. With one column for each allele, my understanding is that what goes before "." becomes the marker name. Fine! readPed seems to accept identical marker names sometimes, but sometimes make sensible change: for sep = "/" marker names are changed to obtain difference:

setwd("C:\\git\\exclusion\\magnus")
library(pedtools)
x = readPed("pedfile2.ped")
x
#>  id fid mid sex  M1  M1
#>   1   *   *   1 1/1 2/3
#>   2   *   *   2 -/- -/-
#>   3   1   2   2 -/- -/-
x = readPed("pedfile3.ped", sep = "/")
x
#>  id fid mid sex  M1 M1.1
#>   1   *   *   1 1/1  2/3
#>   2   *   *   2 -/-  -/-
#>   3   1   2   2 -/-  -/-

Created on 2019-08-10 by the reprex package (v0.3.0)

magnusdv commented 5 years ago

Hmm, perhaps we should be stricter about this.

When marker names are included in the pedfile, I cannot imagine a scenario where duplicated names would be OK. I suggest to give an error in this case.

Agree?

thoree commented 5 years ago

Agree!

thoree commented 5 years ago

Here's an Argentina example that works nicely. I had to change marker names 'Penta D' and 'Penta _E' to 'Penta.D' and 'Penta.E' in the uploaded files. Below data is read and a mutation model introduced

setwd("C:\\git\\exclusion\\magnus")
library(pedprobr,quietly = TRUE)
x = readPed("x.ped", sep="/")
x = setFrequencyDatabase(x, database = "db.txt")
likelihood(x,2) # = 0 because of likely mutation
#> [1] 0
x = setLocusAttributes(x, markers = 2, loc = list(mutmod = "proportional", rate = 0.05))
likelihood(x,2)
#> [1] 1.298405e-07

Created on 2019-08-10 by the reprex package (v0.3.0)

magnusdv commented 5 years ago

Ok, I've done some changes now, and have a few clarifications.

1. Marker names when alleles are in separate columns The rule is as follows: The names of odd-numbered columns are used, and everything from the last period is deleted.

Example:

library(pedtools)
df = cbind(as.data.frame(singleton(1)), Penta.D.1 = 1, Penta.D.2 = 2)
df
#>   id fid mid sex Penta.D.1 Penta.D.2
#> 1  1   0   0   1         1         2

as.ped(df)
#>  id fid mid sex Penta.D
#>   1   *   *   1     1/2

2. "Fixing" marker names in readPed() Nothing should ever be changed with the input, For example, the space in "Penta E" should remain (although such names should of course be avoided in general). I have now added check.names = F in the call to read.table to ensure this.

3. Duplicated marker names This should give an error, and is now checked for in readPed():

library(pedtools)
readPed("https://raw.githubusercontent.com/thoree/exclusion/master/magnus/pedfile3.ped", sep = "/")
#> Error: Duplicated marker name(s): M1