tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
172 stars 84 forks source link

Text file format for pedigree data input. #1915

Closed jeromekelleher closed 2 years ago

jeromekelleher commented 2 years ago

As discussed in #1913, the standard text files like plink's FAM don't contain enough information and we need some custom file format to specify our pedigrees.

@LukeAndersonTrocme can you give some examples and maybe sketch out what you'd like to have as the text file format input?

LukeAndersonTrocme commented 2 years ago

I may have to follow up on the first part once I do a little bit of poking around to see what I can find regarding examples of standard fam formats with extra columns attached. The genealogies that I have worked with or seen in other papers often resemble fam formats with the first three columns being individual and parental IDs, some have sex and some space-time metadata.

As far as what I'd like to see in a text format, I would imagine that something containing the bare minimum information needed for the simulations would be a good starting place. (on that note, please let me know if I'm missing anything here). Here's what I have in mind:

IID, MID, FID, TIME, IS_SAMPLE
1,     3,   4,  0,  1
2,     3,   4,  0,  1
3,    NA,  NA,  1, -1
4,    NA,  NA,  1, -1

As mentioned in #1913, a population column could be useful for simulating beyond the pedigree, but I think this would be best implemented as an advanced feature that can only be done using the API. This has the added benefit of avoiding the need to verify the validity of the population IDs. (unless there are good reasons to include population IDs in this basic format?)

One thought I have is whether we embrace the three column ped format for this vanilla simulator and offer some pre-processing for these users. I suspect that some users may not know how to get the time column. I have a short R function that recursively climbs a tree to find the deepest node depth for each individual and takes as input a three column ped file and a list of probands. (Runs on 1.2 million probands of the French-Canadian genealogy in 15 seconds on my laptop, happy to paste the code over if it's useful).

Just one note about the list of probands, the only reason why I specify probands in my case is because we have "dead-ends" in the French-Canadian genealogy and wanted a minimum date to be considered a present day proband, otherwise we could just automatically consider probands to be all individuals who are not fathers or mothers.

The real advantage with this is that we would start from the basic three column ped file that many people would be used to, and we would add time and is_sample by running this pre-processing step. Of course, users may also specify probands/samples themselves.

edit: Am I right in assuming that all we need apart from the three ped file columns are time and is_sample ?

LukeAndersonTrocme commented 2 years ago

Just following up on the last comments from https://github.com/tskit-dev/msprime/pull/1913#issuecomment-968347694

Completely agree with the issues about lacking column names, and not being able to use a 3-column FAM file. However, as I outlined above, it seems feasible to include a little FAM file parser to include the additional columns needed (and likely useful to include column names to be clear about what's what). This would be useful to those who just want to run their FAM files as is.

One thing that I'm less clear on is the usefulness of adding population IDs. Is this just useful for the founders so we know which population to assign them beyond the known-pedigree? In this case, we may be able to include another column is_founder for individuals without known-parents. This flag could then be used to help users identify which individuals they may want to add population labels to (if they wanted to). Although I still think that population labels for fancy out-of-pedigree demographics would make sense to reserve for the PedigreeBuilder API.

petrelharp commented 2 years ago

One thing that I'm less clear on is the usefulness of adding population IDs.

People may want to do this sometimes, but it's not more useful than other sorts of metadata (birth location; name; sptial coordinates; age at death...). Of course, it's easy to just stick 0 in for the population column, as for phenotype in the FAM files.

And - sorry, I've not paid enough attention - but how exactly is the time field used? The docs say that time is in units of "generations in the past", but in practice this is ill-defined, as different paths up to the same ancestor can easily have different numbers of hops. How do you deal with this, @LukeAndersonTrocme (or do you have to?)? I'm guessing that what we'd really like for time is the individual's birth time, in units of (mean generation time) ago; and in practice what the times mean is that the number of mutations separating offspring from parents is proportional to the time difference?

jeromekelleher commented 2 years ago

And - sorry, I've not paid enough attention - but how exactly is the time field used?

This is the time of the two nodes associated with the individual @petrelharp. It's used to (a) sort the individuals and (b) decide when stuff gets added in to the simulation. (We originally used topogical sort, but then had to sort the edges by time later anyway to make a valid ts, so it's simpler to sort by time up front.) It's far simpler for the msprime simulation code if time is a fixed thing, so for now we're making it a required input.

jeromekelleher commented 2 years ago

Thanks for the input @LukeAndersonTrocme. I want to decouple this from time imputation because (a) there will be cases where we do know the times, and we want to support this; and (b) it's a non-trivial problem, which we don't need to solve for the first version. (Of course it's useful having an implementation that works for your application, but we have to think hard about what would be the right thing to do more generally, and have to think through all the corner cases where it might fail, write tests for these, etc etc etc.)

I've made the issue #1916 for time imputation which hopefully we can do as a follow up once we have the basics working.

jeromekelleher commented 2 years ago

Would you mind digging out a few examples of the pedigrees that you've seen @LukeAndersonTrocme? I think that'll help us a lot trying to make a file format that's familiar (hah!) for people, and allows them to create input data with the minimal amount of file hacking.

(Don't post anything confidential, obvs!)

LukeAndersonTrocme commented 2 years ago

other sorts of metadata (birth location; name; sptial coordinates; age at death...)

Is there really a need to handle metadata? I have lots of metadata for the pedigree I'm currently working on, but I don't usually keep track of it while running pedigree based statistics. In these cases I tend to work with just the three columns (IID, MID, FID). It's easy to link to the metadata later with some information being specific to an individual (marriage date, marriage location) and other information specific to locations and regions.

Suggestion: let the user define the set of probands at generation 0. This information can be used to initialize the tree climbing algorithms that define generation time. This is also useful for working with a subset of individuals (e.g. the ascending genealogy form all probands in town_A).

different paths up to the same ancestor can easily have different numbers of hops. How do you deal with this?

Given a set of probands, we use the maximum genealogical depth for each historical individual. For our purposes we wanted to ensure that we traversed all children before reaching their parents. (code here: https://github.com/tskit-dev/msprime/issues/1916#issuecomment-968924261 )

(a) there will be cases where we do know the times, and we want to support this

For example, the French-Canadian genealogy has two times for each individual: marriage date and parental marriage date

(b) it's a non-trivial problem

Indeed, and the definition we're working with is just to help us sort the tree (backwards in time). Would this definition be useful for the msprime simulation code? Or would this need to be a forwards in time definition?

LukeAndersonTrocme commented 2 years ago

Would you mind digging out a few examples of the pedigrees that you've seen @LukeAndersonTrocme? I think that'll help us a lot trying to make a file format that's familiar (hah!) for people, and allows them to create input data with the minimal amount of file hacking.

(Don't post anything confidential, obvs!)

Starting with the BALSAC genealogy because this is the one I've been working with the most. It comes with a few extra columns of metadata.

Here's a toy example (I've manually changed these values so its not confidential)

ind,father,mother,sex,datemp,saisonmp,lieump,geo_mp,datem,saisonm,lieum,geo_m
2999,1368,1450,2,1871,2,1631,POINT(-73.3522 46.0952),1796,2,1269,POINT(-73.5606 45.5426)
3081,8780,8772,1,1866,1,1269,POINT(-73.5606 45.5426),1798,1,1269,POINT(-73.5606 45.5426)
3073,87877,7869,2,1882,1,1269,POINT(-73.5606 45.5426),1798,1,1269,POINT(-73.5606 45.5426)
3065,49212,49204,1,1846,2,1025,POINT(-72.43 46.3338),1783,1,1269,POINT(-73.5606 45.5426)
3057,6846,6838,2,1839,3,1269,POINT(-73.5606 45.5426),1783,1,1269,POINT(-73.5606 45.5426)

ind,father,mother,sex are fairly standard columns, where each individual ind has a mother and father who are also in the ind column.

datemp,saisonmp,lieump,geo_mp these are the metadata columns for the parents (mp refers to parental marriage) where date, saison, lieu and geo refer to the marriage date, marriage season, marriage location_ID and marriage location lat/lon respectively. The same set of columns are then repeated for the individuals marriage (datem,saisonm,lieum,geo_m)

Over the course of my project I've added more metadata to this, mostly pertaining to the geographical information (subregion, region, super-region) but as I mentioned in my previous comment, I don't need all the metadata all the time. Especially if I have a key to link back to.

(I'll post a separate comment about genealogies I've seen in other papers)

LukeAndersonTrocme commented 2 years ago

The first example that came to mind was [this paper] that uses a very basic definition. (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008638#pcbi.1008638.s005) image

Then I took a look at this book about biobanks that has a section about genealogies (Ch. 10-11). I'll note that I think this information is from the perspective of biobanking (i.e information collecting) rather than from the perspective of research. I would expect that some of this information would not be privy to research scientists.

Here's the contents of the Swedish Multi-generation Register:

National registration number of index person 
Sex of index person
Country of birth of index person 
Registration number of biological father 
Date of birth of biological father
Country of birth of biological father 
Registration number of biological mother 
Date of birth of biological mother 
Country of birth of biological mother 
Child’s position in the mother’s family 
Number of children – mother
Number of children – father 
Registration number of adoptive father 
Date of birth of adoptive father 
Country of birth of adoptive father 
Registration number of adoptive mother 
Date of birth of adoptive mother 
Country of birth of adoptive mother 
Date of immigration
Date of adoption

And here's the Icelandic Genealogy Database :

Date of birth (yymmdd) 
Place of birth
Twinning
Legitimacy
Date of death
Identification number of father 
Date of birth of father (yymmdd) 
Name of father
Identification number of mother 
Date of birth of mother (yymmdd) 
Name of mother
Note
Other identification
Unique personal identification number 
Name
Gender

In Conclusion

I still feel like handling metadata is not necessary and may be confusing for the average user. My opinion would be to restrict the simulation to only accept (a) a three column pedigree and (b) an input set of probands. Users may want to use metadata before the simulations (to identify a specific set of probands as input) or afterwards (to compare the simulated genomes of sets of individuals). But AFAIK, both these cases don't require the need to handle metadata inside the msprime simulations.

gregorgorjanc commented 2 years ago

Barebones pedigree is often stored as

IID FID MID

sometimes FAMILY, SEX, BIRTH_DATE, BIRTH_LOCATION, POPULATION, etc. are provided in quite different formats (most don't vary much), so I would not worry about handling them - it's too big of a job and you only want to provide a way to pass a pedigree to msprime, so people should be able to cast their pedigree into a minimal format for msprime.

jeromekelleher commented 2 years ago

OK, how about

# IID FID MID TIME POPULATION METADATA
gregorgorjanc commented 2 years ago

LGTM!

jeromekelleher commented 2 years ago

OK, I'll code this up when I get a chance.

jeromekelleher commented 2 years ago

Closed in #1915. Settled on a text input which is a simplified version of the full input. Example:

# id parent0 parent1 time is_sample
0   5   4   0.0 1
1   3   3   0.0 1
2   3   4   0.0 1
3   7   7   1.0 0
4   8   8   1.0 0
5   8   6   1.0 0
6   9   10  2.0 0
7   10  10  2.0 0
8   9   9   2.0 0
9   11  12  3.0 0
10  11  12  3.0 0
11  NA  NA  4.0 0
12  NA  NA  4.0 0