sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Write family information in `write_plink` #1010

Open tomwhite opened 1 year ago

tomwhite commented 1 year ago

If the dataset has `sample_family_id`, `sample_paternal_id`, and `sample_maternal_id` fields (e.g. from `read_plink`), then we can use those to write family information in `write_plink`. (See https://www.cog-genomics.org/plink/1.9/formats#fam)

Otherwise we should set FID to "0" (missing) and IID to sample_id. The father and mother IDs should either be set to missing, or set from the parent_id variable if it is present.
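A minimal sketch of that fallback, using plain Python lists rather than a real dataset. The helper `fam_rows` is hypothetical and not part of sgkit; it writes sex as `0` (unknown) and phenotype as `-9` (missing), which is an assumption made only to fill the six `.fam` columns. It does not read `parent_id`, since the parent ordering there is the open question.

```python
# Hypothetical helper (not an sgkit API): build PLINK .fam lines from plain lists.
FAM_MISSING = "0"  # PLINK code for a missing FID / father / mother


def fam_rows(sample_id, family_id=None, paternal_id=None, maternal_id=None):
    """Return one space-delimited .fam line per sample.

    Any field that is not supplied falls back to the missing code "0".
    Sex ("0") and phenotype ("-9") are always written as missing here.
    """
    n = len(sample_id)

    def or_missing(values):
        # Use the supplied values, or a column of "0" when the field is absent.
        return list(values) if values is not None else [FAM_MISSING] * n

    fids = or_missing(family_id)
    pats = or_missing(paternal_id)
    mats = or_missing(maternal_id)

    return [
        " ".join([fids[i], sample_id[i], pats[i], mats[i], "0", "-9"])
        for i in range(n)
    ]


# No family fields: FID and both parents are written as missing.
print(fam_rows(["S1", "S2"]))

# Family fields present (e.g. after a PLINK round trip): they are used as-is.
print(fam_rows(["C1"], family_id=["F1"], paternal_id=["P1"], maternal_id=["M1"]))
```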

(Or can we not assume anything about paternal/maternal ordering in the parent_id variable? Thoughts @timothymillar, @jeromekelleher?)

jeromekelleher commented 1 year ago

This is tricky. I don't think we thought much about interoperability with PLINK when designing the pedigree encoding, @timothymillar?

timothymillar commented 1 year ago

> Or can we not assume anything about paternal/maternal ordering in the parent_id variable

In general, no, but the intent was that these can be specified using coordinates when appropriate. So you could use the parent_id array, on the condition that appropriate coords are set.
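A rough sketch of that idea with plain Python lists standing in for the `parent_id` array and its coordinate. The function name `parents_from_labels`, the label values `"paternal"` and `"maternal"`, and the `"."` missing marker are all assumptions for illustration; nothing here is an existing sgkit API.

```python
# Hypothetical sketch: pick father/mother columns out of parent_id by label.
def parents_from_labels(parent_id, parent_labels=None, missing=".", fam_missing="0"):
    """Return (paternal_ids, maternal_ids) suitable for a .fam file.

    parent_id     -- one row per sample, one column per parent
    parent_labels -- labels for the parent columns (the "coords"), or None
    If the labels do not name both a paternal and a maternal column,
    no ordering is assumed and both parents are written as missing.
    """
    n = len(parent_id)
    if parent_labels is None or not {"paternal", "maternal"} <= set(parent_labels):
        return [fam_missing] * n, [fam_missing] * n

    pat_col = list(parent_labels).index("paternal")
    mat_col = list(parent_labels).index("maternal")

    def clean(value):
        # Translate the assumed missing marker into the .fam missing code.
        return fam_missing if value == missing else value

    paternal = [clean(row[pat_col]) for row in parent_id]
    maternal = [clean(row[mat_col]) for row in parent_id]
    return paternal, maternal


parent_id = [[".", "."], ["M1", "F1"]]

# Labels present: column order does not matter, the labels decide.
print(parents_from_labels(parent_id, ["maternal", "paternal"]))

# Labels absent: nothing is assumed about ordering.
print(parents_from_labels(parent_id))
```

The point of the design is that column position is never trusted; only explicit labels decide which parent goes into which `.fam` column.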

tomwhite commented 1 year ago

Thanks @timothymillar - that's a good suggestion.