Open hyanwong opened 9 months ago
If it's just text we can happily support in tskit itself.
It's a bit annoying that the output consists of two text files, which is which I thought tsconvert
might be better. Here's a first stab at a conversion routine. It's not extensively tested, though, and there's some faffing with metadata that might be considered unnecessary (but was useful e.g. for plotting Relate inferences from the Kreitman dataset).
def ts_to_haps_sample(ts, haps_output, sample_output, chromosome_number=1, sample_name_field="name"):
"""
Output the tree sequence as in haps / sample format as required by Relate and ARGneedle
(see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample)
``haps_output`` and ``sample_output`` are the filehandles to which the data will be written.
To obtain either as strings, you can pass an io.StringIO object here.
``sample_name_field`` gives the metadata field in which to look up names to use in the
output sample file. Where possible, names are taken from the associated individual metadata.
If samples are not associated with individuals (i.e. this is haploid data), then
names are taken from node metadata. If no ``sample_name_field`` is present in the metadata,
the names used are "Individual_N" if samples are associated with individuals, or "Sample_N"
otherwise.
Returns an array of the site_ids that were written to the haps file (sites
with 1 allele or > 2 alleles are skipped)
.. example::
with open("out.haps", "wt") as haps, open("out.sample", "wt") as sample:
ts_to_haps_sample(ts, haps, sample)
"""
used = np.zeros(ts.num_sites, dtype=bool)
for v in ts.variants():
if len(v.alleles) == 1:
continue
if len(v.alleles) > 2:
print(f"Multialleic site ({v.alleles}) at position {v.site.position} ignored")
continue
used[v.site.id] = True
print(
str(chromosome_number),
f"SNP{v.site.id}",
int(v.site.position),
v.alleles[0],
v.alleles[1],
" ".join([str(g) for g in v.genotypes]),
sep=" ",
file=haps_output,
)
print("ID_1 ID_2 missing", file=sample_output)
print("0 0 0", file=sample_output)
individuals = ts.nodes_individual[ts.samples()]
if np.all(individuals == tskit.NULL):
# No individuals, just use node metadata
pass
else:
if np.any(individuals == tskit.NULL):
raise ValueError("Some samples have no individuals")
_, counts = np.unique(individuals, return_counts=True)
if np.all(counts == 2):
if np.any(np.diff(individuals)[0::2]) != 0:
ValueError("Pairs of adjacent samples must come from the same individual")
elif np.all(counts == 1):
pass
else:
raise ValueError("Must have all diploid or all haploid samples")
samples = ts.samples()
i=0
while i < len(samples):
ind1 = ts.node(samples[i]).individual
ind2 = tskit.NULL
if ind1 == tskit.NULL:
try:
name = ts.node(samples[i]).metadata[sample_name_field].replace(" ", "_")
except (TypeError, KeyError):
name = f"Sample_{samples[i]}"
else:
try:
name = ts.individual(ind1).metadata[sample_name_field].replace(" ", "_")
except (TypeError, KeyError):
name = f"Individual_{ind1}"
try:
ind2 = ts.node(samples[i+1]).individual
except IndexError:
pass
if ind2 == tskit.NULL or ind2 != ind1:
print(f'{name} NA 0', file=sample_output)
i += 1
else:
print(f'{name} {name} 0', file=sample_output)
i += 2
return np.where(used)[0].astype(ts.mutations_site.dtype)
Yep, pass in the file handles directly, and leave out the metadata stuff (follow the existing conventions for vcf and other text formats for generating sample names), and it should be fine for tskit.
Both Relate and ARG-needle read data in "Oxford phased haplotype file" (.haps + .sample) format, as described at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample (also see https://www.cog-genomics.org/plink/2.0/formats#haps). This could be a useful output format for a tree sequence, to supplement VCF?
Conversion can apparently be done from VCF using plink or one of the Relate utilities:
or
(see https://myersgroup.github.io/relate/input_data.html#ConvertToFileFormat)