tskit-dev / tsconvert

Utilities for converting tree sequences to and from other formats
MIT License
7 stars 6 forks source link

Oxford phased haplotype file (.haps) + .sample export #55

Open hyanwong opened 9 months ago

hyanwong commented 9 months ago

Both Relate and ARG-needle read data in "Oxford phased haplotype file" (.haps + .sample) format, as described at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample (also see https://www.cog-genomics.org/plink/2.0/formats#haps). This could be a useful output format for a tree sequence, to supplement VCF?

Conversion can apparently be done from VCF using plink or one of the Relate utilities:

plink2 --vcf phased.vcf --export haps --out new_filename_prefix

or

PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode ConvertFromVcf \
                 --haps example.haps \
                 --sample example.sample \
                 -i example 

(see https://myersgroup.github.io/relate/input_data.html#ConvertToFileFormat)

jeromekelleher commented 9 months ago

If it's just text we can happily support in tskit itself.

hyanwong commented 9 months ago

It's a bit annoying that the output consists of two text files, which is which I thought tsconvert might be better. Here's a first stab at a conversion routine. It's not extensively tested, though, and there's some faffing with metadata that might be considered unnecessary (but was useful e.g. for plotting Relate inferences from the Kreitman dataset).

def ts_to_haps_sample(ts, haps_output, sample_output, chromosome_number=1, sample_name_field="name"):
    """
    Output the tree sequence as in haps / sample format as required by Relate and ARGneedle
    (see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample)

    ``haps_output`` and ``sample_output`` are the filehandles to which the data will be written.
    To obtain either as strings, you can pass an io.StringIO object here.

    ``sample_name_field`` gives the metadata field in which to look up names to use in the
    output sample file. Where possible, names are taken from the associated individual metadata.
    If samples are not associated with individuals (i.e. this is haploid data), then
    names are taken from node metadata. If no ``sample_name_field`` is present in the metadata,
    the names used are "Individual_N" if samples are associated with individuals, or "Sample_N"
    otherwise.

    Returns an array of the site_ids that were written to the haps file (sites
    with 1 allele or > 2 alleles are skipped)

    .. example::

        with open("out.haps", "wt") as haps, open("out.sample", "wt") as sample:
            ts_to_haps_sample(ts, haps, sample)
    """
    used = np.zeros(ts.num_sites, dtype=bool)
    for v in ts.variants():
        if len(v.alleles) == 1:
            continue
        if len(v.alleles) > 2:
            print(f"Multialleic site ({v.alleles}) at position {v.site.position} ignored")
            continue
        used[v.site.id] = True
        print(
            str(chromosome_number),
            f"SNP{v.site.id}",
            int(v.site.position),
            v.alleles[0],
            v.alleles[1],
            " ".join([str(g) for g in v.genotypes]),
            sep=" ",
            file=haps_output,
        )

    print("ID_1 ID_2 missing", file=sample_output)
    print("0    0    0", file=sample_output)
    individuals = ts.nodes_individual[ts.samples()]
    if np.all(individuals == tskit.NULL):
        # No individuals, just use node metadata
        pass
    else:
        if np.any(individuals == tskit.NULL):
            raise ValueError("Some samples have no individuals")
        _, counts = np.unique(individuals, return_counts=True)
        if np.all(counts == 2):
            if np.any(np.diff(individuals)[0::2]) != 0:
                ValueError("Pairs of adjacent samples must come from the same individual")
        elif np.all(counts == 1):
            pass
        else:
            raise ValueError("Must have all diploid or all haploid samples")
    samples = ts.samples()
    i=0
    while i < len(samples):
        ind1 = ts.node(samples[i]).individual
        ind2 = tskit.NULL
        if ind1 == tskit.NULL:
            try:
                name = ts.node(samples[i]).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Sample_{samples[i]}"
        else:
            try:
                name = ts.individual(ind1).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Individual_{ind1}"
            try:
                ind2 = ts.node(samples[i+1]).individual
            except IndexError:
                pass
        if ind2 == tskit.NULL or ind2 != ind1:
            print(f'{name} NA 0', file=sample_output)
            i += 1
        else:
            print(f'{name} {name} 0', file=sample_output)
            i += 2
    return np.where(used)[0].astype(ts.mutations_site.dtype)
jeromekelleher commented 9 months ago

Yep, pass in the file handles directly, and leave out the metadata stuff (follow the existing conventions for vcf and other text formats for generating sample names), and it should be fine for tskit.