hardingnj / xpclr

Code to compute the XP-CLR statistic to infer natural selection
MIT License
87 stars 28 forks source link

The format of gdistkey file? #71

Open xiekunwhy opened 3 years ago

xiekunwhy commented 3 years ago

Hi,

Could you explain what is --gdistkey and provide an example format? And what is the difference between --gdistkey file and --map file?

Best, Kun

hardingnj commented 3 years ago

The map file is the same format as required by the original tool: there is an example in the fixtures folder: https://github.com/hardingnj/xpclr/blob/master/fixture/mapfile.snp

The --gdistkey is used when reading from VCF or zarr (via scikit-allele vcf2zarr). For example you may have a key "gen_dist" that is in the INFO section.

anhong11 commented 2 years ago

Hey,

I have added the "gen_dist" in the INFO section of my VCF file, but it still said "No genetic distance provided; using rrate of 1e-08/bp". Do you know what's wrong with my VCF? my cmd is:

xpclr -Sa ../a.txt -Sb ../b.txt -C A01 -I ../bra_only_phased_gendist.vcf -O a_A01.txt --phased --gdistkey gen_dist

and my vcf looks like:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sam1 sam2 sam3
A01     2977    A01:2977        C       T       .       PASS    gen_dist=0.111     GT      0|0     0|0     0|1 
James-S-Santangelo commented 2 years ago

Hey all,

I ran into the same issue. It looks like gdistkey is set to None when the VCF is loaded, regardless of what key is passed. In other words, the load_vcf_format_data() function in util.py doesn't use the gdistkey passed at the command line, unlike the hdf5 and zarr loading.

anhong11 commented 2 years ago

Yes, I found this too. I change util.py a little bit, so the script can load gen_dist from the column QUAL:

def load_vcf_wrapper(path, seqid, samples, samples_path):

    callset = allel.read_vcf(
        path,
        region=seqid,
        fields=['variants/POS', 'variants/QUAL','calldata/GT', 'samples'],
        tabix="tabix",
        samples=samples)

    assert "samples" in callset.keys(), "None of the samples provided in {0!r} are found in {1!r}".format(
        samples_path, path)

    p = allel.SortedIndex(callset["variants/POS"])
    g = allel.GenotypeArray(callset['calldata/GT'])
    m = allel.SortedIndex(callset["variants/QUAL"])

    return p, g, m

def load_vcf_format_data(vcf_fn, chrom, s1, s2, gdistkey=None):

    #    geno1, geno2, pos = q, q, q
    samples1 = get_sample_ids(s1)
    samples2 = get_sample_ids(s2)
    pos1, geno1, cm1 = load_vcf_wrapper(vcf_fn, chrom, samples1, s1)
    pos2, geno2, cm2 = load_vcf_wrapper(vcf_fn, chrom, samples2, s2)

    assert np.array_equal(pos1, pos2), "POS fields not the same"
    assert geno1.shape[0] == pos1.shape[0], "For samples 1, genotypes do not match positions"
    assert geno2.shape[0] == pos2.shape[0], "For samples 2, genotypes do not match positions"
    assert geno1.shape[1] == len(samples1)
    assert geno2.shape[1] == len(samples2)

    if gdistkey is not None:
        gdist = cm1
    else:
        gdist = None

    return geno1, geno2, pos1, cm1
hardingnj commented 2 years ago

Thanks all. Agree this seems like a bug- should be fixable by incorporating the code from the load hdf5 function into the VCF function.