Open xiekunwhy opened 3 years ago
The map file is the same format as required by the original tool: there is an example in the fixtures
folder: https://github.com/hardingnj/xpclr/blob/master/fixture/mapfile.snp
The --gdistkey is used when reading from VCF or zarr (via scikit-allele vcf2zarr). For example you may have a key "gen_dist" that is in the INFO section.
Hey,
I have added the "gen_dist" in the INFO section of my VCF file, but it still said "No genetic distance provided; using rrate of 1e-08/bp". Do you know what's wrong with my VCF? my cmd is:
xpclr -Sa ../a.txt -Sb ../b.txt -C A01 -I ../bra_only_phased_gendist.vcf -O a_A01.txt --phased --gdistkey gen_dist
and my vcf looks like:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sam1 sam2 sam3
A01 2977 A01:2977 C T . PASS gen_dist=0.111 GT 0|0 0|0 0|1
Hey all,
I ran into the same issue. It looks like gdistkey is set to None
when the VCF is loaded, regardless of what key is passed. In other words, the load_vcf_format_data()
function in util.py
doesn't use the gdistkey passed at the command line, unlike the hdf5 and zarr loading.
Yes, I found this too. I change util.py a little bit, so the script can load gen_dist from the column QUAL:
def load_vcf_wrapper(path, seqid, samples, samples_path):
callset = allel.read_vcf(
path,
region=seqid,
fields=['variants/POS', 'variants/QUAL','calldata/GT', 'samples'],
tabix="tabix",
samples=samples)
assert "samples" in callset.keys(), "None of the samples provided in {0!r} are found in {1!r}".format(
samples_path, path)
p = allel.SortedIndex(callset["variants/POS"])
g = allel.GenotypeArray(callset['calldata/GT'])
m = allel.SortedIndex(callset["variants/QUAL"])
return p, g, m
def load_vcf_format_data(vcf_fn, chrom, s1, s2, gdistkey=None):
# geno1, geno2, pos = q, q, q
samples1 = get_sample_ids(s1)
samples2 = get_sample_ids(s2)
pos1, geno1, cm1 = load_vcf_wrapper(vcf_fn, chrom, samples1, s1)
pos2, geno2, cm2 = load_vcf_wrapper(vcf_fn, chrom, samples2, s2)
assert np.array_equal(pos1, pos2), "POS fields not the same"
assert geno1.shape[0] == pos1.shape[0], "For samples 1, genotypes do not match positions"
assert geno2.shape[0] == pos2.shape[0], "For samples 2, genotypes do not match positions"
assert geno1.shape[1] == len(samples1)
assert geno2.shape[1] == len(samples2)
if gdistkey is not None:
gdist = cm1
else:
gdist = None
return geno1, geno2, pos1, cm1
Thanks all. Agree this seems like a bug- should be fixable by incorporating the code from the load hdf5 function into the VCF function.
Hi,
Could you explain what is --gdistkey and provide an example format? And what is the difference between --gdistkey file and --map file?
Best, Kun