Distance matrix ? #4 (lz398/distAngsd)

aguilar-gomez commented 2 years ago

What is the output supposed to look like?

I have a bcf.gz file for which I was hoping to calculate distances between 347 individuals. The documentation is not very clear to me, so I ran it like this: distAngsd -method geno -vcf ttc39b.bcf.gz -is2Dinfer 0 -RD 1.0 -e 0.002 -tdiv 1.0 -t1 0.4 -t2 0.25 and as output I just got this:

-> readbcfvcf seek:(null)
-> Done with preliminary parsing of file: we have data for 1 out of 631034 reference sequences
-> Setting iterator to: P_RNA_scaffold_17998
-> [file='ttc39b.bcf.gz'][chr='P_RNA_scaffold_17998'] Read 274 records 274 of which were SNPs. Number of sites used for downstream analysis (MAF >= 0.000000):274
Done reading everything we have nsites:274 for samples:347
1 1 1 1 1 0.25 0.25 0.25 0.25
Model Method Vcf Threading Out_uchar Out_binary
JC geno ttc39b.bcf.gz 0 0 1
vcftest done!
Estimated t = 0.0466514

Is t the distance between the first two individuals of the VCF? Is there a way of calculating distances with this method for many individuals at the same time?

lz398 commented 2 years ago

Thanks a lot, and yes, t is the estimated genetic distance. The software was designed for large genomes (e.g., humans), so at first I was hesitant to run multiple pairs of individuals at the same time; after all, loading multiple pairs requires a huge amount of memory. But I think it is an excellent suggestion, and I will consider adding this feature for smaller-genome analyses in the future.

Lei
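Since each run currently reports a single value, anyone scripting around it needs to pull that value out of the captured output. The log above ends with a line of the form "Estimated t = 0.0466514", so a regular expression is enough. This is only a sketch: it assumes the line keeps the format shown in the log above, and parse_estimated_t is a hypothetical helper name, not something distAngsd provides.

```python
import re

# Matches the final log line shown in this thread, e.g. "Estimated t = 0.0466514".
# Assumption: the wording of this line is the same in other distAngsd versions.
_T_LINE = re.compile(r"Estimated t\s*=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_estimated_t(log_text):
    """Return the last 'Estimated t' value found in log_text, or None if absent."""
    matches = _T_LINE.findall(log_text)
    return float(matches[-1]) if matches else None


# Tail of the log quoted above, used here as a stand-in for captured output.
example_log = "vcftest done!\nEstimated t = 0.0466514\n"
print(parse_estimated_t(example_log))
```

Returning None when the line is missing lets a wrapper script flag failed runs instead of silently recording a wrong distance.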

MoritzBlumer commented 2 years ago

Hello Lei, I am also very keen to try out distAngsd on a larger dataset (> 2,000 vertebrate genomes), and am running into the same problem. The VCF I'm working on is split by chromosome and contains all samples. Is it possible to calculate all possible pairwise distances between the included samples at once (like with ngsDist)? Or is it necessary to generate a pairwise VCF for every possible combination and then run distAngsd on each of these? From the above context, I'm not sure whether a previous comment with a solution was deleted, or whether this is currently not possible. Best wishes, Moritz

lz398 commented 2 years ago

Hi Moritz, distAngsd has the potential to do this. The critical problem was loading multiple quite large genomes (our target for this project was to analyze human genomes), which may occupy quite a lot of memory. I would like to add this feature in a month or so. But if you are in a hurry, I suggest writing a parallel shell script for a quick answer. Lei
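One possible shape for such a script, written here in Python rather than shell, is to enumerate every unordered pair of samples, extract each pair with bcftools, and run distAngsd on the result. The sketch below only builds the command lines and runs nothing. Its assumptions: the distAngsd options are copied verbatim from the command in the first post (whether all of them are needed for VCF input is not established in this thread), the function name and the "pairs" output directory are hypothetical, and sample names contain no commas or slashes.

```python
from itertools import combinations


def pairwise_commands(samples, vcf, outdir="pairs"):
    """Yield ((a, b), extract_cmd, dist_cmd) for every unordered sample pair."""
    for a, b in combinations(samples, 2):
        pair_bcf = f"{outdir}/{a}__{b}.bcf.gz"
        # bcftools view -s keeps only the two named samples; -Ob writes compressed BCF.
        extract_cmd = ["bcftools", "view", "-s", f"{a},{b}",
                       "-Ob", "-o", pair_bcf, vcf]
        # Options copied from the first post in this thread (an assumption that
        # they are appropriate for every pair).
        dist_cmd = ["distAngsd", "-method", "geno", "-vcf", pair_bcf,
                    "-is2Dinfer", "0", "-RD", "1.0", "-e", "0.002",
                    "-tdiv", "1.0", "-t1", "0.4", "-t2", "0.25"]
        yield (a, b), extract_cmd, dist_cmd


# Placeholder sample names; real ones can be listed with `bcftools query -l`.
for pair, extract_cmd, dist_cmd in pairwise_commands(["S1", "S2", "S3"], "all.bcf.gz"):
    print(pair, " ".join(extract_cmd), "&&", " ".join(dist_cmd))
```

The number of pairs grows as n(n-1)/2, so 347 samples give 60,031 runs and 2,000 samples give 1,999,000. Each command list can be passed to subprocess.run, and the runs are independent, so they can be spread over a process pool or a cluster scheduler.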

MoritzBlumer commented 2 years ago

Hello Lei,

many thanks for the prompt response and the explanations! I can see the problem with loading many large genomes. This was the limiting factor in my previous ngsDist approach as well. There, I split the chromosomes into 1 Mb windows and combined the per-window matrices, weighted by the number of variable sites (even though that might not be ideal, because the joint distribution of genotypes is then window-specific rather than genome- or chromosome-wide). But even when operating on such windows, deriving pairwise VCFs from the full VCF for all possible combinations of thousands of samples (using bcftools) is computationally intensive. An alternative might be to add an option to specify two input samples when invoking distAngsd on a multi-sample VCF file. Then the user could easily incorporate distAngsd into a custom parallel script, without the need to extract thousands of pairwise VCFs. In any case, many thanks for providing the software and for being so responsive!

Moritz
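For reference, the site-weighted combination of per-window matrices described above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used with ngsDist: each window is taken to yield a square distance matrix over the same samples in the same order, the weight is that window's count of variable sites, and combine_windows is a hypothetical name.

```python
def combine_windows(matrices, n_sites):
    """Average per-window distance matrices, weighting each by its site count."""
    if len(matrices) != len(n_sites):
        raise ValueError("need exactly one site count per window matrix")
    total = sum(n_sites)
    if total == 0:
        raise ValueError("no variable sites in any window")
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for mat, weight in zip(matrices, n_sites):
        for i in range(n):
            for j in range(n):
                # Each window contributes in proportion to its share of all sites.
                combined[i][j] += mat[i][j] * weight / total
    return combined


# Two made-up windows over the same two samples, with 100 and 300 variable sites.
window_a = [[0.0, 0.02], [0.02, 0.0]]
window_b = [[0.0, 0.04], [0.04, 0.0]]
print(combine_windows([window_a, window_b], [100, 300]))
```

The caveat raised above still applies: the result is an average of window-level estimates, not an estimate from a single genome-wide joint distribution of genotypes.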

TeresaPegan commented 1 year ago

I will just add that I would also love an option to specify which two samples to use, so that I can avoid making thousands of pairwise input files. I would like to use glf files from ANGSD, and I have full-population versions of these files that would be great to use as input if possible, like with ngsDist. I also do sympathize with the large genome memory problem and have also had trouble with that using ngsDist -- I typically run these analyses on one chromosome at a time.

ANGSD commented 1 year ago

Hello, this would be easy to implement if it is not already implemented in our development version. Can you open a GitHub issue about this so we don't forget it?

Thanks!

