uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools
71 stars 14 forks source link

block_size_issues #47

Closed borletti closed 11 months ago

borletti commented 11 months ago

Hi, I am exploring this useful package admixtools2, I have a problem/doubt when I run extract_f2 all work well, but when I read my genotype files with f2_from_geno the number of blocks is just one. I have used blgsize in the two ways: 0.05 and 4000000. But the number of blocks is equal to 1. So, I can work with one block, I think that something that I am doing is wrong. Please, I will appreciate any advice.

genotype_data = "/media/hd1/danielrivadeneira/UCES_PhD/6-riv-snps/30_admix2/admix2.all.fs.bi.het.md100.mss75.ld50_10_05.rename.sort" f2_dir = "/media/hd1/danielrivadeneira/UCES_PhD/6-riv-snps/30_admix2/fstats/" extract_f2(genotype_data, f2_dir,minac2 = 2, adjust_pseudohaploid = T) Reading allele frequencies from EIGENSTRAT files... admix2.all.fs.bi.het.md100.mss75.ld50_10_05.rename.sort.geno has 41 samples and 23574 SNPs Calculating allele frequencies from 41 samples in 7 populations ! 22916 SNPs remain after filtering. 22916 are polymorphic. Allele frequency matrix for 22916 SNPs and 7 populations is 4 MB Computing pairwise f2 for all SNPs and population pairs requires 51 MB RAM without splitting omputing without splitting since 51 < 8000 (maxmem)... Data written to /media/hd1/danielrivadeneira/UCES_PhD/6-riv-snps/30_admix2/fstats/ f2_blocks = f2_from_precomp(f2_dir) Reading precomputed data for 7 populations... Reading f2 data for pair 28 out of 28... dim(f2_blocks) [1] 7 7 1

uqrmaie1 commented 11 months ago

I suspect that the .bim or .snp file is not formatted as expected. The 3rd column should have genetic distance in Morgan units. The 4th column should have genetic positions in base pair units, although that column will only be used if blgsize is 100 or greater. If you can share the first and last few lines from the .bim or .snp file, I might be able to see what the problem is.

borletti commented 11 months ago

Hi Robert, thank you so much for replying. I understand, I do not have a genetic map, so I am using of 4,000,000 as in the article mentioned. Please, I attach the lines of <.bim> file. Thank you. head: 1 6587_145 0 145 A T 1 6587_155 0 155 T A 1 6587_170 0 170 T C 1 6587_236 0 236 T G 1 6587_245 0 245 C G 1 6587_277 0 277 G A 1 6587_286 0 286 C T 1 6587_333 0 333 T G 1 6587_352 0 352 A T 1 6587_353 0 353 A G tail: 1 4770_39 0 39 A G 1 4770_113 0 113 A T 1 4770_145 0 145 T C 1 4770_217 0 217 G T 1 5512_150 0 150 T A 1 5512_162 0 162 G A 1 5512_192 0 192 G A 1 4757_153 0 153 A T 1 4757_163 0 163 A G 1 4757_176 0 176 G A

uqrmaie1 commented 11 months ago

There are a couple of things that are unusual here:

  1. The file is not ordered by position. The algorithm for splitting the data into blocks requires the variants to be ordered first by chromosome, and second by position.
  2. The combination of chromosome and position is not unique (6587_145 and 4770_145 are both chr 1, pos 145). If those are multiallelic SNPs, you should keep only one alt allele per site. Although based on the IDs, it doesn't look like this is the case here.
  3. The base pair numbers appear to be too small to be individual base pairs. If you set the block size to 4,000,000 and all values in column 4 are less than that, then you would expect to only get a single block.
  4. It looks like there is only a single chromosome. This shouldn't cause a problem, but it's unusual, and it helps explain why you get a single block.

What organism is this? How was the file generated?

borletti commented 11 months ago

Thank you so much for detailed explanation. I am working with SNPs which were extracted from UCEs from non-model organism (frogs). The SNPs are located in contigs/scaffolds, so I have non-standard chromosome names. I was replacing the first column by 1s, to be read by admixtools2. My data has 23,574 SNPs with ~1,249 contigs. I am using to generate the input:
plink --vcf admix2.all.fs.bi.het.md100.mss75.ld50_10_05.rename.sort.vcf --make-bed --allow-extra-chr --double-id --out output

What do you think to use function instead of ? The head of my data without any change and order: uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:145 0 145 A T uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:155 0 155 T A uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:170 0 170 T C uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:236 0 236 T G uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:245 0 245 C G uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:277 0 277 G A uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:286 0 286 C T uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:333 0 333 T G uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:352 0 352 A T uce-6587_olivaceus_20_castelo uce-6587_olivaceus_20_castelo:353 0 353 A G tail: uce-4770_olivaceus_20_castelo uce-4770_olivaceus_20_castelo:39 0 39 A G uce-4770_olivaceus_20_castelo uce-4770_olivaceus_20_castelo:113 0 113 A T uce-4770_olivaceus_20_castelo uce-4770_olivaceus_20_castelo:145 0 145 T C uce-4770_olivaceus_20_castelo uce-4770_olivaceus_20_castelo:217 0 217 G T uce-5512_olivaceus_20_castelo uce-5512_olivaceus_20_castelo:150 0 150 T A uce-5512_olivaceus_20_castelo uce-5512_olivaceus_20_castelo:162 0 162 G A uce-5512_olivaceus_20_castelo uce-5512_olivaceus_20_castelo:192 0 192 G A uce-4757_olivaceus_20_castelo uce-4757_olivaceus_20_castelo:153 0 153 A T uce-4757_olivaceus_20_castelo uce-4757_olivaceus_20_castelo:163 0 163 A G uce-4757_olivaceus_20_castelo uce-4757_olivaceus_20_castelo:176 0 176 G A

Thank you so much for your help.

uqrmaie1 commented 11 months ago

If you replace each contig ID by a different number (1 to 1249), and set auto_only = FALSE in extract_f2(), then you should get at least as many blocks as contigs. extract_afs() calculates allele frequencies for each population. That can be interesting in itself, and in certain cases extract_f2() uses the extract_afs() function, but the functions that you need to call to get the data that's needed for subsequent Admixtools 2 analyses (like qpadm() or qpgraph()) are either extract_f2() or f2_from_geno(); or you can skip this step and read the data from the genotype files each time.

borletti commented 11 months ago

Thank you so much, that worked! I am going to close this thread.