isaacovercast / easySFS

Effective selection of population size projection for construction of the site frequency spectrum. Convert VCF to dadi/fastsimcoal style SFS for demographic analysis
129 stars 23 forks source link

strange result from 3population Diploid data #3

Closed biozzq closed 7 years ago

biozzq commented 7 years ago

Dear @isaacovercast , Recently I want to prepare input file for running Fastsimcoal2, but I find the SFS entries are all 0, my vcf file likes following (processed by following strategies) 1, filter the sites with maf <= 0.1 2, filter the sites with missing rate >= 0.1 3, filter the sites with abnormal depth > average+3sd or <average-3sd 4, imputation by beagle 4.1 5, filter the SNPs located in repeat region (repeat mask results) 6, filter the SNPs located in genes (exon+intro)

the VCF contains 3 populations, 43, 24 and 112 diploids respectively;

my command like following python easySFS.py -i intergentic.vcf -p all.sep --proj=86,48,224 -o MSF

image

so I do not know why the SFS entries are all 0 ? How the minor allele identified in the vcf, did it identify after computing the global frequency of the two alleles over all samples ? Many thanks!

Best Wishes

isaacovercast commented 7 years ago

SFS entries are all 0 because of the value you are setting for projecting. If you are allowing 10% missing data in the vcf, and then projecting to the full sample size it makes sense why all snps are being removed. The optimal workflow is to first run:

./easySFS -i intergenic.vcf -p pops_file.txt --preview

to establish the "best" values for down projecting each population.

biozzq commented 7 years ago

Dear @isaacovercast I have run the preview model. In addition, my vcf have been processed by beagle which will impute the missing genotypes. so I want to project to the full sample size for all populations. but I am confused about the results of SFS entries. pop1 (2, 5.0) (3, 8.0) (4, 10.0) (5, 11.0) (6, 12.0) (7, 13.0) (8, 14.0) (9, 15.0) (10, 15.0) (11, 16.0) (12, 17.0) (13, 17.0) (14, 18.0) (15, 18.0) (16, 18.0) (17, 19.0) (18, 19.0) (19, 20.0) (20, 20.0) (21, 20.0) (22, 21.0) (23, 21.0) (24, 21.0) (25, 21.0) (26, 22.0) (27, 22.0) (28, 22.0) (29, 22.0) (30, 23.0) (31, 23.0) (32, 23.0) (33, 23.0) (34, 23.0) (35, 24.0) (36, 24.0) (37, 24.0) (38, 24.0) (39, 24.0) (40, 25.0) (41, 25.0) (42, 25.0) (43, 25.0) (44, 25.0) (45, 25.0) (46, 25.0) (47, 25.0) (48, 26.0) (49, 26.0) (50, 26.0) (51, 26.0) (52, 26.0) (53, 26.0) (54, 26.0) (55, 26.0) (56, 26.0) (57, 26.0) (58, 27.0) (59, 27.0) (60, 27.0) (61, 27.0) (62, 27.0) (63, 27.0) (64, 27.0) (65, 27.0) (66, 27.0) (67, 25.0) (68, 25.0) (69, 25.0) (70, 25.0) (71, 20.0) (72, 21.0) (73, 20.0) (74, 20.0) (75, 20.0) (76, 20.0) (77, 18.0) (78, 18.0) (79, 16.0) (80, 16.0) (81, 12.0) (82, 12.0) (83, 4.0) (84, 4.0) (85, 2.0)

pop2 (2, 6.0) (3, 9.0) (4, 11.0) (5, 12.0) (6, 13.0) (7, 14.0) (8, 15.0) (9, 15.0) (10, 16.0) (11, 16.0) (12, 16.0) (13, 17.0) (14, 17.0) (15, 17.0) (16, 17.0) (17, 18.0) (18, 18.0) (19, 18.0) (20, 18.0) (21, 18.0) (22, 18.0) (23, 19.0) (24, 19.0) (25, 19.0) (26, 19.0) (27, 19.0) (28, 19.0) (29, 19.0) (30, 19.0) (31, 19.0) (32, 19.0) (33, 19.0) (34, 20.0) (35, 20.0) (36, 20.0) (37, 19.0) (38, 19.0) (39, 14.0) (40, 14.0) (41, 11.0) (42, 11.0) (43, 7.0) (44, 7.0) (45, 5.0) (46, 5.0) (47, 2.0)

pop3 (2, 6.0) (3, 9.0) (4, 10.0) (5, 12.0) (6, 13.0) (7, 13.0) (8, 14.0) (9, 14.0) (10, 15.0) (11, 15.0) (12, 15.0) (13, 16.0) (14, 16.0) (15, 16.0) (16, 16.0) (17, 17.0) (18, 17.0) (19, 17.0) (20, 17.0) (21, 17.0) (22, 17.0) (23, 17.0) (24, 17.0) (25, 17.0) (26, 17.0) (27, 17.0) (28, 18.0) (29, 18.0) (30, 18.0) (31, 18.0) (32, 18.0) (33, 18.0) (34, 18.0) (35, 18.0) (36, 18.0) (37, 18.0) (38, 18.0) (39, 18.0) (40, 18.0) (41, 18.0) (42, 18.0) (43, 18.0) (44, 18.0) (45, 18.0) (46, 18.0) (47, 18.0) (48, 18.0) (49, 18.0) (50, 18.0) (51, 18.0) (52, 18.0) (53, 18.0) (54, 19.0) (55, 19.0) (56, 19.0) (57, 19.0) (58, 19.0) (59, 19.0) (60, 19.0) (61, 19.0) (62, 19.0) (63, 19.0) (64, 19.0) (65, 19.0) (66, 19.0) (67, 19.0) (68, 19.0) (69, 19.0) (70, 19.0) (71, 19.0) (72, 19.0) (73, 19.0) (74, 19.0) (75, 19.0) (76, 19.0) (77, 19.0) (78, 19.0) (79, 19.0) (80, 19.0) (81, 19.0) (82, 19.0) (83, 19.0) (84, 19.0) (85, 19.0) (86, 19.0) (87, 19.0) (88, 19.0) (89, 19.0) (90, 19.0) (91, 19.0) (92, 19.0) (93, 19.0) (94, 19.0) (95, 19.0) (96, 19.0) (97, 19.0) (98, 19.0) (99, 19.0) (100, 19.0) (101, 19.0) (102, 19.0) (103, 19.0) (104, 19.0) (105, 19.0) (106, 19.0) (107, 19.0) (108, 19.0) (109, 19.0) (110, 19.0) (111, 19.0) (112, 19.0) (113, 19.0) (114, 20.0) (115, 20.0) (116, 20.0) (117, 20.0) (118, 20.0) (119, 20.0) (120, 20.0) (121, 20.0) (122, 20.0) (123, 20.0) (124, 20.0) (125, 20.0) (126, 20.0) (127, 20.0) (128, 20.0) (129, 20.0) (130, 20.0) (131, 20.0) (132, 20.0) (133, 20.0) (134, 20.0) (135, 20.0) (136, 20.0) (137, 20.0) (138, 20.0) (139, 20.0) (140, 20.0) (141, 20.0) (142, 20.0) (143, 20.0) (144, 20.0) (145, 20.0) (146, 20.0) (147, 20.0) (148, 20.0) (149, 20.0) (150, 20.0) (151, 20.0) (152, 20.0) (153, 20.0) (154, 20.0) (155, 20.0) (156, 20.0) (157, 20.0) (158, 20.0) (159, 20.0) (160, 20.0) (161, 20.0) (162, 20.0) (163, 20.0) (164, 20.0) (165, 20.0) (166, 20.0) (167, 20.0) (168, 20.0) (169, 19.0) (170, 19.0) (171, 19.0) (172, 19.0) (173, 19.0) (174, 19.0) (175, 19.0) (176, 19.0) (177, 19.0) (178, 19.0) (179, 18.0) (180, 18.0) (181, 18.0) (182, 18.0) (183, 16.0) (184, 16.0) (185, 16.0) (186, 16.0) (187, 15.0) (188, 15.0) (189, 15.0) (190, 15.0) (191, 15.0) (192, 15.0) (193, 13.0) (194, 13.0) (195, 12.0) (196, 12.0) (197, 10.0) (198, 10.0) (199, 10.0) (200, 10.0) (201, 7.0) (202, 7.0) (203, 7.0) (204, 7.0) (205, 7.0) (206, 7.0) (207, 6.0) (208, 6.0) (209, 5.0) (210, 5.0) (211, 5.0) (212, 5.0) (213, 5.0) (214, 5.0) (215, 4.0) (216, 4.0) (217, 4.0) (218, 4.0) (219, 3.0) (220, 3.0) (221, 3.0) (222, 3.0) (223, 3.0) (224, 3.0) (225, 3.0)

isaacovercast commented 7 years ago

I see, that makes sense. Can you send me a copy of your vcf so I can try to debug it? iovercast@gc.cuny.edu

biozzq commented 7 years ago

Dear @isaacovercast,

I have sent you some test data. hope it helps.

Wangchangsh commented 3 weeks ago

May I ask what the final solution was? I got the same problem.

isaacovercast commented 2 weeks ago

Here was the solution for this problem:

Thanks for sending the data. I have definitely figured out what's going on. I developed easySFS with RAD-like data in mind. The default will randomly select 1 snp per locus, so by default it was selecting 1 snp per chromosome on your data, which explains the bad output. If you run it like this it will use ALL the snps in your vcf file and it'll work much better:

./easySFS.py -i buzz/debug.vcf -p buzz/debug.pop -a --preview

Wangchangsh commented 2 weeks ago

Thanks for your advice.