popgenmethods / smcpp

SMC++ infers population history from whole-genome sequence data.
Mask file causes warning #200

Open yaohaojiao opened 3 years ago

yaohaojiao commented 3 years ago

Hi, I am running vcf2smc and get a warning:

$ docker run --rm -v $PWD:/mnt terhorst/smcpp:latest vcf2smc -m Oc_genome.mask.bed.gz -d gaoqiao_14 gaoqiao_14 remove_multiple_POS-2.vcf.gz ./test/test.GQ14_chr1.smc.gz Chr1 GQ:gaoqiao_11,gaoqiao_13,gaoqiao_14,gaoqiao_17,gaoqiao_18,gaoqiao_20,gaoqiao_21,gaoqiao_23,gaoqiao_3,gaoqiao_4,gaoqiao_8 --core 48

1358 smcpp.commands.vcf2smc INFO Population 1:

1358 smcpp.commands.vcf2smc INFO Distinguished lineages: gaoqiao_14:0, gaoqiao_14:1

1358 smcpp.commands.vcf2smc INFO Undistinguished lineages: gaoqiao_11:0, gaoqiao_11:1, gaoqiao_13:0, gaoqiao_13:1,

gaoqiao_17:0, gaoqiao_17:1, gaoqiao_18:0, gaoqiao_18:1, gaoqiao_20:0, gaoqiao_20:1, gaoqiao_21:0, gaoqiao_21:1,

gaoqiao_23:0, gaoqiao_23:1, gaoqiao_3:0, gaoqiao_3:1, gaoqiao_4:0, gaoqiao_4:1, gaoqiao_8:0, gaoqiao_8:1

[W::hts_idx_load3] The index file is older than the data file: Oc_genome.mask.bed.gz.tbi

40%|███▉ | 14.9M/37.5M [01:09<01:45, 215kbases/s]

70733 smcpp.util INFO Wrote 691970 observations

70733 smcpp.commands.vcf2smc WARNING Multiple entries found at 1264 positions; skipped all but the first

Does that mean my VCF file contains multiple records at the same POS? I don't see why, because I already used bcftools norm -d none to deal with that. Also, if I remove the option -m Oc_genome.mask.bed.gz, no warning appears, whether or not I run bcftools first. I have checked the mask file and don't think there is a problem with it.

How can I solve this problem?

Thank you!

yaohaojiao commented 3 years ago

Supplement:

I tried again this way but got the same warning, and I could not find the 1264 positions when I grepped my VCF file.

Chr1 126419
Chr1 126420
Chr1 126421
Chr1 126424
Chr1 126425
Chr1 126426
Chr1 126431
Chr1 126434
Chr1 126442
Chr1 126444
Chr1 126447
Chr1 126451
Chr1 126453
Chr1 126461
Chr1 126464
Chr1 1264021
Chr1 1264025
Chr1 1264042
Chr1 1264047
Chr1 1264050
Chr1 1264052
……

@terhorst Could you please help me with this issue? Many thanks!

ericgonzalezs commented 1 month ago

I am getting the same warning:

smcpp.commands.vcf2smc WARNING Multiple entries found at 1126 positions; skipped all but the first

I checked both my vcf file and my bed file for repeated positions like this:

zcat my.vcf.gz | grep -v "#"  | grep "Chr01"  |  cut -f 2 | sort | uniq -D
zcat MY.bed.gz | grep "Chr01"  | cut -f 3 | sort | uniq -D

And I didn't get any positions.
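For anyone who wants to repeat the duplicate check described above without relying on grep, here is a minimal Python sketch. It is not part of SMC++ and does not reproduce how vcf2smc counts "multiple entries"; the function names `duplicate_positions` and `duplicate_positions_in_file` are hypothetical. It assumes a tab-separated VCF with the contig in column 1 and POS in column 2, and it matches the contig name exactly, whereas grep "Chr01" would also match a contig such as Chr010.

```python
import gzip
from collections import Counter


def duplicate_positions(lines, chrom):
    """Return {pos: count} for POS values seen more than once on `chrom`.

    `lines` is any iterable of VCF text lines (hypothetical helper,
    not an SMC++ function).
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header lines only
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0] != chrom:  # exact contig match
            continue
        counts[int(fields[1])] += 1
    return {pos: n for pos, n in counts.items() if n > 1}


def duplicate_positions_in_file(path, chrom):
    """Run the same check on a gzip- or bgzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        return duplicate_positions(handle, chrom)
```

With the file names used in this comment, the call would look like `duplicate_positions_in_file("my.vcf.gz", "Chr01")`; an empty dict means no POS value is repeated on that contig.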

My bed file looks like this:

Chr01 0 398
Chr01 503 613
Chr01 710 753
Chr01 837 975
Chr01 1104 1488
Chr01 1623 2061

My VCF is a phased VCF file, phased with Beagle.

I am running the program like this:

smc++ vcf2smc -d ind1 ind2 -m my.bed.gz myvcf.vcf.gz chr1.smc.gz Chr01 ANN1:ind1,ind2,id3,ind4,ind5,ind6,ind7,ind8,ind9,ind10

If I run the program like this

smc++ vcf2smc -d ind1 ind2 myvcf.vcf.gz chr1.smc.gz Chr01 ANN1:ind1,ind2,id3,ind4,ind5,ind6,ind7,ind8,ind9,ind10

without the bed file, I don't get the warning.

If I do this on my bed file

bedtools intersect -a my.bed.gz -b my.bed.gz -c > overlap.txt

and this:

cut -f 4 overlap.txt | sort | uniq

I got only the value 1, which means there are no overlapping positions.
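The same overlap check can be sketched in Python as a cross-check on the bedtools result. This is only an illustration under stated assumptions: the helper names `read_bed_rows` and `overlapping_intervals` are hypothetical, the BED intervals are treated as half-open (start inclusive, end exclusive), and nothing here is known to be related to the cause of the vcf2smc warning.

```python
def read_bed_rows(lines):
    """Parse BED text lines into (chrom, start, end) tuples (hypothetical helper)."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            rows.append((fields[0], int(fields[1]), int(fields[2])))
    return rows


def overlapping_intervals(rows, chrom):
    """Return pairs of intervals on `chrom` that overlap each other.

    Intervals are sorted by start; each one is compared with the
    interval that reaches furthest to the right so far, so nested
    intervals are reported too.
    """
    intervals = sorted((start, end) for c, start, end in rows if c == chrom)
    clashes = []
    furthest = None
    for interval in intervals:
        # Half-open intervals overlap only if the next start is below the end.
        if furthest is not None and interval[0] < furthest[1]:
            clashes.append((furthest, interval))
        if furthest is None or interval[1] > furthest[1]:
            furthest = interval
    return clashes
```

An empty list for the mask's rows agrees with the bedtools count of 1 per interval; intervals that merely touch (one ends where the next starts) are not reported as overlapping.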

Does anyone know if I am missing something?

The smc++ version I am running is:

SMC++ v1.15.5.dev14+g6779fae

yaohaojiao commented 1 month ago

I have received your email, thank you!