szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
111 stars 33 forks source link

get ancestral allele #18

Closed biozzq closed 8 years ago

biozzq commented 8 years ago

Hi,

Very nice tool. After reading the manual, I have some puzzles for the input files. I can not understand the ancestral allele when I just have a domesticated population. Can I just use the reference allele in the VCF file instead of ancestral allele ?

More, If I have the ancestral allele file, can I include the heterozyous SNP in the hap file ? From the example files deposited in the selscan directory, I just saw a single haploid copy from an individual, does this mean the derived allele should be homozygous in a diploid individual ?

I hope you can help me with above questions.

Many thanks!

szpiech commented 8 years ago

If ancestral allele information isn't available, you can use reference/alternate coding instead. You should include all genotypes that you have information on, and then filter loci with a MAF < 0.05 (either yourself or with selscan's --skip-low-freq flag). Also, please note that your data must be phased.

Whatever the ploidy of your organism, you should include all copies (the example file was generated with a simulation program that simulates haploids).

Hope this helps!

biozzq commented 8 years ago

Hi @szpiech,

Thanks, yes I will phased my data using beagle 4, using following command, Is this right?

java -Xmx500g -jar beagle.14Jan16.841.jar gt=input.vcf nthreads=6 out=output gprobs=true

Before this, I will have a quality control (eq, MAF>=0.05, missing rate < 0.1).

If I use the reference allele instead of ancestral allele, how should I prepare the hap file for a diploid individua when came cross a heterozygous site? It will contain both ancestral allele and derived allele. For example, I have three samples including one locus, what does it look like for the hap file?

example chr pos ref alt sample1 sample2 sample3 1 20 A G 0|1 1|0 1|1 1 50 G T 1|0 0|1 0|0

In my understanding on your comments above, the ancestral allele while be A and derived allele is G, than the hap file will like following, is this right? 0 1 1 0 1 0 0 1 1 0 1 0

many thanks. best wishes!

biozzq commented 8 years ago

Hi @szpiech ,

I noticed that the genotype after running beagle contains allele separator |, I think I can prepare the hap file with the two copies file just like above.

Many thanks!

szpiech commented 8 years ago

Hi @biozzq, I haven't phased with Beagle in a while, so I can't offer much advice on using it for phasing. Please note that selscan currently does not handle missing data, so you'll have to either remove those loci or impute them.

Selscan will accept VCF files, and what you've written

chr pos ref alt sample1 sample2 sample3
1 20 A G 0|1 1|0 1|1
1 50 G T 1|0 0|1 0|0

will be sufficient. Although you will need to also provide a mapfile if calculating anything other than nSL. Let me know if you have any trouble!

biozzq commented 8 years ago

Hi @szpiech,

Recently, I have a test (calculate iHS) on my own data, before this, I have filtered the site with MAF < 0.05, but nothing records in the output file using both VCF file and tped file.

The commands: selscan --ihs --vcf test.vcf --map test.map --out test_ihs --threads 20 selscan --ihs --tped test.tped --out test_tped --threads 20

I noticed that the sites on the chromosome edge have been skipped from the log file. selscan --ihs --tped test.tped --out test_tped --threads 20 v1.1.0a Calculating iHS. Input filename: test.tped Output file: test_tped.ihs.out Threads: 20 Scale parameter: 20000 Max gap parameter: 200000 EHH cutoff value: 0.05 Alt flag set: no WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at chr10:2498 WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at chr10:3015 WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at chr10:3410 WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at chr10:2472 WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at chr10:3723 but most sites are not skipped, why it still does not give me anything?

The input format like following chr10 chr10:2472 0 2472 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 chr10 chr10:2498 0 2498 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1

What is more, when I calculate EHH, it will give me an error for the flag not recognized. Can I use these formats to calculate EHH ? Many thanks!

szpiech commented 8 years ago

Well, it looks like you have no genetic map information. If you don't have a genetic map, you can duplicate your physical distance column and put it in the genetic map column or you can use nSL.