Question about the - Githubissues

Orz-CQ commented 1 year ago

Hi Paul, Amazing software!

I have several questions about HyDe. I understand that the input data should be diploid data or ambiguous sites. However, I have a diploid species and a tetraploid species, and hybridization between them produced a triploid species.

I have assembled phased genomes for all three species. After whole-genome alignment, I intend to test whether the triploid species was indeed generated by a cross between the diploid and tetraploid species.

Could I use all of the phased genome alignments as input? In the map file, the diploid would contain hap1 and hap2, the triploid would contain hap1, hap2, and hap3, and so on. Is it correct to perform the analysis this way?

pblischak commented 1 year ago

Thanks! This sounds like a really cool system! I think what you are describing, using the phased chromosomes/haplotypes as "individuals" from the same population or species, makes sense. So your map file would look something like this:

map.txt

hap1 diploid
hap2 diploid
hap1 triploid
hap2 triploid
hap3 triploid
hap1 tetraploid
hap2 tetraploid
hap3 tetraploid
hap4 tetraploid

The only other thing is do you have a fourth species to use as an outgroup?

Orz-CQ commented 1 year ago

Thanks for your suggestion!

I got a very confusing result using the map.txt: the tetraploid was inferred to be the hybrid of the diploid and the triploid. For background, we already know that massive introgression or ILS may have occurred in this species. The total number of sites used for this calculation is 492,871,141, which I think is enough to get powerful and robust results. The gamma is 0.6467936334503863.

What is your opinion?

pblischak commented 1 year ago

I'm wondering if something with the ordering of the taxa in the map file and data file is causing an issue. Are the samples in the same order in both? And are all of the individuals in each taxon listed together?

Orz-CQ commented 1 year ago

Sorry for my late reply. I used the same map.txt that you provided, so I think the answer to both questions is yes.

pblischak commented 1 year ago

The map file I put above was just a rough example, so if the data don't match it exactly, it could definitely cause some issues. That map file also doesn't include an outgroup. I think double-checking that everything is correctly aligned and that the names of the taxa in the data file and map file are the same would be good. If the data seem correct and you are getting the same result, then let me know and we can take a closer look.

Orz-CQ commented 1 year ago

Sorry for my confusing reply. The map file looks like this:

A021    A02
A022    A02
A131    A13
A132    A13
A133    A13
A134    A13
A111    A11
A112    A11
A113    A11
ABC     out

The names A021, A022, A131, A132... are the sequence names in the phylip file, which looks like this:

10 sequence length
A021    CTAAACCCTAAACC
A022    -------------------
A111    CTAAACCCT

I performed the genome-scale alignment with Cactus. One thing to note is that there are many absent sites or gaps in the alignment results.

pblischak commented 1 year ago

Okay, so a couple of things are sticking out to me. First, are you putting the string "sequence length" in your phylip file? If so, it needs to be a numeric value representing the actual sequence length in the input data. Second, in your example, the order of the individuals in the map file doesn't match the order in the data file: individual A111 is 3rd in the data file but 7th in the map file. The order of individuals needs to be the same between the two files.

I don't think gaps should be an issue, because HyDe should be able to handle them.
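The two ordering requirements mentioned in this thread (same individual order in both files, and all individuals of a taxon listed together) can be checked before running HyDe. Below is a minimal sketch, not part of HyDe itself. It assumes a phylip file with one header line followed by one `name sequence` record per line, as in the example above, and a whitespace-separated map file. The function names and the file names `data.phy` and `map.txt` are hypothetical.

```python
from pathlib import Path


def read_phylip_names(path):
    """Return sequence names in file order, skipping the header line.

    Assumes one record per line with the name as the first field.
    """
    lines = Path(path).read_text().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]


def read_map(path):
    """Return (individual, taxon) pairs in file order."""
    pairs = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            individual, taxon = line.split()[:2]
            pairs.append((individual, taxon))
    return pairs


def check_order(phylip_names, map_pairs):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    map_names = [individual for individual, _ in map_pairs]

    if len(phylip_names) != len(map_names):
        problems.append(
            f"data file has {len(phylip_names)} sequences, "
            f"map file has {len(map_names)} individuals"
        )

    # Individuals must appear in the same order in both files.
    for position, (data_name, map_name) in enumerate(
        zip(phylip_names, map_names), start=1
    ):
        if data_name != map_name:
            problems.append(
                f"position {position}: data file has {data_name}, "
                f"map file has {map_name}"
            )

    # All individuals of a taxon must be listed in one contiguous block.
    blocks = []
    for _, taxon in map_pairs:
        if blocks and blocks[-1] == taxon:
            continue
        if taxon in blocks:
            problems.append(f"taxon {taxon} is not listed in one contiguous block")
        blocks.append(taxon)

    return problems


# Hypothetical usage with files on disk:
# for problem in check_order(read_phylip_names("data.phy"), read_map("map.txt")):
#     print(problem)
```

Run on the example above, this would flag position 3, where the data file has A111 and the map file has A131.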

Orz-CQ commented 1 year ago

The "sequence length" has been masked; in the real data it is a number :) I will change the order and try again. Once I get the results, I will let you know.

Orz-CQ commented 1 year ago

Hi @pblischak, after changing the order of individuals in the map file, I got the theoretically expected result: the triploid species was generated by a cross between the diploid and tetraploid species.

But the gamma value is 0.7764855770352849, which would mean the diploid contributed nearly 78% of the genome? This number is supposed to be 33%.
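The 33% expectation can be spelled out as a quick calculation. This is only a sketch, assuming the triploid received one chromosome set from the diploid parent and two from the tetraploid parent. Which parent the reported gamma refers to depends on how the parents are ordered in the HyDe output, so both shares are compared here. The function name `expected_share` is hypothetical.

```python
from fractions import Fraction


def expected_share(sets_from_parent, total_sets):
    """Expected genome fraction contributed by one parent."""
    return Fraction(sets_from_parent, total_sets)


# Assumption: 1 chromosome set from the diploid, 2 from the tetraploid.
diploid_share = expected_share(1, 3)
tetraploid_share = expected_share(2, 3)

observed_gamma = 0.7764855770352849  # value reported in this thread

# Distance of the observed gamma from each expected share.
gap_to_diploid = abs(observed_gamma - float(diploid_share))
gap_to_tetraploid = abs(observed_gamma - float(tetraploid_share))

print(f"expected diploid share:    {float(diploid_share):.3f}")
print(f"expected tetraploid share: {float(tetraploid_share):.3f}")
print(f"gap to diploid share:      {gap_to_diploid:.3f}")
print(f"gap to tetraploid share:   {gap_to_tetraploid:.3f}")
```

Under this assumption the observed gamma is about 0.44 away from 1/3 and about 0.11 away from 2/3, so it is closer to the tetraploid share but matches neither exactly.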

I also tried bootstrap_hyde.py with 100 replicates. The gamma value ranged from 0.33 to 0.98.

Could you please give me some suggestions?

pblischak / HyDe

Question about the #38