veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

GARD on Transposable Elements #1280

Closed eliasprim closed 3 years ago

eliasprim commented 3 years ago

Hello Hyphy team,

I am a new user of GARD and am currently analysing the LTR sequences of LTR retrotransposons. The LTRs are non-coding sequences, so, following other posts here about running GARD on non-coding sequences, I use the parameters Data Type: Nucleotide, Genetic Code: Universal, Site-to-site rate variation: General Discrete, and Rate Classes: 4. The two links below are the results of two analyses, for 25 and 50 LTR sequences respectively. After checking these results, I have some questions.

http://datamonkey.org/gard/60199269d83df369d29a9cff

http://datamonkey.org/gard/5fabfe828e3372615066c304

1) In every run, I observe that the best breakpoint model is typically the last one with the most breakpoints and the lowest score of Δ AICc. Is this correct?

2) For most runs I used a small number of sequences (10-25 LTRs). In all the GARD reports, including the first one above, I observe a single peak that is always at the very start of the GARD site graph. As far as I understand, this peak corresponds to the only breakpoint that is strongly supported by the analysis. I have tried LTRs from various families and this is always the case. Is it a true result, or an artefact possibly caused by the small number of sequences? The only time a second peak appeared was when I used more LTRs (50; second link above). If a higher number of sequences is indeed needed, what is a reasonable number for getting trustworthy results?

3) In the papers where GARD was used, the authors state that there is a breakpoint at nucleotide x with p-value y. Do I have to analyse the JSON output in a specific way to get this p-value?

Thank you very much in advance for your time.

Kind regards,

Elias Primetis

spond commented 3 years ago

Dear @eliasprim,

Let me take a closer look at your results and get back to you. In the meantime:

  1. Yes, that is correct. The analysis will keep adding breakpoints until there is no further improvement in c-AIC (or the analysis runs out of allocated time).
  2. That peak looks anomalous; there should be others. Let me investigate.
  3. Older versions of GARD used to include a crude KH test; this has now been removed in favor of the overall c-AIC test. Once you infer breakpoints with GARD, you can test for incongruence downstream with IQ-TREE or RAxML (they do a much better job at it).
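As a rough illustration of the stopping rule in point 1, the sketch below computes the standard small-sample corrected AIC and stops at the first breakpoint count that fails to improve it. This is not GARD's actual search code: the function names, the choice of sample size, and the example scores are assumptions made only for illustration.

```python
def aicc(log_likelihood, n_params, sample_size):
    """Small-sample corrected AIC: -2 logL + 2k * n / (n - k - 1)."""
    if sample_size - n_params - 1 <= 0:
        raise ValueError("sample_size must exceed n_params + 1")
    return -2.0 * log_likelihood + 2.0 * n_params * sample_size / (sample_size - n_params - 1)


def best_breakpoint_count(scores):
    """Given scores[i] = AICc of the best model with i breakpoints,
    return the breakpoint count at which a greedy search would stop,
    i.e. the last count that still lowered the score."""
    best = 0
    for count in range(1, len(scores)):
        if scores[count] < scores[best]:
            best = count
        else:
            break  # no further improvement, so stop adding breakpoints
    return best


# Hypothetical scores for 0, 1, 2 and 3 breakpoints (made-up numbers).
example_scores = [1000.0, 950.0, 940.0, 945.0]
chosen = best_breakpoint_count(example_scores)
```

With these made-up scores the search stops at two breakpoints, because the third one raises the score again. If the run ends because of the time limit instead, the last model reported is simply the best one found so far.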

Best, Sergei

eliasprim commented 3 years ago

Dear @spond,

Thank you very much for your quick response. I am looking forward to your investigation results.

I just want to note that some of the sequences in the 50-sequence alignment have an insertion. Also, in this run I used slightly different parameters, because it was a trial run to better understand the GARD parameters.

Kind regards,

Elias Primetis

eliasprim commented 3 years ago

Dear @spond,

I would like to ask whether there is any update on my issue.

Thank you in advance.

Kind regards,

Elias Primetis

spond commented 3 years ago

Dear @eliasprim,

I pushed a fix for your issue to the dev branch. It will be released with 2.5.29 and pushed to Datamonkey in the next few days. In the meantime, you can run GARD locally (follow the install instructions here: https://github.com/veg/hyphy-analyses).

Once installed, run

hyphy gard --alignment /path/to/file --mode Faster 

You can visualize GARD results in https://observablehq.com/@spond/plotting-gard-breakpoint-support

Example output for one of your analyses is attached (uncompress before uploading) and breakpoint support looks like this

[Image: plotting-gard-breakpoint-support]

Best, Sergei

1280.fasta.GARD.json.zip
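For readers who would rather inspect the uncompressed JSON locally than in the notebook, here is a minimal Python sketch that ranks sites by breakpoint support. It assumes the per-site support values sit in a top-level object that maps site to support; the key name `siteBreakPointSupport` is an assumption that should be checked against your own output file, and both function names are hypothetical.

```python
import json


def load_site_support(path, key="siteBreakPointSupport"):
    """Read a GARD JSON file and return {site: support}.
    The default key name is an assumption; pass the key your file uses."""
    with open(path) as handle:
        results = json.load(handle)
    if key not in results:
        raise KeyError(f"{key!r} not found; available keys: {sorted(results)}")
    return {int(site): float(value) for site, value in results[key].items()}


def top_supported_sites(support, k=3):
    """Return the k sites with the highest support, strongest first."""
    return sorted(support, key=support.get, reverse=True)[:k]


# Made-up support values, for illustration only.
example_support = {50: 0.01, 375: 0.6, 800: 0.9, 1000: 0.4}
ranked = top_supported_sites(example_support, k=3)
```

On a real run, `top_supported_sites(load_site_support("/path/to/file.GARD.json"))` would list the candidate breakpoint positions to compare against the peaks in the plot.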

eliasprim commented 3 years ago

Dear @spond,

Thank you very much for your help.

I had already installed and run GARD locally, but I did not know how to analyse the .json output. Now I do. Thanks again.

Kind regards,

Elias Primetis

spond commented 3 years ago

Dear @eliasprim,

Make sure you check out the develop branch to gain access to the fixes immediately.

Best, Sergei

eliasprim commented 3 years ago

Dear @spond,

Yes, I will check it out. Thank you.

Kind regards,

Elias

eliasprim commented 3 years ago

Dear @spond,

I used the online GARD for 100 sequences and visualized the JSON output using the link you sent me. As you can see in the following picture, there are 3 inferred breakpoints. Can I consider them significant?

[Image]

Kind regards,

Elias

spond commented 3 years ago

Dear @eliasprim,

Yes, according to the Δ c-AIC values, the model with multiple different trees is preferred to both the null model (no recombination) and the "single tree multiple partition" (same topology but different rates) model. Looking at Figure 1, you can also see that the first breakpoint is at ~800 (strongest signal), followed by the second breakpoint at ~375, and then the third at ~1000.

If you want to perform additional validation, you can run a Shimodaira-Hasegawa-type test using RAxML or IQ-TREE.
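To make the three-way comparison concrete, the sketch below ranks the models by c-AIC and reports each model's Δ relative to the best one (lower is better). The model labels and the scores are hypothetical placeholders, not values from this analysis.

```python
def delta_aicc(scores):
    """Return each model's AICc minus the lowest AICc (0.0 for the best)."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}


def preferred_model(scores):
    """Return the name of the model with the lowest AICc."""
    return min(scores, key=scores.get)


# Made-up scores for the three models discussed above.
example_scores = {
    "no_recombination": 12000.0,        # null model: one tree, one partition
    "single_tree_partitions": 11900.0,  # same topology, different rates
    "multiple_trees": 11750.0,          # a different tree per partition
}
deltas = delta_aicc(example_scores)
winner = preferred_model(example_scores)
```

With these placeholder numbers the multiple-tree model wins, and the other two models sit 150 and 250 c-AIC units above it.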

Best, Sergei

eliasprim commented 3 years ago

Dear @spond,

Thank you, I just wanted to check that I understand the result correctly.

Thank you very much for all your help.

Kind regards,

Elias