therealcooperpark / hero

Highways Enumerated by Recombination Observations
MIT License
5 stars 1 forks source link

generating clusters #2

Closed smb20200615 closed 3 years ago

smb20200615 commented 3 years ago

Hello,

Many thanks for your wonderful tool. I was wondering if the metadata.tsv groupings was automatically generated by hero or if it is externally generated. If not, do you have any guidance on how to generate the groupings? I consulted the paper referenced but I am still unsure.

Also, do you have any guidance on how the roary command has to be altered if looking at recombination between different species?

Many thanks in advance,

therealcooperpark commented 3 years ago

The metadata.tsv file is externally generated. Generally, I see these groupings fall under two categories:

1) Environmental groupings - These groups will be defined on some ecological or environmental condition that distinguishes the different individuals. The examples in this category are endless, but I have typically used geographic criteria to sort isolates.

2) Genetic groupings - These groups are defined by a genetic or phylogenetic sorting mechanism. In the example provided we sort our isolates by Sequence Type identified by MLST. However, you may find other options that suit your needs as well.

The goal of providing metadata to this program is to help you identify how environmental/genetic conditions influence the distribution of recombination in your population. So your research question will likely help frame which metadata categories you want to consider when running this program.

Re Roary:

Roary is not intended to be used for comparisons above the species level. However, I have had some moderate success in doing so. You will want to modify the -i and -iv parameters depending on your needs. The -i parameter has the most obvious affect, but is also arbitrary in what you should set it to. In past work I have calculated the mean ANI across all pairs of isolates in a population then set the -i parameter to that mean ANI value. It is still arbitrary, but at least it has some relevance to the data.

smb20200615 commented 3 years ago

Many thanks for your quick response. Do you think you will integrate your pipeline with panaroo (https://github.com/gtonkinhill/panaroo) as well in the future? I would imagine the options would be -a core --aligner maaft. By default panaroo produces the folder aligned_gene_sequences (so -z option wouldn't be needed). Do you think this will work instead of roary?

Also, is there a reason that we want to generate a core gene alignment rather than pan alignment? Apologies if I am misunderstanding some concepts.

therealcooperpark commented 3 years ago

Do you think you will integrate your pipeline with panaroo (https://github.com/gtonkinhill/panaroo) as well in the future?

Unfortunately, I have since left the position where I designed this program, so I don't currently have the time to continue development on this project.

However, Panaroo actually seems like it would work directly with HERO already. The only files HERO takes from Roary are in the directory of individual gene alignments (hence the -z parameter). If you can create that same directory with Panaroo then HERO should still function the same. Based on my quick check I would set parameters as -a pan --aligner mafft if you want to get similar result files as Roary.

Also, is there a reason that we want to generate a core gene alignment rather than pan alignment? Apologies if I am misunderstanding some concepts.

This depends on your needs. The concatenated core gene alignment will not be used by HERO, but it would make a good framework for phylogenetic studies if you want to generate groupings phylogenetically. As mentioned above, to run HERO you just need a directory of individual gene alignments which should be achievable with -a pan. However, I have not used Panaroo so this is not guaranteed.

smb20200615 commented 3 years ago

Many thanks for your guidance! Greatly appreciated.

smb20200615 commented 3 years ago

Sorry for the subsequent question but it seems as though the number provided in gene_counts.txt differs from the number of recent recombination events provided by fastgear. Is there a reason for this?

therealcooperpark commented 3 years ago

Good catch!

Yes the count will differ slightly for a few reasons:

1) The count is based on the number of events per recipient group, therefore if Group X donated the same information to Group's A & B, then HERO counts that as 2 events. (There is a good argument here for why that may not be the case biologically, but it's the logic used in the program).

2) The count is based only on events that could be parsed down to the Metadata level. If the list of equally-likely donor genomes for an event do not all come from the same Metadata group, then the event will be discarded from the rest of the analyses and therefore not be counted in the gene_counts file.

3) Events that are filtered for other reasons (e.g., fragment length, Bayes Factor, external donor) are also not counted.

smb20200615 commented 3 years ago

Thank you so much for the thorough explanation. It seems like HERO makes very specific assumptions. So what type of questions do you think HERO is best suited to answering?

smb20200615 commented 3 years ago

Also, the circos plot doesn't seem to agree with the recombination rate that I calculated for each group. So groups with a higher recombination rate are showing to have many more recombination events in the circos plot. Also do the two circles in the circos plot represent donor and recipient events respectively? Thanks again!

therealcooperpark commented 3 years ago

Thank you so much for the thorough explanation. It seems like HERO makes very specific assumptions. So what type of questions do you think HERO is best suited to answering?

I think the best way to answer this is to provide my manuscript to the repo for future use. Unfortunately, due to Covid the circumstances among the authors have changed and we won't be able to finish the publication process for it. So it will live on as a "pre-print" here. You should see a new folder in the repo with the manuscript and supplementary files there.

Additionally, we've used similar (though not identical) techniques to measure HERO-like data in other manuscripts of ours. I can point to some here in case you find that useful:

https://pubmed.ncbi.nlm.nih.gov/33246085/ https://pubmed.ncbi.nlm.nih.gov/31937675/

Also, the circos plot doesn't seem to agree with the recombination rate that I calculated for each group. So groups with a higher recombination rate are showing to have many more recombination events in the circos plot.

Though we have not tested this, I would assume that a higher recombination rate would lead to a larger chunk in the circos plot, particularly as recipients. Are you saying you are not seeing this?

Also do the two circles in the circos plot represent donor and recipient events respectively?

There should only be one circle per file, so if you are seeing multiple please let me know! However, the two different circos files are the same image. The difference is that highway_circos grays out all paths that are not considered highways of recombination. The direction of donor/recipient is determined by the arrow head on the arrows.

smb20200615 commented 2 years ago

Thank you so much for your thorough explanations and help. I ran it and get the following error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "tools/hero/hero.py", line 70, in unpack_arguments
    return parse_fastgear(arg_list[0], arg_list[1], arg_list[2])
  File "tools/hero/hero.py", line 123, in parse_fastgear
    return parse_recombination(fastgear_path, Genome, gene_name)
  File "tools/hero/hero.py", line 190, in parse_recombination
    if fragment_len < length or (math.e**logbf) < bayes or d_lineage == external_donor:
OverflowError: (34, 'Numerical result out of range')
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tools/hero/hero.py", line 653, in <module>
    recombination_events = pool.map(unpack_arguments, genes)
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
OverflowError: (34, 'Numerical result out of range')

Is it because you are using floats?

Dx-wmc commented 1 year ago

Thank you so much for your thorough explanations and help. I ran it and get the following error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "tools/hero/hero.py", line 70, in unpack_arguments
    return parse_fastgear(arg_list[0], arg_list[1], arg_list[2])
  File "tools/hero/hero.py", line 123, in parse_fastgear
    return parse_recombination(fastgear_path, Genome, gene_name)
  File "tools/hero/hero.py", line 190, in parse_recombination
    if fragment_len < length or (math.e**logbf) < bayes or d_lineage == external_donor:
OverflowError: (34, 'Numerical result out of range')
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tools/hero/hero.py", line 653, in <module>
    recombination_events = pool.map(unpack_arguments, genes)
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
OverflowError: (34, 'Numerical result out of range')

Is it because you are using floats?

I encountered the same problem when I was running the last step of hero. Where did the question go wrong? Is there a solution?

ntromas commented 1 year ago

Hi! Same issue here... Any solution to fix this? Thanks!

Nico