iferres / pagoo

A comprehensive and intuitive encapsulated OO class system for analyzing bacterial pangenomes in R.
https://iferres.github.io/pagoo/

Is there a function pagoo_2_roary ? #58

Open diaspj opened 2 years ago

diaspj commented 2 years ago

Hello to the pagoo team,

First, congratulations on the great work you have done developing the pagoo R library.

The main questions I want to raise in this forum are the following:

1) Is there an easy and quick way of converting the pagoo pangenome object into the gene_presence_absence.csv output file and the other important output files generated by roary?

2) If not, are there plans to make one available?

3) If there are no such plans, could the pagoo team leave some quick indications on how to construct gene_presence_absence.csv and the other important roary output files from the pagoo pangenome R6 class object, PgR6MS?

I don't know much about how roary generates these output files internally, or which of them downstream software needs as input.

Generating the gene_presence_absence.csv file would be useful because, when pagoo lacks a function needed to analyze the Pangenome of a given bacterial species but other post-processing software provides it, pagoo users would still have easy access to that function.

iferres commented 2 years ago

Hi @diaspj, thanks for your interest in the software!

1) There's not.

2) We have discussed it, but there are no concrete plans yet.

3) We could write a recipe for generating one or some of the roary output files, maybe gene_presence_absence.csv, which is the one that contains the most important information. That would certainly be useful for some developers, I agree. I'm not sure about making the effort to recreate all the output files, however. I leave that to each user, since I don't have the bandwidth right now: I'm focusing on finishing my PhD (and I'm the only developer of the package).

I will try to come back with some indications later.

diaspj commented 2 years ago

Thanks Ignacio for your quick feedback,

I'm sure that you will get full marks on your PhD; pagoo is very useful for researchers who like to use R as their main scripting language.

I checked the roary output files and, based on the "cfetus_pangenome" example you provided in the protocol example, I think I now understand the structure of the gene_presence_absence.csv output file.

In fact, I also checked panaroo, and its post-processing pipeline seems to be based on only 4 output files:

1) gene_presence_absence.csv

2) gene_presence_absence.Rtab

3) final_graph.gml

4) struct_presence_absence.csv

The "struct_presence_absence.csv" seems also to be important in case we want to "clean" the plasmids from the Pangenome but, personally, I'm not that much of a fan of removing genomic information that, although rare, might be important to understand potential phenotypes observed in a particular strain.

So, counting out the "struct_presence_absence.csv", the first two files, "gene_presence_absence.csv" and "gene_presence_absence.Rtab" seem easy to construct from the pagoo computational object.
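A minimal sketch of how the second of those tables could be derived, written in plain Python rather than against the pagoo API. It assumes the gene-to-organism-to-cluster assignments have already been exported as long-format records (the `records` list and the function names are hypothetical), and it assumes the .Rtab layout is a tab-separated 0/1 matrix with a leading "Gene" column and one column per genome. The .csv file carries additional per-cluster columns and gene identifiers instead of 0/1 values, which this sketch does not attempt to reproduce.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format records: (cluster, organism, gene id).
records = [
    ("group0001", "org1", "org1_00001"),
    ("group0001", "org2", "org2_00007"),
    ("group0002", "org1", "org1_00002"),
    ("group0002", "org1", "org1_00150"),  # paralog in the same organism
    ("group0003", "org2", "org2_00042"),
]


def group_members(records):
    """Return the sorted organism names and a cluster -> organism -> genes map."""
    orgs = sorted({org for _, org, _ in records})
    members = defaultdict(lambda: defaultdict(list))
    for cluster, org, gene in records:
        members[cluster][org].append(gene)
    return orgs, members


def to_rtab(records):
    """Render a tab-separated 0/1 presence/absence matrix as a string."""
    orgs, members = group_members(records)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene"] + orgs)
    for cluster in sorted(members):
        row = [int(org in members[cluster]) for org in orgs]
        writer.writerow([cluster] + row)
    return out.getvalue()


rtab = to_rtab(records)
print(rtab)
```

The same `members` map keeps the gene identifiers per cell, so a .csv-style table could be built from it by joining the identifiers instead of writing 0/1.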

Hence, the most laborious file is "final_graph.gml", which I believe is the one that gets the most attention from panaroo users, because it provides a means to visualize the Pangenome using Cytoscape.

The panaroo team provides a tutorial on "Using a reference genome to produce less convoluted layouts", based on the script "reference_based_layout.py" and command

python ~/repos/panaroo/scripts/reference_based_layout.py 0 final_graph.gml capacity_cut_edges.txt --add_reference_edges

that allows removing the edges that introduce convolution. The graphical display of the processed Pangenome is quite convincing!

Since you are about to finish your PhD, maybe adding an extension to pagoo so that it can output an equivalent of the panaroo .gml file, i.e. a network representing the circular bacterial chromosome and its variations, might be an interesting thing for you to do, to impress the PhD jury :D

I wish you good luck with your PhD and hope to see, in some months, extensions to pagoo including "pagoo_2_roary" and the ability to output an equivalent of the panaroo .gml file.

Best regards,

Paulo Dias

iferres commented 2 years ago

To create the graph file we need the synteny information, i.e. the gene/feature coordinates on each contig (ideally chromosome) and the strand. I designed the pagoo classes to be relatively easy to extend, in other words, to make it possible to create more complex classes which inherit from pagoo's and produce more complex analyses or use information other than what the base classes hold. Since the basic pangenome structure considers only that genes belong to organisms and are assigned to clusters, I opted to leave the genetic context out of the class definition, and coordinate information is treated just like any other metadata. It can be there, but the methods just aren't aware of it. Users can add the genetic coordinates as gene metadata (from, to, strand, contig), but these basic classes won't have methods to cope with that.

That said, the idea behind pagoo was to develop a post-analysis system for a pangenome reconstruction software I'm developing (last chapter of my thesis ;) still under dev), which returns a modified pagoo class that is aware of the genetic coordinate information and provides methods to, for instance, plot the genetic context of a cluster of orthologs. In that case one could think of specific methods to plot a pangenome graph (or to produce the gml file), since this info will be mandatory.

If you used the gff files, the coordinate information is already in the object as gene metadata. One could think of a code recipe to use this information and produce a graph, but not as a method itself (i.e. which could be directly called from the object using a p$...) for base pagoo classes.
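A rough sketch of what such a recipe might look like, in plain Python and not tied to the pagoo API. It assumes the gene metadata has been exported as rows of (organism, contig, start, cluster); the `genes` list and both function names are hypothetical. Clusters whose genes are neighbours on a contig are linked, and the edge weight counts how often that adjacency is observed. This is a simple adjacency graph written as generic GML, not Panaroo's algorithm or the exact schema of its final_graph.gml. It ignores strand, does not close circular chromosomes, and omits clusters that never have a neighbour.

```python
from collections import Counter
from itertools import groupby

# Hypothetical gene metadata rows: (organism, contig, start, cluster).
genes = [
    ("org1", "c1", 100, "A"),
    ("org1", "c1", 900, "B"),
    ("org1", "c1", 1800, "C"),
    ("org2", "c1", 50, "A"),
    ("org2", "c1", 700, "C"),
    ("org2", "c2", 10, "B"),
]


def adjacency_edges(genes):
    """Count how often each pair of clusters is adjacent on a contig."""
    edges = Counter()
    ordered = sorted(genes, key=lambda g: (g[0], g[1], g[2]))
    # Walk each (organism, contig) in coordinate order.
    for _, contig_genes in groupby(ordered, key=lambda g: (g[0], g[1])):
        clusters = [g[3] for g in contig_genes]
        for left, right in zip(clusters, clusters[1:]):
            if left != right:  # skip tandem copies of the same cluster
                edges[tuple(sorted((left, right)))] += 1
    return edges


def to_gml(edges):
    """Serialize the weighted, undirected cluster graph as GML text."""
    nodes = sorted({cluster for pair in edges for cluster in pair})
    ids = {name: i for i, name in enumerate(nodes)}
    lines = ["graph [", "  directed 0"]
    for name in nodes:
        lines.append(f'  node [ id {ids[name]} label "{name}" ]')
    for (a, b), weight in sorted(edges.items()):
        lines.append(f"  edge [ source {ids[a]} target {ids[b]} weight {weight} ]")
    lines.append("]")
    return "\n".join(lines)


edges = adjacency_edges(genes)
gml = to_gml(edges)
print(gml)
```

Because every genome adds its own adjacencies, the number of distinct edges can grow quickly with the number of genomes, so a recipe like this would need care on larger datasets.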

In my experience, graph/network algorithms have to be designed very carefully with pangenome data because the complexity can easily explode with many genomes.

diaspj commented 2 years ago

Dear Ignacio,

As you said, if you have the gff files, you have the synteny information required. That is my case: I have developed my own methodology to identify “clusters of amino acid sequence similarity”, a proxy for the notion of gene family, in microorganisms of interest.

So, for me, using the Pagoo framework to construct a Pangenome computational object in the R environment makes a lot of sense, because I just need to organize my data into the columns that you require as input. If you have made it possible to import coordinates for each gene, I could easily organize my data into the necessary columns.

That being said, I would suggest that adding the ability to do synteny analysis to the Pagoo framework would be a highly desirable feature!

In my view, the most valuable thing that Pagoo has going for it is that it provides a pipeline for extracting results from the Pangenome computational object easily and quickly, taking advantage of R's awesome set of statistical and bioinformatics libraries. I believe that increasing Pagoo's value and usefulness for researchers will involve making more downstream analyses available to R users, such as those the Panaroo framework provides, including visualization of the Pangenome, Pangenome association studies, and the identification of co-evolving genes.

I have tried the Cytoscape-based analysis that the Panaroo developers suggested, and although I had ~60 bacterial genomes, it worked fine and reduced the complexity of the network. When many more genomes are used, I believe it will be possible to adjust the parameters used by Panaroo to cut more edges, thereby obtaining a simplified, less complex network.

Reproducing this Panaroo script for decreasing the complexity of the network within the Pagoo framework would also be a highly desirable feature!
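To illustrate the general idea of cutting edges, here is a deliberately simple sketch in Python. It is not a reproduction of reference_based_layout.py and does not use Panaroo's parameters. It only drops edges observed fewer than a chosen number of times, and the edge dictionary, the function name and the `min_support` threshold are all hypothetical.

```python
def prune_edges(edges, min_support):
    """Keep only the edges observed at least `min_support` times."""
    return {pair: weight for pair, weight in edges.items() if weight >= min_support}


# Hypothetical weighted edges: (cluster, cluster) -> number of observations.
edges = {
    ("A", "B"): 58,
    ("B", "C"): 57,
    ("A", "X"): 1,  # rare rearrangement or annotation artefact
    ("C", "X"): 2,
}

pruned = prune_edges(edges, min_support=3)
print(sorted(pruned))
```

A fixed threshold like this also removes rare but real rearrangements, so it trades information for a simpler layout.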

I believe that when users suspect that a single gene may have been split into two distinct entities because of a sequencing error or a bad choice made by the genome annotation tool, many will value being able to manually inspect the network at a local level, and this requires visualization.

I will be waiting for future upgrades of the Pagoo framework that make available additional functionalities such as the ones mentioned above. Continue the good work, and good luck with the PhD writing and defense,

Best regards,

Paulo Dias