Gene family count matrix related with representative gene family.

aababc1 commented 7 months ago

Thank you for creating such a great tool. I have one question: your tool doesn't seem to generate a gene count matrix directly. If I want a count table in addition to the gene presence/absence matrix, do I need to generate it manually from the gene_families.tsv information, or is there another ppanggolin command that supports this? Thank you in advance for your answer.

axbazin commented 7 months ago

Thank you for your kind words!

Indeed, I do not think we have such a file among the possible outputs. The closest would probably be the "matrix.csv" file (see this), which has all the information you want but contains the list of genes rather than the raw count itself. Transforming this file may be your easiest way out.
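A minimal sketch of that transformation, using only the Python standard library. It assumes the file is comma-separated, that the genome columns follow a fixed number of leading descriptive columns (the hypothetical `n_meta_cols` parameter, to be checked against the header of your own file), and that the genes listed in one cell are separated by whitespace. None of these details are confirmed in this thread.

```python
import csv
import io


def count_matrix_from_gene_lists(handle, n_meta_cols, family_col=0):
    """Convert a table whose genome cells hold gene lists into per-genome counts.

    n_meta_cols is an assumption: the number of leading columns that describe
    the family rather than a genome. Inspect your own header to set it.
    """
    reader = csv.reader(handle)
    header = next(reader)
    genomes = header[n_meta_cols:]
    counts = {}
    for row in reader:
        if not row:
            continue
        # str.split() with no argument splits on any whitespace,
        # so an empty cell yields an empty list and a count of 0.
        counts[row[family_col]] = [len(cell.split()) for cell in row[n_meta_cols:]]
    return genomes, counts


# Toy input with two descriptive columns and two genomes (made-up data).
toy = "Gene,Annotation,genomeA,genomeB\nfam1,x,g1 g2,g3\nfam2,y,,g4\n"
genomes, counts = count_matrix_from_gene_lists(io.StringIO(toy), n_meta_cols=2)
```

With a real file, the `io.StringIO` object would be replaced by `open("matrix.csv", newline="")`.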

Otherwise, using the gene_families.tsv file along with one of the files that link genes to genomes (e.g., the gff or the gene annotation table files) is a good solution too.
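This second route could be sketched as follows. The layout assumed here (tab-separated, no header line, family ID in the first column and gene ID in the second) should be verified against your own gene_families.tsv, and the `gene_to_genome` dictionary is a hypothetical input that you would build yourself from the gff or annotation files.

```python
import csv
import io
from collections import Counter, defaultdict


def family_counts(families_handle, gene_to_genome):
    """Count genes per (family, genome) pair.

    Assumed layout: tab-separated lines, family ID then gene ID.
    gene_to_genome: dict mapping gene ID -> genome name (built elsewhere).
    A header line, if your file has one, must be skipped before calling this.
    """
    counts = defaultdict(Counter)
    n_genes = 0
    for row in csv.reader(families_handle, delimiter="\t"):
        if len(row) < 2:
            continue
        family, gene = row[0], row[1]
        counts[family][gene_to_genome[gene]] += 1
        n_genes += 1
    return counts, n_genes


def write_count_table(counts, genomes, out_handle):
    """Write one row per family and one column per genome."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["family"] + genomes)
    for family in sorted(counts):
        # A Counter returns 0 for genomes in which the family is absent.
        writer.writerow([family] + [counts[family][g] for g in genomes])


# Toy data (made up) standing in for the real files.
toy_families = "famA\tg1\t\nfamA\tg2\tF\nfamB\tg3\t\n"
toy_map = {"g1": "genome1", "g2": "genome2", "g3": "genome1"}
counts, n_genes = family_counts(io.StringIO(toy_families), toy_map)
out = io.StringIO()
write_count_table(counts, ["genome1", "genome2"], out)

# Sanity check: the counts should add up to the number of gene lines read.
total = sum(sum(c.values()) for c in counts.values())
assert total == n_genes
```

The final check mirrors a simple consistency test: the sum of all counts in the table should equal the number of gene lines in gene_families.tsv.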

Adelme

aababc1 commented 7 months ago

Thank you for your prompt response and suggestion!

I have two more questions.

Following your suggestion, I wrote a custom script for that. The total number of genes in the result file matches the number of lines in gene_families.tsv. I would be grateful if you could check whether the attached file can be routinely used with ppanggolin in my analysis pipeline.
make_count_table.py.txt

First, ppanggolin marks fragmented genes with an F in the third column of the gene_families.tsv file. I included them in the analysis, since I thought fragmented gene sequences could still be functionally annotated and used for downstream analysis. What is your opinion on including fragmented genes in downstream analysis?

Second, about the gene family threshold. In the paper, 80% coverage and 80% identity were used for gene family construction. These values are frequently used for gene family clustering, but I have a question about adjusting coverage and identity at the species level in microbial comparative genomic analysis. If the species are different, should users choose different clustering criteria, or can the default values be used? And if someone lowers the identity to 50%, could that introduce severe bias in downstream analyses that rely on functional annotation of the pangenome reference sequences?

Thank you very much, Adelme.


axbazin commented 7 months ago

Your script looks fine to me; it does seem to be doing what you want.

About the fragmented genes, it depends on the "downstream analysis" and the biological question. From the technical point of view, if you are annotating genes independently from their gene families, then I think it is fine. If you are doing functional annotations at the scope of gene families, I'd remove them as they may not be able to realize the "function" that they would be annotated with.
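If you choose to remove fragmented genes before family-level annotation, the filter could look like the sketch below. It relies on the layout described earlier in this thread (an F in the third column of gene_families.tsv for fragmented genes); the tab separator, the absence of a header line, and the function name `drop_fragments` are assumptions made for illustration.

```python
import csv
import io


def drop_fragments(families_handle, flag_col=2, flag="F"):
    """Yield (family, gene) pairs, skipping genes flagged as fragmented.

    flag_col and flag follow the layout described in this thread
    (an "F" in the third column); adjust them if your file differs.
    """
    for row in csv.reader(families_handle, delimiter="\t"):
        if len(row) < 2:
            continue
        # Rows without a third column are treated as not fragmented.
        is_fragment = len(row) > flag_col and row[flag_col].strip() == flag
        if not is_fragment:
            yield row[0], row[1]


# Toy data (made up): g2 is flagged as a fragment and is dropped.
toy = "famA\tg1\t\nfamA\tg2\tF\nfamB\tg3\n"
kept = list(drop_fragments(io.StringIO(toy)))
```

Keeping the flag column and value as parameters makes it easy to run the same analysis with and without fragments and compare the results.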

For the question of the gene family threshold: indeed, I'd recommend lowering the identity threshold for clustering when the species are different. If they are "close" sister species (e.g. Neisseria meningitidis and Neisseria gonorrhoeae), 80% is fine, but in general lowering it is better. However, you are correct: it may generate a strong bias if you are annotating your gene families, as some paralogs with different functions may then be annotated in exactly the same way. That will only be true for some families, though. It's a balance between wrongly clustered paralogs and wrongly split orthologs, and you can adjust the threshold depending on what's important to your own analysis or biological question.

In my opinion, while annotating gene families is "practical" and much faster, annotating genes directly is still best if you want to avoid mistakes as much as possible.

Have a nice day! Adelme

aababc1 commented 7 months ago

Thank you so much for your very detailed explanation.

I asked about the gene family clustering threshold and fragmented genes because I am handling fragmented genomes such as MAGs. As you commented, annotating genomes individually will give the best accuracy, I think. I will test some things based on your advice. My questions are all resolved. Thank you once again.

Have a nice day!

labgem / PPanGGOLiN

Gene family count matrix related with representative gene family. #217