Clarification about the contents of `gene_to_gene_family.tsv` from projection

labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes

https://ppanggolin.readthedocs.io


Clarification about the contents of `gene_to_gene_family.tsv` from projection #221

Closed szhan closed 1 week ago

szhan commented 2 months ago

I have been running projection on a reconstructed pangenome and a set of assembly FASTA files for input genomes, in order to assign each gene in each input genome to a gene family in the pangenome.

I tried consulting the documentation about the output of projection, but the link doesn't seem to go anywhere (https://github.com/labgem/PPanGGOLiN/blob/f3ba6a1f33256f19175b570c4b711bb8970d0365/docs/user/projection.md).

The documentation states that gene_to_gene_family.tsv "provides the mapping of genes to gene families of the pangenome." I was expecting one line per gene of an input genome, with each line assigning that gene to a gene family in the reconstructed pangenome. But this isn't what I got. Instead, I got files with hundreds of thousands of lines, even though each input genome contains only 2.5k to 2.9k genes.

Any clarifications would be much appreciated. Thank you in advance.
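
For anyone wanting to reproduce the observation, a minimal sketch for counting rows per genome could look like this. It assumes one sub-directory per input genome under the projection output directory, each holding its own `gene_to_gene_family.tsv`; the function name and the layout are illustrative assumptions, not part of PPanGGOLiN.

```python
from pathlib import Path


def count_rows_per_genome(projection_dir, filename="gene_to_gene_family.tsv"):
    """Count non-empty lines of each per-genome mapping file.

    Assumes (hypothetically) a layout of <projection_dir>/<genome>/<filename>.
    Any header line, if present, is counted as well.
    """
    counts = {}
    for tsv in sorted(Path(projection_dir).rglob(filename)):
        with tsv.open() as handle:
            # Key by the directory name, taken here as the genome name.
            counts[tsv.parent.name] = sum(1 for line in handle if line.strip())
    return counts
```

Comparing these counts with the number of genes predicted for each genome (2.5k to 2.9k here) shows directly whether a file holds more than one genome's worth of rows.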

axbazin commented 2 months ago

Hi,

The "projection" documentation about its output files is here: https://ppanggolin.readthedocs.io/en/latest/user/projection.html#output-files

However, you are right that the current behavior is not the intended one, and I see where the bug is. Currently, the "gene_to_gene_family.tsv" file contains this information for ALL given input genomes, not just for the single input genome of its output directory. The file is likely identical across the different "input genome" output directories. We'll get a fix for this in the upcoming version.

Thank you very much for the bug report.

Adelme
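
Until a fixed version is available, one possible workaround is to filter the oversized file down to the genes of a single genome. The sketch below assumes the gene identifier sits in the first tab-separated column and that the gene IDs of the genome of interest are already known from elsewhere; both are assumptions to verify against the actual files, and the function name is hypothetical.

```python
import csv


def filter_rows_for_genome(tsv_path, gene_ids, gene_column=0):
    """Keep only rows whose gene ID belongs to one genome.

    gene_column=0 is an assumption about the file layout, not a
    documented guarantee; adjust it after inspecting the file.
    """
    wanted = set(gene_ids)
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [
            row
            for row in reader
            if len(row) > gene_column and row[gene_column] in wanted
        ]
```

A header line, if the file has one, is dropped by this filter unless its first field happens to match a gene ID.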

szhan commented 2 months ago

Thank you for the explanation. I checked a few input genomes to see whether "the file is likely equal between the different 'input genome' output directories", but that did not seem to be the case. I look forward to the updated version. Thank you.
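
A check like the one described above can be sketched by hashing each file and grouping the directories that share a digest. The directory layout (one sub-directory per input genome) and the function name are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(projection_dir, filename="gene_to_gene_family.tsv"):
    """Group genome directories whose mapping files are byte-identical.

    Assumes (hypothetically) a layout of <projection_dir>/<genome>/<filename>.
    """
    groups = defaultdict(list)
    for tsv in sorted(Path(projection_dir).rglob(filename)):
        digest = hashlib.sha256(tsv.read_bytes()).hexdigest()
        groups[digest].append(tsv.parent.name)
    return list(groups.values())
```

A single group containing every genome would mean all files are identical; several groups would match the observation that the files differ. Note that files with the same rows in a different order still hash differently, so this only detects exact byte-level equality.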

szhan commented 2 months ago

Also, the broken link I was referring to is https://github.com/labgem/PPanGGOLiN/blob/f3ba6a1f33256f19175b570c4b711bb8970d0365/docs/user/Outputs.md#gene-families-and-genes, which no longer seems to exist. It is linked from https://github.com/labgem/PPanGGOLiN/blob/f3ba6a1f33256f19175b570c4b711bb8970d0365/docs/user/projection.md

axbazin commented 2 months ago

Alright, thank you for the additional input. Indeed, I misunderstood what you meant; I see the broken link now! We will fix this as well.

JeanMainguy commented 1 week ago

The fix for this issue has been released in v2.1.0.