labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io
Other
239 stars 28 forks source link

Not sure how to use genomic context option #88

Closed cmonat closed 1 year ago

cmonat commented 2 years ago

Hello,

I'm confused because of the following sentence in the documentation part Genomic context:

From version 1.2.45, it is possible to search genomic context in a pangenome graph using PPanGGOLiN. A genomic context corresponds to a group of genes/proteins with a functional interest, often found together in the genomes. They are detected by extracting a subgraph obtained by filtering edges connecting the sequences of interest in the pangenome.

If I understand correctly, it is possible to identify common genomic context between groups of prot with a specific interest. For that, I should provide the soft, theses sequences of interest.

I have made a try, by aligning some protein that I extracted from the NCBI to my pangenome (composed of 50 genomes) but it doesn't find any gene contexts. I have tried also to align some proteins that I know to be present inside my pangenome, but the results are the same: no gene contexts were found.

Did I misunderstand the purpose of this analysis, or did I miss something to make it work? Thank you in advance

Have a great day C.

axbazin commented 2 years ago

Hi,

@jpjarnoux will likely be more helpful than me on this one but he's away this week. I have not used this method a lot myself.

I think you have understood correctly the purpose of this method. Just in case, phrased differently, the idea is that, given a group of protein sequences, you try to find whether they are colocalized or not in the pangenome, so in other words if they are found generally close together among the genomes or not.

You should get an output file 'gene_context.tsv' in any case, whether the proteins are found or not, and whether they are close to each other or not. If they are indeed close to each other, they should have the same identifier in the first column of 'gene_context.tsv'. If they are found within the pangenome but not close to each other, they will have a different value in the first column, and the second column should be filled. If they are not in the pangenome, I think the behavior is that the 2nd column is filled with NA, but I'm not sure.

Does this answer you question ?

If not, maybe providing the log, or eventually the 'gene_context.tsv' file, will help me giving a more accurate answer.

Have a great day, Adelme

cmonat commented 2 years ago

Thanks for your anwser.

I have no gene_context.tsv file, the only output I got from these tries are input_to_pangenome_associations.blast-tab input_to_pangenome_associations.blast-tab_tmp sequences_partition_projection.tsv

So maybe something went wrong and cannot finish the process? But the logs seemed ok to me:

> ppanggolin context -p pangenome.h5 --sequences prot_from_ncbi.fasta -o genomic_context/ -f 2022-07-07 10:56:49 main.py:l214 INFO Command: /grid/sw/ppanggolin/1.2.74/bin/ppanggolin context -p pangenome.h5 --sequences cry_pro_from_ncbi.fasta -o genomic_context/ -f 2022-07-07 10:56:49 main.py:l215 INFO PPanGGOLiN version: 1.2.74 2022-07-07 10:56:49 readBinaries.py:l39 INFO Getting the current pangenome's status 2022-07-07 10:56:50 readBinaries.py:l389 INFO Reading pangenome annotations... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 330510/330510 [00:07<00:00, 44575.43gene/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:13<00:00, 3.80organism/s] 2022-07-07 10:57:10 readBinaries.py:l403 INFO Reading pangenome gene families... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 325979/325979 [00:06<00:00, 50700.97gene/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108769/108769 [00:06<00:00, 16009.59gene family/s] 2022-07-07 10:57:25 alignOnPang.py:l46 INFO Aligning sequences to cluster representatives... 2022-07-07 10:58:02 alignOnPang.py:l51 INFO Extracting alignments... 2022-07-07 10:58:03 searchGeneContext.py:l122 INFO Building the graph... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 8606.64families/s] 2022-07-07 10:58:03 searchGeneContext.py:l124 INFO Took 0.0 seconds to build the graph to find common gene contexts 2022-07-07 10:58:03 searchGeneContext.py:l138 INFO No gene contexts were found 2022-07-07 10:58:03 searchGeneContext.py:l140 INFO Computing gene contexts took 0.0 seconds

axbazin commented 2 years ago

Indeed, I'm afraid we will have to wait for Jérôme to return and answer this issue I do not know this module well enough, sorry.

Do you have 6 sequences in your prot_from_ncbi.fasta ? And 6 lines in sequences_partition_projection.tsv ?

cmonat commented 2 years ago

Ok, thanks :)

In the sequences_partition_projection.tsv I only have one line and there were 10 sequences in the prot_from_ncbi.fasta file ... =S

jpjarnoux commented 2 years ago

Hi !

Sorry, I've been very busy the last few weeks. I looked at your problem and on my test dataset and got a gene_context.tsv file. Maybe the issue is with using the default parameters. Could you try increasing the transitive parameter and decreasing the Jaccard similarity threshold? Run with the debug mode --verbose 2. Finally, your command will look at something like that : ppanggolin context -p pangenome.h5 --sequence my_seq.fasta -o out_directory --transitive 10 --jaccard 0.5 --verbose 2

If you find any context, we could speak more about the parameters.

cmonat commented 2 years ago

Hello,

I've made a try with your recommendations and I still don't get the final output gene_context.tsv.

The log says:

ppanggolin context -p pangenome.h5 --sequence prot_from_ncbi.fasta -o genomic_context/ --transitive 10 --jaccard 0.5 --verbose 2 2022-07-11 05:13:44 main.py:l214 INFO Command: /grid/sw/ppanggolin/1.2.74/bin/ppanggolin context -p pangenome.h5 --sequence prot_from_ncbi.fasta -o genomic_context/ --transitive 10 --jaccard 0.5 --verbose 2 2022-07-11 05:13:44 main.py:l215 INFO PPanGGOLiN version: 1.2.74 2022-07-11 05:13:44 readBinaries.py:l39 INFO Getting the current pangenome's status 2022-07-11 05:13:44 readBinaries.py:l389 INFO Reading pangenome annotations... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 330510/330510 [00:01<00:00, 176288.27gene/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:07<00:00, 6.38organism/s] 2022-07-11 05:13:53 readBinaries.py:l403 INFO Reading pangenome gene families... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 325979/325979 [00:03<00:00, 95830.75gene/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108769/108769 [00:04<00:00, 25738.82gene family/s] 2022-07-11 05:14:02 alignOnPang.py:l45 DEBUG mmseqs search /tmp/28961776.1.all.q/tmpzc166p7c/tmpck3wtpa /tmp/28961776.1.all.q/tmpzc166p7c/tmpgzhiglx6 /tmp/28961776.1.all.q/tmpzc166p7c/tmp69tu3gp /tmp/28961776.1.all.q/tmpzc166p7c -a --min-seq-id 0.5 -c 0.8 --cov-mode 1 --threads 1 2022-07-11 05:14:02 alignOnPang.py:l46 INFO Aligning sequences to cluster representatives... 2022-07-11 05:14:27 alignOnPang.py:l50 DEBUG mmseqs convertalis /tmp/28961776.1.all.q/tmpzc166p7c/tmpck3wtpa /tmp/28961776.1.all.q/tmpzc166p7c/tmpgzhiglx6 /tmp/28961776.1.all.q/tmpzc166p7c/tmp69tu3gp genomic_context//input_to_pangenome_associations.blast-tab_tmp --format-mode 2 2022-07-11 05:14:27 alignOnPang.py:l51 INFO Extracting alignments... 2022-07-11 05:14:28 searchGeneContext.py:l122 INFO Building the graph... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 8530.79families/s] 2022-07-11 05:14:28 searchGeneContext.py:l124 INFO Took 0.0 seconds to build the graph to find common gene contexts 2022-07-11 05:14:28 searchGeneContext.py:l126 DEBUG There are 0 nodes and 0 edges 2022-07-11 05:14:28 searchGeneContext.py:l138 INFO No gene contexts were found 2022-07-11 05:14:28 searchGeneContext.py:l140 INFO Computing gene contexts took 0.0 seconds

When I ls the output dir I got this :

ls genomic_context/ input_to_pangenome_associations.blast-tab input_to_pangenome_associations.blast-tab_tmp sequences_partition_projection.tsv

Maybe 10 sequences to find genomic context is not enough? Thanks for your help

Regards C.

jpjarnoux commented 2 years ago

What I understand is that PPanGGOLiN find corresponding gene families to your input sequences. Nevertheless, in the pangenome, a context is not find. Is it possible for genes to be more than 10 genes apart in genomes? Because I think that the problem comes more from the parameters used than from a bug in the code. In any case, I will work on the code to make it more explicit and to understand what is going on. In the meantime, we should increase the transitivity to see if we don't end up detecting something.

cmonat commented 2 years ago

Ok, thank you very much for your help. I'll make some tests increasing the transitivity, thanks

Have a great day C.