Closed IsabelFE closed 2 years ago
I am attaching the files I am using:
The .tsv file created from the anvi-summarize
info:
Clusters_Dpig_Anvio7.tsv.txt
The list files for the .fasta and .gff files: Anvio7GenomesAnno.txt Anvio7GenomesFasta.txt
The .fasta and .gff files exported from Anvio: Anvio7_Exported_Genomes.zip
Hello, Your error "UnboundLocalError: local variable 'contig' referenced before assignment" is, funnily enough, related to an issue that was fixed today (a gff not having '##sequence-region' at the beginning), so you'll need to install the github version of ppanggolin for it to work. However, the first gff that I opened, "Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff", is extremely strange. All genes start on position 1 even though they are on the same dna molecule 'c_000000000001'. This does not sound right to me. Unless I am misreading something, there is probably a problem there.
Adelme
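The two symptoms described above (no '##sequence-region' line, and every gene starting on position 1) can be checked before running ppanggolin. This is only a minimal sketch, assuming tab-separated GFF3 lines; the function name check_gff is hypothetical and is not part of PPanGGOLiN or anvi'o.

```python
def check_gff(lines):
    """Return a list of warnings for a GFF given as an iterable of lines.

    Hypothetical helper: assumes tab-separated GFF3 columns
    (seqid, source, type, start, end, score, strand, phase, attributes).
    """
    lines = [ln.rstrip("\n") for ln in lines]
    warnings = []

    # Symptom 1: the '##sequence-region' directive is missing.
    if not any(ln.startswith("##sequence-region") for ln in lines):
        warnings.append("no '##sequence-region' directive found")

    # Symptom 2: all CDS features start on position 1.
    starts = []
    for ln in lines:
        if not ln or ln.startswith("#"):
            continue
        fields = ln.split("\t")
        if len(fields) >= 5 and fields[2] == "CDS":
            starts.append(int(fields[3]))
    if len(starts) > 1 and all(s == 1 for s in starts):
        warnings.append("every CDS starts at position 1")

    return warnings
```

For a file on disk, something like check_gff(open(path)) should be enough; an empty list means neither symptom was seen.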
Hi,
Yeah, I saw the issue with not having the '##sequence-region'. I will install the new version. I also thought that the .gff files exported from Anvio look weird: they have the right end positions, but they all start on position 1. @meren is this an issue with Anvio not exporting the gff properly?
Thanks @axbazin
Dear @axbazin and @IsabelFE,
Apologies for the late response. I finally had a chance to take a look at the export-gff function anvi-get-sequences-for-gene-calls is using and found a bug we recently introduced. I have now fixed the issue via https://github.com/merenlab/anvio/commit/a9f73284224f12024247c8b46f5f268ed929733b in our main branch, which means, if you are tracking the main branch, your new GFF files should look much better, @IsabelFE.
Thank you very much for your patience and investigating how to make these two tools talk. Please let me know if you run into other problems and I will be happy to work on them on our end. Using your experience and expertise, we can even implement an anvi'o script to export everything necessary from an anvi'o pangenome to be able to continue working with it in PPAnGGOLiN.
Best wishes,
Hello,
Looking more carefully at the files, there might be one last problem. To be able to differentiate the genes within the cluster file, genes must have a unique identifier across all of the gff files and in the cluster file. As it stands, it seems that Anvi'o gives incremental gene identifiers from 1 to n in each genome, which will make it impossible to tell apart genes from different genomes.
A possible solution would be to give unique IDs to each gene and generate the cluster file with those, but I am not very familiar with Anvi'o so I do not know how doable it is.
Of course, an ideal solution for this time and the next ones would most likely be to have a script such as @meren suggested, one that generates the gff and the tsv files with unique identifiers.
Adelme
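To see whether a set of gff files has this problem, one option is to collect the ID attribute of every feature and report IDs that occur in more than one genome. A rough sketch, assuming tab-separated GFF3 lines with an ID=... attribute in column 9; both function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the ID attribute at the start of column 9 or after a ';'.
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def gff_gene_ids(lines):
    """Return the ID attribute of every feature line, in file order."""
    ids = []
    for ln in lines:
        if ln.startswith("#") or not ln.strip():
            continue
        fields = ln.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = ID_PATTERN.search(fields[8])
        if match:
            ids.append(match.group(1))
    return ids


def ids_shared_between_genomes(gffs):
    """gffs maps a genome name to the lines of its GFF.

    Returns {gene_id: [genomes]} for IDs seen in more than one genome.
    """
    seen = defaultdict(set)
    for genome, lines in gffs.items():
        for gene_id in gff_gene_ids(lines):
            seen[gene_id].add(genome)
    return {gid: sorted(names) for gid, names in seen.items() if len(names) > 1}
```

An empty result means every gene ID belongs to a single genome, which is the property the cluster file relies on.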
We certainly can fix that :) If I have access to some 'ideal input files' from @IsabelFE, @merenlab can use that as a model to develop a tool.
Dear @axbazin and @meren,
Thanks a lot for helping solve the compatibility between your tools. I updated Anvio to the main GitHub branch version and the issue with the gffs is solved. As @axbazin indicated I got an error about the gene IDs.
From the anvi-summarize output I generated the .tsv file in R like this:
library(tidyverse)
Dpig_Anvio7 <- read_delim("analysis_Anvio7/Pangenomic_Results_Dpig/Dpig-PAN-SUMMARY/PAN_DPIG_prokka_gene_clusters_summary.txt.gz", "\t")
# Build new_id as genome_name + "_" + gene_callers_id
Dpig_Anvio7 <- Dpig_Anvio7 %>%
  unite(new_id, genome_name:gene_callers_id, sep = "_", remove = FALSE)
Clusters_Dpig_Anvio7 <- select(Dpig_Anvio7, c(gene_cluster_id, new_id))
The output looks like this Clusters_Dpig_Anvio7.tsv.txt:
GC_00000001 ATCC_51524_900
GC_00000001 KPL1914_1178
GC_00000001 KPL1914_475
GC_00000001 KPL1922_CDC39_95_1080
GC_00000001 KPL1922_CDC39_95_135
That is, the first column is the gene_cluster_id and the second is new_id (genome_name + gene_callers_id concatenated, separated by `_`). (Should I have separated them with something different from `_`?)
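For anyone not using R, the same two-column table can be sketched in Python. This is only an illustration under assumptions: it relies on the gene_cluster_id, genome_name and gene_callers_id columns named above, and the function name cluster_rows is made up.

```python
import csv


def cluster_rows(summary_lines, sep="_"):
    """Yield (gene_cluster_id, new_id) pairs from an anvi-summarize table.

    summary_lines is any iterable of tab-separated lines with a header.
    new_id is genome_name + sep + gene_callers_id, as in the R code.
    """
    reader = csv.DictReader(summary_lines, delimiter="\t")
    for row in reader:
        new_id = f'{row["genome_name"]}{sep}{row["gene_callers_id"]}'
        yield row["gene_cluster_id"], new_id
```

The gzipped summary can be passed in with gzip.open(path, "rt"), and each pair written out as one tab-separated line of the .tsv file.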
@meren, is there a way to export the .gffs using this new_id format instead of the gene_callers_id?
Like:
c_000000000001 . CDS 1 1338 . + . ID=ATCC_51524_1
c_000000000001 . CDS 1569 2705 . + . ID=ATCC_51524_2
c_000000000001 . CDS 2921 3157 . + . ID=ATCC_51524_3
c_000000000001 . CDS 3160 4305 . + . ID=ATCC_51524_4
c_000000000001 . CDS 4308 6254 . + . ID=ATCC_51524_5
c_000000000001 . CDS 6349 9012 . + . ID=ATCC_51524_6
c_000000000001 . CDS 9159 9896 . + . ID=ATCC_51524_7
Instead of:
c_000000000001 . CDS 1 1338 . + . ID=1
c_000000000001 . CDS 1569 2705 . + . ID=2
c_000000000001 . CDS 2921 3157 . + . ID=3
c_000000000001 . CDS 3160 4305 . + . ID=4
c_000000000001 . CDS 4308 6254 . + . ID=5
c_000000000001 . CDS 6349 9012 . + . ID=6
c_000000000001 . CDS 9159 9896 . + . ID=7
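Until the export itself produces such IDs, the rewrite from the second form to the first could be done after the fact. A minimal sketch, assuming tab-separated GFF3 lines whose column 9 carries an ID=... attribute; prefix_gff_ids is a hypothetical helper, not an anvi'o function.

```python
import re


def prefix_gff_ids(lines, genome_name, sep="_"):
    """Return GFF lines with every ID attribute prefixed by genome_name + sep.

    Comment/directive lines and lines with fewer than 9 columns are kept as-is.
    """
    out = []
    for ln in lines:
        ln = ln.rstrip("\n")
        fields = ln.split("\t")
        if ln.startswith("#") or len(fields) < 9:
            out.append(ln)
            continue
        # Only touch the ID attribute (at the start of column 9 or after ';').
        fields[8] = re.sub(
            r"(^|;)ID=",
            lambda m: f"{m.group(1)}ID={genome_name}{sep}",
            fields[8],
            count=1,
        )
        out.append("\t".join(fields))
    return out
```

With genome_name="ATCC_51524", ID=1 becomes ID=ATCC_51524_1, which matches the new_id column of the cluster file.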
A follow-up question for @meren once we figure out how to generate the right .gff files. PPanGGOLiN uses a graphical model and a statistical method to partition gene families in persistent, shell and cloud genomes. My next question would be how to import these partitions back as bins in Anvio.
PPanGGOLiN output looks like this: matrix.csv.txt
The Gene column has the new_id from the .tsv file before and the Non-unique Gene name column has the partition info (is this correct @axbazin?)
Hey @IsabelFE, can you please git pull and see if the GFF outputs look more useful?
I did pip install --upgrade git+git://github.com/merenlab/anvio.git as before but it didn't update.
Anyway, I saw you edited the dbops.py file, so I got the new version from GitHub, and the new gffs look like this:
c_000000000001 . CDS 1 1338 . + . ID=Dpig_Prokka___hash2785db7a___1
c_000000000001 . CDS 1569 2705 . + . ID=Dpig_Prokka___hash2785db7a___2
c_000000000001 . CDS 2921 3157 . + . ID=Dpig_Prokka___hash2785db7a___3
c_000000000001 . CDS 3160 4305 . + . ID=Dpig_Prokka___hash2785db7a___4
c_000000000001 . CDS 4308 6254 . + . ID=Dpig_Prokka___hash2785db7a___5
c_000000000001 . CDS 6349 9012 . + . ID=Dpig_Prokka___hash2785db7a___6
c_000000000001 . CDS 9159 9896 . + . ID=Dpig_Prokka___hash2785db7a___7
c_000000000001 . CDS 9992 11287 . - . ID=Dpig_Prokka___hash2785db7a___8
c_000000000001 . CDS 11549 12886 . - . ID=Dpig_Prokka___hash2785db7a___9
c_000000000001 . CDS 13205 13504 . + . ID=Dpig_Prokka___hash2785db7a___10
The problem is that the Dpig_Prokka___hash2785db7a___1 format doesn't match the anvi-summarize output that I will need to use to create the tsv file. Could a column with these richer IDs be added to the anvi-summarize output?
The _ separator is perfectly fine, I've been using the same thing.
Yes @IsabelFE this is correct. For the matrix.csv file, the Gene field does store the cluster ID, and the Non-unique Gene name field stores the predicted partition.
@axbazin, OK that's true, the matrix file has one row per gene cluster, not per individual gene. So the Gene field does store the cluster ID, not the unique gene ID as I indicated.
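Given that reading of the matrix file, the partition of each gene cluster can be pulled out with a few lines. A sketch only, assuming a comma-separated file with a header row containing the Gene and Non-unique Gene name columns discussed above; partitions_by_cluster is a hypothetical name.

```python
import csv


def partitions_by_cluster(matrix_lines):
    """Map each gene cluster ID to its predicted partition.

    matrix_lines is an iterable of CSV lines with a header. Per the
    discussion above, 'Gene' holds the cluster ID and
    'Non-unique Gene name' holds the partition.
    """
    reader = csv.DictReader(matrix_lines)
    return {row["Gene"]: row["Non-unique Gene name"] for row in reader}
```

The resulting dictionary has one entry per gene cluster, so it can be written out as a two-column cluster/partition table for whatever import step comes next.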
Yes, sorry. The naming is quite impractical as it is, but we chose it to match roary's file formatting, so that the file is compatible with other pangenomic tools that take roary's file format as input. I should probably improve our own wiki about this file to make it clearer.
If you are looking for partition information for each gene separately, the projection files ( ppanggolin write -p pangenome.h5 --projection ) are one file per genome with some information for each gene, including its family partition (and the format is actually detailed in the wiki).
I can see that this requires a bit more work, @IsabelFE.
The gene cluster summary output anvi'o generates is quite comprehensive, but it is missing gene start/stop positions. It is because when a genomes storage database is generated from a set of contigs databases, that information is discarded.
If I change this:
Dpig_Prokka___hash2785db7a___10
to this:
Dpig_Prokka___10
You would have a temporary solution to link everything together since gene clusters file does contain the genome name and gene caller id for each gene cluster. Would you like me to do that?
The real solution is to re-implement anvi'o's genomes storage database, which is something we have been meaning to do, but we haven't allocated any time for it.
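As a stopgap on the user side, the same change can be applied to already exported IDs. A small sketch, assuming the hash part always looks like "hash" followed by hexadecimal characters between two "___" separators, as in the examples in this thread; drop_hash is a made-up name.

```python
import re

# '___hash2785db7a___' -> '___'  (assumption: the hash is hexadecimal)
HASH_PART = re.compile(r"___hash[0-9a-f]+___")


def drop_hash(gene_id):
    """Remove the contigs-db hash from an exported gene ID, if present."""
    return HASH_PART.sub("___", gene_id, count=1)
```

IDs without a hash part are returned unchanged, so the function is safe to run on files exported before or after the change.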
@meren, I see the problem.
If you change to Dpig_Prokka___10 then we are back to the original problem: genes will have a unique ID in each genome, but not across all genomes in a pangenome (Dpig_Prokka was the pangenome project ID in this case). Is it possible to have something like Dpig_Prokka__GenomeName_10 in the gff files? No need to change anything in the gene cluster summary output, I will just concatenate the genome_name and gene_callers_id columns of the anvi-summarize output.
As I understand it, hash2785db7a is just a random ID you are assigning to each genome, but the real genome_name could be used instead.
@IsabelFE,
I actually was generating the GFF output files using this command:
anvi-get-sequences-for-gene-calls -c CONTIGS.db -o gene_calls.gff --export-gff3
So it is supposed to be run on every genome (contigs db) separately. In that case the Dpig_Prokka part will match the genome_name column in the summary output. The hash is a unique identifier of the contigs database, but it is not listed in the gene cluster summary output.
I am running it on every genome (contigs db) separately using a loop:
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_o="analysis_PPanGGOLiN/Anvio7_Exported_Anno"
for file in "$path_f"/*.db; do
    # Strip the directory and the .db extension to get the genome name
    FILENAME=$(basename "${file%.*}")
    anvi-get-sequences-for-gene-calls -c "$file" --export-gff3 \
        -o "$path_o/Anvio7_$FILENAME.gff"
done
I think that the problem is in the loop I used to create the contigs.db files:
path_f="analysis_Anvio7/Reformat_Dpig"
path_o="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_e="analysis_Anvio7/Parsed_prokka_Dpig"
for file in "$path_f"/*.fa; do
    FILENAME=$(basename "${file%.*}")
    anvi-gen-contigs-database -f "$file" \
        -o "$path_o/$FILENAME.db" \
        --external-gene-calls "$path_e/calls_prokka_$FILENAME.txt" \
        --ignore-internal-stop-codons \
        -n Dpig_Prokka
done
I am using -n Dpig_Prokka, instead of -n $FILENAME, so all my genomes got named the same. 🤦🏻♀️
I think that I am going to need to run the whole analysis again...
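For the rerun, the point is that the value passed to -n has to come from each FASTA file name. A sketch of that in Python, building the same command as the loop above without running it; the function name contigs_db_command and its argument names are hypothetical, and only flags already used in this thread appear.

```python
from pathlib import Path


def contigs_db_command(fasta, out_dir, calls_dir):
    """Build the anvi-gen-contigs-database command for one genome.

    The project name (-n) is taken from the FASTA file name, so every
    contigs database gets its own genome name.
    """
    name = Path(fasta).stem  # e.g. 'KPL1914' for '.../KPL1914.fa'
    return [
        "anvi-gen-contigs-database",
        "-f", str(fasta),
        "-o", str(Path(out_dir) / f"{name}.db"),
        "--external-gene-calls", str(Path(calls_dir) / f"calls_prokka_{name}.txt"),
        "--ignore-internal-stop-codons",
        "-n", name,
    ]
```

Each returned list can be handed to subprocess.run(cmd, check=True), or printed first to confirm that the name after -n differs per genome.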
I ran the whole pipeline again and have the new contig databases with the correct names. Now my gffs look like this:
c_000000000001 . CDS 3654 4955 . - . ID=ATCC_51524___hash349ce8f2___1
c_000000000001 . CDS 5083 6612 . - . ID=ATCC_51524___hash349ce8f2___2
c_000000000001 . CDS 6612 7370 . - . ID=ATCC_51524___hash349ce8f2___3
c_000000000001 . CDS 7460 8644 . - . ID=ATCC_51524___hash349ce8f2___4
@meren could you please make anvi-get-sequences-for-gene-calls not add the hash349ce8f2 part, as you previously suggested?
Thanks for all the help!!
Done :)
Thanks @meren!!
Now my gffs look like this:
c_000000000001 . CDS 3654 4955 . - . ID=ATCC_51524___1
c_000000000001 . CDS 5083 6612 . - . ID=ATCC_51524___2
c_000000000001 . CDS 6612 7370 . - . ID=ATCC_51524___3
c_000000000001 . CDS 7460 8644 . - . ID=ATCC_51524___4
c_000000000001 . CDS 8772 9767 . - . ID=ATCC_51524___5
And my tsv file for ppanggolin cluster has compatible IDs:
GC_00000001 ATCC_51524___900
GC_00000001 KPL1914___1178
GC_00000001 KPL1914___475
GC_00000001 KPL1922_CDC39_95___1080
GC_00000001 KPL1922_CDC39_95___135
However, I still got an error from PPAnGGOLiN:
(PPanGGOLiN) isabelfe@x86_64-apple-darwin13 Dpig_Pangenomics % ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-02-26 10:47:31 main.py:l180 INFO Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-02-26 10:47:31 main.py:l181 INFO PPanGGOLiN version: 1.1.132
2021-02-26 10:47:31 readBinaries.py:l37 INFO Getting the current pangenome's status
2021-02-26 10:47:31 readBinaries.py:l294 INFO Reading pangenome annotations...
100%|█████████████████████████████████████████████████████████████████████████████████████| 49587/49587 [00:00<00:00, 201335.07gene/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 18.40organism/s]
2021-02-26 10:47:33 cluster.py:l233 INFO Reading analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv the gene families file ...
100%|███████████████████████████████████████████████████████████████████████████████| 1478457/1478457 [00:00<00:00, 2474952.77bytes/s]
Traceback (most recent call last):
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
sys.exit(main())
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 185, in main
ppanggolin.cluster.launch(args)
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 288, in launch
readClustering(pangenome, args.clusters, args.infer_singletons, args.force, show_bar=args.show_prog_bars)
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 272, in readClustering
raise Exception(f"Some genes ({nbGeneWtFam}) did not have an associated cluster. Either change your cluster file so that each gene has a cluster, or use the --infer_singletons option to infer a cluster for each non-clustered gene.")
Exception: Some genes (49412) did not have an associated cluster. Either change your cluster file so that each gene has a cluster, or use the --infer_singletons option to infer a cluster for each non-clustered gene.
@axbazin, it looks like it is not finding the clusters in the tsv file for any of the genes (49412). Is this because Anvi'o names every cluster with a unique ID (like GC_00000001), while PPanGGOLiN uses the gene ID of one of the members of the cluster to name the whole cluster (like ATCC_51524___900)?
This is my code and files for the PPanGGOLiN analysis
ppanggolin annotate --anno analysis_PPanGGOLiN_Anvio7/Anvio7GenomesExported.gff.txt --fasta analysis_PPanGGOLiN_Anvio7/Anvio7GenomesReformatted.fa.txt -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7 --basename FromAnvio7
ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
The fact that PPanGGOLiN uses gene IDs by default should not create problems. We do this to "have a unique ID" but anything could be used in theory.
This is a little puzzling, I will look at it more closely as soon as I have access to a linux machine.
Hello,
It turned out that there was a slight problem in the log here. The number reported is the number of genes that were associated to a family, and not the opposite. I've fixed that in the master branch. Sorry for the inconvenience.
There are only 175 genes without families, I've listed them in genes_without_families.txt
Adelme
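A list like this can also be produced on the user side by comparing the gene IDs in the gffs with the second column of the cluster file. A minimal sketch under the assumption that the cluster file is tab-separated (cluster ID, then gene ID); genes_without_cluster is a hypothetical name and this is not how PPanGGOLiN computes it.

```python
def genes_without_cluster(gff_gene_ids, cluster_lines):
    """Return the sorted gene IDs that never appear in the cluster file.

    gff_gene_ids: iterable of gene IDs taken from the gff files.
    cluster_lines: iterable of 'cluster_id<TAB>gene_id' lines.
    """
    clustered = set()
    for ln in cluster_lines:
        parts = ln.rstrip("\n").split("\t")
        if len(parts) >= 2:
            clustered.add(parts[1])
    return sorted(set(gff_gene_ids) - clustered)
```

Genes reported here either need a line in the cluster file or can be handled with the --infer_singletons option mentioned in the error message.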
Hello,
I updated to the new version and ran it again:
(PPanGGOLiN) isabelfe@x86_64-apple-darwin13 Dpig_Pangenomics % ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-03-01 08:53:53 main.py:l180 INFO Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-03-01 08:53:53 main.py:l181 INFO PPanGGOLiN version: 1.1.140
2021-03-01 08:53:53 readBinaries.py:l37 INFO Getting the current pangenome's status
2021-03-01 08:53:53 readBinaries.py:l294 INFO Reading pangenome annotations...
100%|███████████████████████████████| 49587/49587 [00:00<00:00, 111534.44gene/s]
100%|█████████████████████████████████████| 28/28 [00:01<00:00, 15.73organism/s]
2021-03-01 08:53:56 cluster.py:l235 INFO Reading analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv the gene families file ...
0%| | 0/1478457 [00:00<?, ?bytes/s]Traceback (most recent call last):
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 257, in readClustering
nbGeneWithFam+=1
UnboundLocalError: local variable 'nbGeneWithFam' referenced before assignment
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
sys.exit(main())
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 185, in main
ppanggolin.cluster.launch(args)
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 290, in launch
readClustering(pangenome, args.clusters, args.infer_singletons, args.force, show_bar=args.show_prog_bars)
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 264, in readClustering
raise Exception(f"line {lineCounter} of the file '{families_tsv_file.name}' raised an error.")
Exception: line 1 of the file 'analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv' raised an error.
0%| | 29/1478457 [00:00<3:15:44, 125.89bytes/s]
That was a bad fix. It should be better now.
Thanks @axbazin It worked!! 🥳
I have the whole PPanGGOLiN pipeline working. I was able to write all the outputs and also run the RGPs without any problem.
There is just one last thing that could make the outputs even more useful. In the exported gffs from Anvio, the functional annotations are not present, therefore all the outputs in PPanGGOLiN have only the gene IDs. It would be much better to be able to explore the partition graph and the spots plots with the functional annotation on them.
For example, this is one of the hotspots using the default PPanGGOLiN with Prokka annotated gffs:
And this is a hotspot now, only with the gene IDs, but no annotations.
For exploration purposes having the annotations will be great. @meren is there any way to export the gffs with functional annotations on them? This is not 100% needed, but it would be cool.
A big thanks to both of you, @axbazin and @meren, both PPanGGOLiN and Anvi'o have been really useful for our research and now going from one output to the other is much easier since the gene IDs and cluster IDs are the same in both of them. 😊
I'm so glad to see you're making progress, @IsabelFE. Thank you, also, @axbazin.
It is doable to export gene functional annotations in GFF files, @IsabelFE, and I can take a look at that. Would you mind submitting an issue for this to the anvi'o GitHub, and mentioning the URL of this issue in it so there is context?
Thank you,
@meren, I submitted a new issue to Anvi'o GitHub. @axbazin, I think that you can close this issue if you want since everything works now to import from Anvi'o into PPanGGOLiN.
Thanks again to both of you.
Thank you for your very clear and detailed bug reports and answers !
@axbazin, I think we closed this issue too soon... I found some issues with the annotations.
When I look at the spot plots or at the matrix file output, genes have an ID like this: KPL3070_CDS_0757, but this ID does not correspond to the ID in the .gff files:
c_000000000001 . CDS 789846 792719 . + . ID=KPL3070___757;Name=COG0860;db_xref=COG20:COG0860;product=N-acetylmuramoyl-L-alanine amidase (AmiC) (PDB:1JWQ)
c_000000000001 . CDS 792855 793406 . + . ID=KPL3070___758
c_000000000001 . CDS 793384 794352 . + . ID=KPL3070___759;Name=COG2264;db_xref=COG20:COG2264;product=Ribosomal protein L11 methylase PrmA (PrmA) (PDB:3GRZ)
For example, KPL3070_CDS_0757, which based on the spot plot corresponds to COG2264, is not equal to KPL3070___757; it is in fact KPL3070___759 (which, as you can see, is COG2264). It looks like instead of using the information from the Anvio .gff file PPanGGOLiN is doing the CDS search again. This goes back to my initial post in this issue:
The problem is that the gene_callers_id provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing new gene_callers_id were generated only for the ORFs that were imported into Anvio.
It looks like PPanGGOLiN is generating again new IDs for each gene (the ones in the KPL3070_CDS_0757 format), instead of using the ones from the gff. Therefore, on the PPanGGOLiN output, the gene clusters IDs match the ones from Anvio, but if you look at the individual gene IDs, the numbers are off.
Thanks again!
The funny thing is that the info on the gephi table has the Anvio IDs, but in the matrix file you get the other type of IDs that don't have the same gene_callers_id. gephi_table.xlsx matrix.xlsx
I realized this because I am trying to draw a subgraph for some hotspots using ppanggolin align and I need to provide the sequence of the flanking proteins. When searching for those sequences in the Anvio output I was searching for KPL3070_CDS_0757, that is genome_name=="KPL3070" & gene_callers_id=="757", but soon I realized that what I need is in fact genome_name=="KPL3070" & gene_callers_id=="759".
Would it be possible to get the KPL3070___757 annotations on the spot plots? That way, when I find a spot of interest I can find the corresponding protein sequences.
Best,
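Going from an ID in the KPL3070___759 format back to the genome_name and gene_callers_id used in the Anvio summary is a matter of splitting on the last "___". A small sketch; split_gene_id is a hypothetical helper, and splitting from the right is an assumption made so that genome names containing single underscores (like KPL1922_CDC39_95) stay intact.

```python
def split_gene_id(gene_id, sep="___"):
    """Split 'GENOME___N' into (genome_name, gene_callers_id).

    Raises ValueError for IDs that are not in that format, such as the
    internal 'KPL3070_CDS_0757' style.
    """
    genome_name, _, gene_callers_id = gene_id.rpartition(sep)
    if not genome_name or not gene_callers_id.isdigit():
        raise ValueError(f"not a genome{sep}gene_callers_id ID: {gene_id!r}")
    return genome_name, int(gene_callers_id)
```

The ValueError makes it obvious when an internal PPanGGOLiN ID has slipped in instead of an ID from the gff files.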
In the case of provided annotation, PPanGGOLiN uses the ids in the genome files but also creates internal gene IDs (just in case there are genes with identical ids in different genomes AND PPanGGOLiN is the one performing the clustering afterwards). What I am guessing is happening here is that those internal IDs are reported instead of the user's.
I will look at this ASAP. It is definitely possible to get the right labels on the plot. I notice also that there are graphical issues with margins when IDs are quite long, I'll see if I can do something about this too.
Thanks for looking into this and reopening the issue. I see the point with the genome files IDs vs the PPanGGOLiN internal gene IDs.
I also found another issue with the current annotations on the spot plots. For some genes, only the COG ID is plotted, but a COG ID can be assigned to more than one GC, so ideally having both the KPL3070___757 and the COG ID would be perfect. But I understand that from a graphical point of view this is too long. Therefore, having only the gene ID as KPL3070___757 (not the internal IDs) is the most useful. I could easily search for the ID either in gephi_table.xlsx or in the anvio summary table and find the COG ID or the other annotation info.
Thanks!
@axbazin while you fix this I moved ahead and manually found the flanking proteins for my spot of interest and ran ppanggolin align. I got this error:
ppanggolin align -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --getinfo --draw_related --proteins analysis_PPanGGOLiN/Spots/Spot10.fasta -o analysis_PPanGGOLiN/OutputDefaultAnno/Spots/Spot10_output
2021-03-16 14:23:03 main.py:l180 INFO Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin align -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --getinfo --draw_related --proteins analysis_PPanGGOLiN/Spots/Spot10.fasta -o analysis_PPanGGOLiN/OutputDefaultAnno/Spots/Spot10_output
2021-03-16 14:23:03 main.py:l181 INFO PPanGGOLiN version: 1.1.141
2021-03-16 14:23:03 readBinaries.py:l37 INFO Getting the current pangenome's status
Traceback (most recent call last):
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
sys.exit(main())
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 205, in main
ppanggolin.align.launch(args)
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/align/alignOnPang.py", line 343, in launch
align(pangenome = pangenome, proteinFile = args.proteins, output = args.output, tmpdir = args.tmpdir, identity = args.identity, coverage =args.coverage, defrag =args.defrag,cpu= args.cpu, getinfo =args.getinfo, draw_related = args.draw_related )
File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/align/alignOnPang.py", line 309, in align
raise Exception("Cannot use this function as your pangenome does not have gene families representatives associated to it. For now this works only if the clustering is realised by PPanGGOLiN.")
Exception: Cannot use this function as your pangenome does not have gene families representatives associated to it. For now this works only if the clustering is realised by PPanGGOLiN.
Do you have plans to add this functionality for externally clustered gene families? 🙏🏻
For the last thing that you reported, it is probably possible, though that might require some work. I'll see if I can do something about this in the near future. It would indeed be quite practical.
Hello,
In the latest commits I added a new option so that you can choose what is used in priority as the label in the figures. This will put gene IDs as labels:
ppanggolin spot -p pangenome.h5 --label_priority ID --draw_hotspots --output figures
If there is a label indicated in the case of provided annotations like yours, it should always be reported in priority in the figures. This should solve your first issue.
As another example, this command will display the behavior of showing gene names if they exist, or gene ID if there are no gene names:
ppanggolin spot -p pangenome.h5 --label_priority name,ID --draw_hotspots --output figures
Also, to maybe make things a bit easier, I've updated the '--interest' option to add elements of interest to the figure file names, if you want to see all spots with a given annotation, for example:
ppanggolin spot -p pangenome.h5 --label_priority ID --draw_hotspots --output figures --interest COG2264
should indicate which spots have genes with the annotation 'COG2264', either in the spot or as flanking persistent genes. This should work with family id, gene id, gene names and strings that can be found in the 'product' field of the annotation files.
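The fallback behaviour described for --label_priority name,ID can be pictured as "take the first field in the list that has a value". The sketch below only illustrates that idea; it is not PPanGGOLiN's code, and both pick_label and the dictionary layout of a gene are assumptions made for the example.

```python
def pick_label(gene, priority=("name", "ID")):
    """Return the first non-empty field of gene, in priority order.

    gene is a plain dict such as {"ID": "KPL3070___759", "name": "COG2264"}
    (hypothetical layout). Returns None if no field in priority has a value.
    """
    for key in priority:
        value = gene.get(key)
        if value:
            return value
    return None


# A priority string like the one given on the command line can be split first:
priority_from_cli = tuple("name,ID".split(","))
```

With priority ("name", "ID") a gene without a name falls back to its ID, while priority ("ID",) always yields the ID.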
We're looking to update this spot-drawing feature in the future, as we're not entirely happy with how it renders currently, so it might evolve.
Adelme
Hi Adelme,
This is really cool! It is great to be able to explore the plots with the name, family, or ID options. It makes going back and forth with Anvio much easier. I love the --interest option, it is a really neat feature! I am already playing with it.
Thanks,
Isabel
Hi, I'll close this issue since problems have more or less been resolved. Adelme
Hello @labgem and @merenlab (@meren),
I've been using both PPAnGGOLiN and Anvio pangenomic pipelines and really loving both tools. I've used both tools independently with Prokka annotated .gff files without a problem. But now I want to import Anvio clusters into PPAnGGOLiN instead of using the default MMseqs2 clustering. The reason for this is that being able to visualize the same gene clusters with both methods would be really useful for our research. It is difficult to make sense of the data with 2 independent clustering methods since it is like comparing apples and oranges. We are able to make really cool observations with each method but we can't compare with the other one.
In order to import the Anvio clustering into PPanGGOLiN I need:
1. A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files. Using anvi-summarize I got the info needed to generate the .tsv file.
2. The annotated genomes with gene IDs that match the ones listed in the previous .tsv file. The problem is that the gene_callers_id provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka also annotates tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing, new gene_callers_id were generated only for the ORFs that were imported into Anvio. I found out that anvi-get-sequences-for-gene-calls can be used to export new .fasta and .gff files with only the ORFs that match the gene IDs in the .tsv file. But I think that there is an issue with the formatting of these .gff files not being compatible with the .gff files PPanGGOLiN expects.
I tried running:
ppanggolin annotate --anno Anvio7GenomesAnno.txt --fasta Anvio7GenomesFasta.txt
And I got an error that I am not sure if it is related to PPanGGOLiN or to the format of the .gff/.fasta files obtained from Anvio.
I hope someone from one of your teams can help me with this. It would be really cool to have both tools working on the same set of gene clusters.
Thanks a lot,
Isabel