Importing Anvio Pangenomic gene clusters into PPAnGGOLiN

IsabelFE commented 3 years ago

Hello @labgem and @merenlab (@meren),

I've been using both PPAnGGOLiN and Anvio pangenomic pipelines and really loving both tools. I've used both tools independently with Prokka annotated .gff files without a problem. But now I want to import Anvio clusters into PPAnGGOLiN instead of using the default MMseqs2 clustering. The reason for this is that being able to visualize the same gene clusters with both methods would be really useful for our research. It is difficult to make sense of the data with 2 independent clustering methods since it is like comparing apples and oranges. We are able to make really cool observations with each method but we can't compare with the other one.

In order to import the Anvio clustering into PPanGGOLiN I need:

A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files. Using anvi-summarize I got the info needed to generate the .tsv file.
The annotated genomes with gene IDs that match the ones listed in the previous .tsv file. The problem is that the gene_callers_id provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing new gene_callers_id were generated only for the ORFs that were imported into Anvio. I found out that anvi-get-sequences-for-gene-calls can be used to export new .fasta and .gff files with only the ORFs that match the gene IDs on the .tsv file. But I think that there is an issue with the formating of these .gff files not being compatible with the expected .gff files on PPanGGOLiN

I tried running: ppanggolin annotate --anno Anvio7GenomesAnno.txt --fasta Anvio7GenomesFasta.txt

And I got an error that I am not sure if it is related to PPanGGOLiN or to the format of the .gff/.fasta files obtained from Anvio.

2021-02-19 13:37:26 main.py:l180 INFO   Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin annotate --anno analysis_PPanGGOLiN/Anvio7GenomesAnno.txt --fasta analysis_PPanGGOLiN/Anvio7GenomesFasta.txt -o analysis_PPanGGOLiN/OutputFromAnvio7 --basename FromAnvio7
2021-02-19 13:37:26 main.py:l181 INFO   PPanGGOLiN version: 1.1.131
2021-02-19 13:37:26 annotate.py:l337 INFO   Reading analysis_PPanGGOLiN/Anvio7GenomesAnno.txt the list of organism files ...
  0%|                                                                                                                                 | 0/28 [00:00<?, ?file/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 325, in readAnnoFile
    return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 277, in read_org_gff
    if contig.name != gff_fields[GFF_seqname]:
UnboundLocalError: local variable 'contig' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 319, in launchReadAnno
    return readAnnoFile(*args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 327, in readAnnoFile
    raise Exception(f"Reading the gff3 file '{filename}' raised an error.")
Exception: Reading the gff3 file 'analysis_PPanGGOLiN/Anvio7_Exported_Genomes/Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff' raised an error.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 183, in main
    ppanggolin.annotate.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 428, in launch
    readAnnotations(pangenome, args.anno, cpu = args.cpu, pseudo = args.use_pseudo, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 349, in readAnnotations
    for org, flag in p.imap_unordered(launchReadAnno, args):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/multiprocessing/pool.py", line 761, in next
    raise value
Exception: Reading the gff3 file 'analysis_PPanGGOLiN/Anvio7_Exported_Genomes/Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff' raised an error.
  0%|

I hope someone from one of your teams can help me with this. Really it would be really cool to have both tools on the same set of gene clusters.

Thanks a lot,

Isabel

IsabelFE commented 3 years ago

I am attaching the files I am using:

The .tsv file created from the anvi-summarize info: Clusters_Dpig_Anvio7.tsv.txt

The list files for the .fasta and .gff files: Anvio7GenomesAnno.txt Anvio7GenomesFasta.txt

The .fasta and .gff files exported from Anvio: Anvio7_Exported_Genomes.zip

axbazin commented 3 years ago

Hello, Your error "UnboundLocalError: local variable 'contig' referenced before assignment" is quite funnily related to an issue that was fixed today (the fact of not having '##sequence-region' at the beginning of the gff), so you'll need to install the github version of ppanggolin to have it work. However, the genomes in the first gff that I opened "Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff" is extremely strange. All genes start on position 1 even though they are on the same dna molecule 'c_000000000001'. This does not sound right to me. Unless I am missreading something, there is probably a problem there.

Adelme

IsabelFE commented 3 years ago

Hi,

Yeah, I saw the issue with not having the '##sequence-region'. I will install the new version. I also thought that the .gff exported from Anvio look weird, it is like they have the right end positions, but they all start on position 1. @meren is this an issue with Anvio not exporting the gff properly?

Thanks @axbazin

meren commented 3 years ago

Dear @axbazin and @IsabelFE,

Apologies for the late response. I finally had a chance to take look at the export-gff function anvi-get-sequences-for-gene-calls is using and found a bug we recently introduced. I now have fixed the issue via https://github.com/merenlab/anvio/commit/a9f73284224f12024247c8b46f5f268ed929733b in our main branch, which means, if you are tracking the main branch, your new GFF files should look much better, @IsabelFE.

Thank you very much for your patience and investigating how to make these two tools talk. Please let me know if you run into other problems and I will be happy to work on them on our end. Using your experience and expertise, we can even implement an anvi'o script to export everything necessary from an anvi'o pangenome to be able to continue working with it in PPAnGGOLiN.

Best wishes,

axbazin commented 3 years ago

Hello,

Watching more carefully the files, there might be one last trouble. To be able to differentiate the genes within the cluster file, genes must have a unique identifier throughout all of the gff files and in the cluster file. As it stands it seems that Anvi'o gives incremental gene identifiers from 1 to n in each genome, which will be problematic in differentiating genes in between the different genomes.

A possible solution would be to give unique IDs to each gene and generate the cluster file with those, but I am not very familiar with Anvi'o so I do not know how doable it is.

Of course an ideal solution for this time and the next ones would most likely to have a script such as @meren suggested, that generates the gff and the tsv files with unique identifiers.

Adelme

meren commented 3 years ago

We certainly can fix that :) If I have access to some 'ideal input files' from @IsabelFE, @merenlab can use that as a model to develop a tool.

IsabelFE commented 3 years ago

Dear @axbazin and @meren,

Thanks a lot for helping solve the compatibility between your tools. I updated Anvio to the main GitHub branch version and the issue with the gffs is solved. As @axbazin indicated I got an error about the gene IDs.

From the anvi-summarize output I generated the .tsv file on R like this:

Dpig_Anvio7 <- read_delim("analysis_Anvio7/Pangenomic_Results_Dpig/Dpig-PAN-SUMMARY/PAN_DPIG_prokka_gene_clusters_summary.txt.gz", "\t")
Dpig_Anvio7 <- Dpig_Anvio7 %>% 
  unite(new_id, genome_name:gene_callers_id, sep = "_", remove = FALSE)
Clusters_Dpig_Anvio7 <- select(Dpig_Anvio7, c(gene_cluster_id, new_id))

The output looks like this Clusters_Dpig_Anvio7.tsv.txt:

GC_00000001 ATCC_51524_900
GC_00000001 KPL1914_1178
GC_00000001 KPL1914_475
GC_00000001 KPL1922_CDC39_95_1080
GC_00000001 KPL1922_CDC39_95_135

This is, the first column is the gene_cluster_id and the second is new_id (concatenated genome_name + gene_callersid, separated by `) (Should I have separated them with something different from_`??)

@meren, is there a way to export the .gffs using this new_id format instead of the gene_callers_id?

Like:

c_000000000001  .   CDS 1   1338    .   +   .   ID=ATCC_51524_1
c_000000000001  .   CDS 1569    2705    .   +   .   ID=ATCC_51524_2
c_000000000001  .   CDS 2921    3157    .   +   .   ID=ATCC_51524_3
c_000000000001  .   CDS 3160    4305    .   +   .   ID=ATCC_51524_4
c_000000000001  .   CDS 4308    6254    .   +   .   ID=ATCC_51524_5
c_000000000001  .   CDS 6349    9012    .   +   .   ID=ATCC_51524_6
c_000000000001  .   CDS 9159    9896    .   +   .   ID=ATCC_51524_7

Instead of:

c_000000000001  .   CDS 1   1338    .   +   .   ID=1
c_000000000001  .   CDS 1569    2705    .   +   .   ID=2
c_000000000001  .   CDS 2921    3157    .   +   .   ID=3
c_000000000001  .   CDS 3160    4305    .   +   .   ID=4
c_000000000001  .   CDS 4308    6254    .   +   .   ID=5
c_000000000001  .   CDS 6349    9012    .   +   .   ID=6
c_000000000001  .   CDS 9159    9896    .   +   .   ID=7

IsabelFE commented 3 years ago

A follow up question for @meren once we figure out how to generate the rigth gff. files. PPanGGOLiN uses a graphical model and a statistical method to partition gene families in persistent, shell and cloud genomes. My next question would be how to import these partitions back as bins in Anvio. PPanGGOLiN output looks like this: matrix.csv.txt Gene column has the new_id from the .tsv file before and Non-unique Gene name has the partition info (is this correct @axbazin?)

meren commented 3 years ago

Hey @IsabelFE, can you please git pull and see if the GFF outputs look more useful?

IsabelFE commented 3 years ago

I did pip install --upgrade git+git://github.com/merenlab/anvio.git as before but it didn't update. Anyway, I saw you edited the dbops.py file, so got the new version from GitHub, and the new gffs look like this"

c_000000000001  .   CDS 1   1338    .   +   .   ID=Dpig_Prokka___hash2785db7a___1
c_000000000001  .   CDS 1569    2705    .   +   .   ID=Dpig_Prokka___hash2785db7a___2
c_000000000001  .   CDS 2921    3157    .   +   .   ID=Dpig_Prokka___hash2785db7a___3
c_000000000001  .   CDS 3160    4305    .   +   .   ID=Dpig_Prokka___hash2785db7a___4
c_000000000001  .   CDS 4308    6254    .   +   .   ID=Dpig_Prokka___hash2785db7a___5
c_000000000001  .   CDS 6349    9012    .   +   .   ID=Dpig_Prokka___hash2785db7a___6
c_000000000001  .   CDS 9159    9896    .   +   .   ID=Dpig_Prokka___hash2785db7a___7
c_000000000001  .   CDS 9992    11287   .   -   .   ID=Dpig_Prokka___hash2785db7a___8
c_000000000001  .   CDS 11549   12886   .   -   .   ID=Dpig_Prokka___hash2785db7a___9
c_000000000001  .   CDS 13205   13504   .   +   .   ID=Dpig_Prokka___hash2785db7a___10

The problem is that the Dpig_Prokkahash2785db7a1 format doesnt match on the anvi-summarize output that I will need to use to create the tsv file. Could a column with this richer IDs be added to the anvi-summarize output?

axbazin commented 3 years ago

The _ separator is perfectly fine, I've been using the same thing. Yes @IsabelFE this is correct. For the matrix.csv file, the Gene field does store the cluster ID, and the Non-unique Gene name stores the predicted partition.

IsabelFE commented 3 years ago

The _ separator is perfectly fine, I've been using the same thing. Yes @IsabelFE this is correct. For the matrix.csv file, the Gene field does store the cluster ID, and the Non-unique Gene name stores the predicted partition.

@axbazin, OK that's true, the matrix file has one row per gene cluster, not per individual gene. So the Gene field does store the cluster ID, not the unique gene ID as I indicated.

axbazin commented 3 years ago

Yes, sorry. The naming is quite impactical as it is, but we chose to have it as such to match roary's file formating to be compatible with other pangenomic tools that take roary's file format as input. I should probably improve our own wiki about this file to make it clearer.

If you are looking for a partition information for each gene separately, the projection files ( ppanggolin write -p pangenome.h5 --projection ) are one file per genome with some of information for each gene, including its family partition. (and its format is actually detailed in the wiki )

meren commented 3 years ago

I can see that this requires a bit more work, @IsabelFE.

The gene cluster summary output anvi'o generates is quite comprehensive, but it is missing gene start/stop positions. It is because when a genomes storage database is generated from a set of contigs databases, that information is discarded.

If I change this:

Dpig_Prokka___hash2785db7a___10

to this:

Dpig_Prokka___10

You would have a temporary solution to link everything together since gene clusters file does contain the genome name and gene caller id for each gene cluster. Would you like me to do that?

The real solution is to re-implement the anvi'o's genome storage database, which is something we have been meaning to do but we haven't allocated any time for it.

IsabelFE commented 3 years ago

@meren, I see the problem.

If you change to Dpig_Prokka___10 then is back to the original problem, genes will have a unique ID in each genome, but not across all genomes on a pangenome (Dpig_Prokka was the pangenome project ID in this case). Is it possible to have something like Dpig_Prokka__GenomeName_10 on the gff files? No need to change anything on the gene cluster summary output, I will just concatenate the columns with the genome_name and gene_callers_id on the anvi-summarize output.

As I understand hash2785db7a is just a random ID you are assigning to each genome, but the real genome_name could be used instead.

meren commented 3 years ago

@IsabelFE,

I actually was generating the GFF output files using this command:

anvi-get-sequences-for-gene-calls -c CONTIGS.db -o gene_calls.gff --export-gff

So it is supposed to be run on every genome (contigs db) separately. In that case Dpig_Prokka part will be matching to the genome_name column in the summary output. The hash is a unique identifier of the contigs database, but it is not listed in the gene cluster summary output.

IsabelFE commented 3 years ago

I am running it on every genome (contigs db) separately using a loop:

path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_o="analysis_PPanGGOLiN/Anvio7_Exported_Anno"

for file in $path_f/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-get-sequences-for-gene-calls -c $file --export-gff3 \
                                      -o $path_o/Anvio7_$FILENAME.gff

done

I think that the problem is on the loop I used to create the contigs.db:

path_f="analysis_Anvio7/Reformat_Dpig"
path_o="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_e="analysis_Anvio7/Parsed_prokka_Dpig"

for file in $path_f/*.fa; do
    FILENAME=`basename ${file%.*}`
    anvi-gen-contigs-database -f $file \
                              -o $path_o/$FILENAME.db \
                              --external-gene-calls $path_e/calls_prokka_$FILENAME.txt \
                              --ignore-internal-stop-codons \
                              -n Dpig_Prokka;
done

I am using -n Dpig_Prokka, instead of n $FILENAME, so all my genomes got named the same. 🤦🏻‍♀️

I think that I am going to need to run the whole analysis again...

IsabelFE commented 3 years ago

I ran the whole pipeline again and have the new contig databases with the correct names. Now my gffs look like this:

c_000000000001  .   CDS 3654    4955    .   -   .   ID=ATCC_51524___hash349ce8f2___1
c_000000000001  .   CDS 5083    6612    .   -   .   ID=ATCC_51524___hash349ce8f2___2
c_000000000001  .   CDS 6612    7370    .   -   .   ID=ATCC_51524___hash349ce8f2___3
c_000000000001  .   CDS 7460    8644    .   -   .   ID=ATCC_51524___hash349ce8f2___4

@meren could you please make anvi-get-sequences-for-gene-calls not to add the hash349ce8f2 part as you previously suggested?

Thanks for all the help!!

meren commented 3 years ago

Done :)

IsabelFE commented 3 years ago

Thanks @meren!!

Now my gffs look like this:

c_000000000001  .   CDS 3654    4955    .   -   .   ID=ATCC_51524___1
c_000000000001  .   CDS 5083    6612    .   -   .   ID=ATCC_51524___2
c_000000000001  .   CDS 6612    7370    .   -   .   ID=ATCC_51524___3
c_000000000001  .   CDS 7460    8644    .   -   .   ID=ATCC_51524___4
c_000000000001  .   CDS 8772    9767    .   -   .   ID=ATCC_51524___5

And my tsv file for ppanggolin cluster has compatible IDs:

GC_00000001 ATCC_51524___900
GC_00000001 KPL1914___1178
GC_00000001 KPL1914___475
GC_00000001 KPL1922_CDC39_95___1080
GC_00000001 KPL1922_CDC39_95___135

However, I still got an error from PPAnGGOLiN:

(PPanGGOLiN) isabelfe@x86_64-apple-darwin13 Dpig_Pangenomics % ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv

2021-02-26 10:47:31 main.py:l180 INFO   Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-02-26 10:47:31 main.py:l181 INFO   PPanGGOLiN version: 1.1.132
2021-02-26 10:47:31 readBinaries.py:l37 INFO    Getting the current pangenome's status
2021-02-26 10:47:31 readBinaries.py:l294 INFO   Reading pangenome annotations...
100%|█████████████████████████████████████████████████████████████████████████████████████| 49587/49587 [00:00<00:00, 201335.07gene/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 18.40organism/s]
2021-02-26 10:47:33 cluster.py:l233 INFO    Reading analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv the gene families file ...
100%|███████████████████████████████████████████████████████████████████████████████| 1478457/1478457 [00:00<00:00, 2474952.77bytes/s]
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 185, in main
    ppanggolin.cluster.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 288, in launch
    readClustering(pangenome, args.clusters, args.infer_singletons, args.force, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 272, in readClustering
    raise Exception(f"Some genes ({nbGeneWtFam}) did not have an associated cluster. Either change your cluster file so that each gene has a cluster, or use the --infer_singletons option to infer a cluster for each non-clustered gene.")
Exception: Some genes (49412) did not have an associated cluster. Either change your cluster file so that each gene has a cluster, or use the --infer_singletons option to infer a cluster for each non-clustered gene.

@axbazin, it looks like it is not finding the clusters on the tsv file for any of the genes (49412). Is this because Anvi'o names every cluster with a unique ID (like GC_00000001), while PPanGGOLiN uses the gene ID of one of the members of the cluster to name the whole cluster (like ATCC_51524___900)?

IsabelFE commented 3 years ago

This is my code and files for the PPanGGOLiN analysis

ppanggolin annotate --anno analysis_PPanGGOLiN_Anvio7/Anvio7GenomesExported.gff.txt --fasta analysis_PPanGGOLiN_Anvio7/Anvio7GenomesReformatted.fa.txt -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7 --basename FromAnvio7

ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv

analysis_PPanGGOLiN_Anvio7.zip

axbazin commented 3 years ago

The fact that PPanGGOLiN uses gene IDs by default should not create problems. We do this to "have a unique ID" but anything could be used in theory.

This is a little puzzling, I will look at it more closely as soon as I have access to a linux machine.

axbazin commented 3 years ago

Hello,

It turned out that there was a slight problem in the log here. The number reported is the number of genes that were associated to a family, and not the opposite. I've fixed that in the master branch. Sorry for the inconvenience.

There are only 175 genes without families, I've listed them in genes_without_families.txt

Adelme

IsabelFE commented 3 years ago

Hello,

I update to the new version and run it again:

(PPanGGOLiN) isabelfe@x86_64-apple-darwin13 Dpig_Pangenomics % ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-03-01 08:53:53 main.py:l180 INFO   Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-03-01 08:53:53 main.py:l181 INFO   PPanGGOLiN version: 1.1.140
2021-03-01 08:53:53 readBinaries.py:l37 INFO    Getting the current pangenome's status
2021-03-01 08:53:53 readBinaries.py:l294 INFO   Reading pangenome annotations...
100%|███████████████████████████████| 49587/49587 [00:00<00:00, 111534.44gene/s]
100%|█████████████████████████████████████| 28/28 [00:01<00:00, 15.73organism/s]
2021-03-01 08:53:56 cluster.py:l235 INFO    Reading analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv the gene families file ...
  0%|                                            | 0/1478457 [00:00<?, ?bytes/s]Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 257, in readClustering
    nbGeneWithFam+=1
UnboundLocalError: local variable 'nbGeneWithFam' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 185, in main
    ppanggolin.cluster.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 290, in launch
    readClustering(pangenome, args.clusters, args.infer_singletons, args.force, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 264, in readClustering
    raise Exception(f"line {lineCounter} of the file '{families_tsv_file.name}' raised an error.")
Exception: line 1 of the file 'analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv' raised an error.
  0%|                                | 29/1478457 [00:00<3:15:44, 125.89bytes/s]

axbazin commented 3 years ago

That was a bad fix. It should be better now.

IsabelFE commented 3 years ago

Thanks @axbazin It worked!! 🥳

I have the whole PPanGGOLiN pipeline working. I was able to write all the outputs and also run the RGPs without any problem.

There is just one last thing that could make the outputs even more useful. In the exported gffs from Anvio, the functional annotations are not present, therefore all the outputs in PPanGGOLiN have only the gene IDs. It would be much better to be able to explore the partition graph and the spots plots with the functional annotation on them.

For example, this is one of the hotspots using the default PPanGGOLiN with Prokka annotated gffs: spot_1

And this is a hotspot now, only with the gene IDs, but no annotations. spot_2

For exploration purposes having the annotations will be great. @meren is there any way to export the gffs with functional annotations on them? This is not 100% needed, but it would be cool.

A big thanks to both of you, @axbazin and @meren, both PPanGGOLiN and Anvi'o have been really useful for our research and now going from one output to the other is much easier since the gene IDs and cluster IDs are the same in both of them. 😊

meren commented 3 years ago

I'm so glad to see you're making progress, @IsabelFE. Thank you, also, @axbazin.

It is doable to export gene functional annotations in GFF files, @IsabelFE and I can take a look at that. Would you mind submitting an issue for this to anvi'o GitHub and in which mention the URL for this issue so there is context?

Thank you,

IsabelFE commented 3 years ago

@meren, I submitted a new issue to Anvi'o GitHub. @axbazin, I think that you can close this issue if you want since everything works now to import from Anvi'o into PPanGGOLiN.

Thanks again to both of you.

axbazin commented 3 years ago

Thank you for your very clear and detailed bug reports and answers !

IsabelFE commented 3 years ago

@axbazin, I think we closed this issue too soon... I found some issued with the annotations.

When I look at the spot plots or at the matrix file output genes have an ID like this: KPL3070_CDS_0757, but this ID does not correspond with the ID on the .gff files:

c_000000000001  .   CDS 789846  792719  .   +   .   ID=KPL3070___757;Name=COG0860;db_xref=COG20:COG0860;product=N-acetylmuramoyl-L-alanine amidase (AmiC) (PDB:1JWQ)
c_000000000001  .   CDS 792855  793406  .   +   .   ID=KPL3070___758
c_000000000001  .   CDS 793384  794352  .   +   .   ID=KPL3070___759;Name=COG2264;db_xref=COG20:COG2264;product=Ribosomal protein L11 methylase PrmA (PrmA) (PDB:3GRZ)

For example, KPL3070_CDS_0757, which based on the spot plot corresponds to COG2264, is not equal to KPL3070757, it is in fact KPL3070759 (that as you can see it is COG2264). It looks like instead of using the information from the Anvio .gff file PPanGGOLiN is doing CDS search again. This goes to my initial post in this issue:

The problem is that the gene_callers_id provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing new gene_callers_id were generated only for the ORFs that were imported into Anvio.

It looks like PPanGGOLiN is generating again new IDs for each gene (the ones in the KPL3070_CDS_0757 format), instead of using the ones from the gff. Therefore, on the PPanGGOLiN output, the gene clusters IDs match the ones from Anvio, but if you look at the individual gene IDs, the numbers are off.

Thanks again!

IsabelFE commented 3 years ago

The funny thing is that the info on the gephi table has the Anvio IDs, but in the matrix file you get the other type of IDs that don't have the same gene_callers_id. gephi_table.xlsx matrix.xlsx

I realized this because I am trying to draw a subgraph for some hotspots using ppanggolin align and I need to provide the sequence of the flanking proteins. When searching for those sequences on the Anvio output I was searching for KPL3070_CDS_0757, this is genome_name=="KPL3070" & gene_callers_id=="757", but soon I realized that what I need is indeed genome_name=="KPL3070" & gene_callers_id=="759"

Could be possible to get the KPL3070___757 annotations on the spots plots? That way, when I find a spot of interest I can find the corresponding protein sequences.

Best,

axbazin commented 3 years ago

In the case of provided annotation, PPanGGOLiN uses the ids in the genome files but also creates internal gene IDs (just in case there are genes with identical ids in different genomes AND PPanGGOLiN is the one performing the clustering afterwards). What I am guessing is happening here is that those internal IDs are reported instead of the user's.

I will look at this ASAP. It is definitely possible to get the right labels on the plot. I notice also that there are graphical issues with margins when IDs are quite long, I'll see if I can do something about this too.

IsabelFE commented 3 years ago

Thanks for looking into this and reopening the issue. I see the point with the genome files IDs vs the PPanGGOLiN internal gene IDs.

I also found another issue with the current annotations on the spot plots. For some genes, only the COG ID is plotted, but a COG ID can be assigned to more than one GC, so ideally having both the KPL3070757 and the COG ID will be perfect. But I understand that from a graphical point of view this is too long. Therefore, having only the gene ID as KPL3070757 (not the internal IDs) is the most useful. I could easily search for the ID either on the gephi_table.xlsx or on the anvio summary table and found the COG ID or the other annotation info.

Thanks!

IsabelFE commented 3 years ago

@axbazin while you fix this I moved ahead and manually found the flanking proteins for my spot of interest and ran ppanggolin align. I got this error:

ppanggolin align -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --getinfo --draw_related --proteins analysis_PPanGGOLiN/Spots/Spot10.fasta -o analysis_PPanGGOLiN/OutputDefaultAnno/Spots/Spot10_output

2021-03-16 14:23:03 main.py:l180 INFO   Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin align -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --getinfo --draw_related --proteins analysis_PPanGGOLiN/Spots/Spot10.fasta -o analysis_PPanGGOLiN/OutputDefaultAnno/Spots/Spot10_output
2021-03-16 14:23:03 main.py:l181 INFO   PPanGGOLiN version: 1.1.141
2021-03-16 14:23:03 readBinaries.py:l37 INFO    Getting the current pangenome's status
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 205, in main
    ppanggolin.align.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/align/alignOnPang.py", line 343, in launch
    align(pangenome = pangenome, proteinFile = args.proteins, output = args.output, tmpdir = args.tmpdir, identity = args.identity, coverage =args.coverage, defrag =args.defrag,cpu= args.cpu, getinfo =args.getinfo, draw_related = args.draw_related )
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/align/alignOnPang.py", line 309, in align
    raise Exception("Cannot use this function as your pangenome does not have gene families representatives associated to it. For now this works only if the clustering is realised by PPanGGOLiN.")
Exception: Cannot use this function as your pangenome does not have gene families representatives associated to it. For now this works only if the clustering is realised by PPanGGOLiN.

Do you have plans to add this functionality for externally clustered gene families? 🙏🏻

axbazin commented 3 years ago

For the last thing that you reported, it is probably possible, though that might require some work. I'll see if I can do something about this in the near futur. It would indeed be quite practical.

axbazin commented 3 years ago

Hello,

In the latest commits I added a new options so that you can choose what is used as label in the figures in priority. This will put gene IDs as labels:

ppanggolin spot -p pangenome.h5 --label_priority ID --draw_hotspots --output figures

If there is a label indicated in the case of provided annotations like yours, it should always be reported in priority in the figures This should solve your first issue.

As another example, this command will display the behavior of showing gene names if they exist, or gene ID if there are no gene names:

ppanggolin spot -p pangenome.h5 --label_priority name,ID --draw_hotspots --output figures

Also, to maybe make things a bit easier, I've updated the '--interest' option to add elements of interest to the figure file names, if you want to see all spots with a given annotation, for example:

ppanggolin spot -p pangenome.h5 --label_priority ID --draw_hotspots --output figures --interest COG2264

should indicate which spots have genes with the annotation 'COG2264', either in the spot or as flanking persistent genes. This should work with family id, gene id, gene names and strings that can be found in the 'product' field of the annotation files.

We're looking to update this feature of drawing spots in the futur, as we're not entirely happy with how it renders currently, so it might evolve.

Adelme

IsabelFE commented 3 years ago

Hi Adelme,

This is really cool! It is great to be able to explore the plots with the name, family, or ID options. Makes going back and forward with Anvio much easier. I love the --interest option, it is a really neat feature! I am already playing with it.

Thanks,

Isabel

axbazin commented 2 years ago

Hi, I'll close this issue since problems have more or less been resolved. Adelme

labgem / PPanGGOLiN

Importing Anvio Pangenomic gene clusters into PPAnGGOLiN #56