gene IDs from Panaroo (1.3.2) presence/absence matrix file do not match IDs in GFF files

qisun2 commented 1 year ago

I am running panaroo 1.3.2, pagoo 0.3.17, both are latest versions.

I noticed that in the matrix files produced by Panaroo (gene_presence_absence_roary.csv and gene_presence_absence.csv), there are two types of gene IDs that are not present in the GFF files, for example: "607_refound_10510" (I do not know where they come from), and "GCF_004322615.1_00534;GCF_004322615.1_00535" (two genes concatenated and separated by ";").

In my run, there are <2% genes like this. I can run a script to replace these items to "" in order to make panaroo_2_pagoo to run. I am wondering whether panaroo_2_pagoo can have an extra step to filter out genes not in gff, and output a warning message and error log to let users know what genes are removed, so that users can decide whether it is acceptable, or at least it helps to troubleshoot.

iferres commented 1 year ago

Hi @qisun2 , it isn't the case? If I understand it correctly, this part of the function does exactly that: https://github.com/iferres/pagoo/blob/8a0922f548599af4d03dead9ef2580209c357848/R/panaroo_2_pagoo.R#L63-L96 But I may be not understanding correctly. Those genes should be always removed.

qisun2 commented 1 year ago

For the panaroo_2_pagoo tool, the problems were IDs like "GCF_004322615.1_00534;GCF_004322615.1_00535", in which mulitiple IDs are merged into one string separated by ";". In the gff there are two separate CDS "GCF_004322615.1_00534" and "GCF_004322615.1_00535".

As panaroo also output a roary formatted matrix, I was using the roary_2_pagoo tool to process, which has two problems: the concatenated IDs and these "refound" IDs. But I do not have to use roary_2_pagoo

Here is a shortened example of the gene_presence_absence.csv file. The gene that caused the problem is GCF_004322615.1_00534;GCF_004322615.1_00535.

Gene,Non-unique Gene name,Annotation,GCF_004307355.1,GCF_004322615.1,GCF_004322625.1 mhqA_2,mhqA_2,Putative ring-cleaving protein,GCF_004307355.1_00260,GCF_004322615.1_00890,GCF_004322625.1_01819 mrpA_2,mrpA_2,antiporter subunit A,GCF_004307355.1_00598,GCF_004322615.1_00534;GCF_004322615.1_00535,GCF_004322625.1_01524 rocD2,rocD2,Ornithine aminotransferase 2,GCF_004307355.1_00861,GCF_004322615.1_00280,GCF_004322625.1_01261

iferres commented 1 year ago

I think there are some changes in the format respect prior versions of panaroo, last time I checked it was version 1.2.9. I will try to address this next week. Thanks for reporting!

iferres commented 1 year ago

Hi @qisun2 , could you provide a reproducible example to work with? As small as possible, but which contains these weird cases? The gene_presence_absence.csv and the gffs

qisun2 commented 1 year ago

https://cornell.box.com/s/cfg2bxbov8wpcd4x5aakd28entf7tbsm

From: Ignacio Ferrés @.> Sent: Monday, February 6, 2023 2:09 PM To: iferres/pagoo @.> Cc: Qi Sun @.>; Mention @.> Subject: Re: [iferres/pagoo] gene IDs from Panaroo (1.3.2) presence/absence matrix file do not match IDs in GFF files (Issue #60)

Hi @qisun2https://github.com/qisun2 , could you provide a reproducible example to work with? As small as possible, but which contains these weird cases? The gene_presence_absence.csv and the gffs

— Reply to this email directly, view it on GitHubhttps://github.com/iferres/pagoo/issues/60#issuecomment-1419603912, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJYOGOPGC3ZA4JBB6NJXKGTWWFD6FANCNFSM6AAAAAAUQPWSGU. You are receiving this because you were mentioned.Message ID: @.**@.>>

iferres commented 1 year ago

I had no problems reading the files (with or without the gffs):

> gffs <- list.files("gff/", full.names=T)
> gpa <- "gene_presence_absence.csv"
> suppressPackageStartupMessages(library(pagoo))
> p <- panaroo_2_pagoo(gpa)
Reading csv file (panaroo).
Processing csv file.
Loading PgR6M class object.
Checking class.
Checking dimnames.
Creating gid (gene ids).
Checking provided cluster metadata.
Creating panmatrix.
Populating class.
Done.
>  
> p <- panaroo_2_pagoo(gpa, gffs)
Reading csv file (panaroo).
Processing csv file.
Reading gff file gff//GCA_026284395.1.gff
Reading gff file gff//GCA_026327615.1.gff
Reading gff file gff//GCF_000478385.1.gff
Reading gff file gff//GCF_004307335.1.gff
Reading gff file gff//GCF_004307355.1.gff
Reading gff file gff//GCF_004322615.1.gff
Reading gff file gff//GCF_004322625.1.gff
Reading gff file gff//GCF_006382135.1.gff
Reading gff file gff//GCF_006382315.1.gff
Reading gff file gff//GCF_006382545.1.gff
Reading gff file gff//GCF_006406235.1.gff
Reading gff file gff//GCF_009183675.1.gff
Reading gff file gff//GCF_009184805.1.gff
Reading gff file gff//GCF_016092595.1.gff
Reading gff file gff//GCF_016805155.1.gff
Reading gff file gff//GCF_016805185.1.gff
Loading PgR6MS class object.
Checking class.
Checking dimnames.
Creating gid (gene ids).
Checking provided cluster metadata.
Creating panmatrix.
Populating class.
Checking input sequences.
Checking that sequence names matches with DataFrame.
Adding metadata to sequences.
Done.

I didn't find any weird gene name in the example.

However, I just pushed some changes to address issue #59 , you can reinstall pagoo from source like:

devtools::install_github("iferres/pagoo") # installs pagoo 0.3.18

and try again. Let me know if everything works fine.

iferres / pagoo

gene IDs from Panaroo (1.3.2) presence/absence matrix file do not match IDs in GFF files #60