How to use dndscv in bacteria species?

loukesio commented 3 years ago

Thank you for the nice package.

I am interested in using it for Pseudomonas and many other bacteria species if I figure out how. I do not have custom transcript but there are databases from where we can isolate the CDS and their positions. I am not so sure how strict someone has to be in RefCDS format that you propose in the tutorial (?) My point is if I extract the CDS information from a gff file will I be able to run the package? I have almost all the info expect of the CDS id.

thank you for your time

im3sanger commented 3 years ago

Hello,

Thanks for your interest in dNdScv.

Yes, it should be easy to generate a RefCDS object for a bacterial species using a gff file. You will need to modify the gff to create a transcript table with the columns shown in the buildref tutorial. The format for bacteria is simple as you do not need to worry about exons and introns. You simply need one line per gene. Below is an example showing two genes of the reference table for the SARS-CoV-2 genome:

gene.id gene.name   cds.id  chr chr.coding.start    chr.coding.end  cds.start   cds.end length  strand
ORF1ab:nsp2 ORF1ab:nsp2 ORF1ab:nsp2 MN908947.3  806 2719    1   1914    1914    1
ORF1ab:nsp3 ORF1ab:nsp3 ORF1ab:nsp3 MN908947.3  2720    8554    1   5835    5835    1

Here I am using the same gene name for the gene.id, gene.name and cds.id columns. The 4th column is the chromosome name (from the FASTA header), the 5th and 6th columns are the start and end of the gene. The 6th column (cds.start, the first coding base of the segment) is "1" in your case, the 7th and 8th columns (cds.end and cds.length) are simply the gene length. And the last column is the strand.

I hope this makes sense. Let me know if you experience any problems.

Best wishes, Inigo

loukesio commented 3 years ago

Dear Inigo, Thank you for your prompt reply. I highly appreciate it. I tried again today to prepare the cds_table, and my problems lie in the fact that I can not add chr.coding.start and chr.coding.end columns. Stupid question How different is the chr.coding.start and chr.coding.end to the cds.start and cds.end (?)

Here in this link you can find the my initial gtf and my effort to create cds_table [https://drive.google.com/drive/folders/17gRlF1z3Q7wtBLqJwH7KB01nZslZ7kAa?usp=sharing]

Here is my code to convert the gtf to the dNdScv format

gtf <- read.table("SBW25_gtf.gtf") head(gtf)

cds_table = gtf %>% filter(V3=="CDS") %>% select(.,V10,V1,V4,V5,V7) %>% rename(gene.id=V10,chr=V1, cds.start=V4, cds.end=V5, strand=V7) %>% mutate(length=cds.end-cds.start+1) %>% mutate(gene.name=gene.id, cds.id=gene.id) %>% relocate(c(gene.name,cds.id), .after=gene.id)

I would appreciate it if you can have a look. I am sorry in advance if I bother you. I find dNdScv incredible, and I am looking forward to applying it. Thank you for your time with best wishes, Loukas

im3sanger commented 3 years ago

Hi Loukas,

Thanks. I cannot access the google drive file that you link there to give you a more detailed answer. I just requested permission.

In the meantime, I will try to clarify what chr.coding.start, chr.coding.end, cds.start and cds.end mean for a bacterial genome (no splicing) where each gene takes one row in the transcript table:

chr.coding.start and chr.coding.end: These are the start and end coordinates for the gene in the bacterial genome.
cds.start and cds.end: These are the positions of the first and last coding bases in the segment defined by chr.coding.start and chr.coding.end. In practice, in bacteria, where you have an entry per gene, cds.start = 1 and cds.end = length.

For example, imagine that geneA is located chrX with coordinates 10290 to 13451 (length = 13451-10290+1 = 3162) and that it is encoded in the -1 strand, then the entry for this gene in the reference table would be as follows:

gene.id gene.name   cds.id  chr chr.coding.start    chr.coding.end  cds.start   cds.end length  strand
geneA   geneA   geneA   chrX    10290   13451   1   3162    3162    -1

Does this make sense? I appreciate that several columns are redundant here (a consequence of this format being originally taken from Ensembl BioMart for species with introns and non-coding exonic regions). But it should be straightforward to create a reference table for bacterial simply knowing the start and end position of the genes and their strand.

I will look into your file once I receive access.

Best wishes, Inigo

loukesio commented 3 years ago

Thank you Inigo for your prompt reply. Once more I am impressed for the ample help that you offer.

I should have made the documents open to everyone just in case someone has similar issues. I ll make sure that they are publicly available. Sorry for that.

I think I got what you mean. I ll let it digest a bit and coming back with an update.

im3sanger commented 3 years ago

Hello,

No worries. The tutorial for buildref is too focused on eukaryotic genomes at the moment. I should modify it to include simpler examples for bacterial or viral genomes. But for now this thread could help other users.

I have looked into your files. I have noticed a few problems. Your cds file is missing some columns and has others in the wrong order. Your chromosome name in the cds table (NC_012660) did not match that in the fasta file (NC_012660.1), and strand information should be coded as "1" and "-1", rather than "+/-". Once I fixed these minor issues, I was able to run buildref without problems (it ran in a few minutes).

You can access the output RefCDS rda file here and the correct transcript table here.

Once you run buildref, it is always a good idea to load the RefCDS object into R and look at the number and the list of genes successfully included in the object. If some genes contain "N" nucleotides or are incomplete (length not multiple of 3 as expected for coding sequences), they will be removed by buildref. So you should check that the object created contains your genes of interest. As far as I can see in your case, all genes were successfully processed (n = 5921).

genes = sapply(RefCDS, function(x) x$gene_name)

A more conceptual note... I have not used dNdScv on bacterial genomes before, although I have used it on SARS-CoV-2 data. It should be suitable to study mutation accumulation data (like Lenski's). It should also be applicable to natural variation within a bacterial species, but you should be careful when analysing very divergent sequences.

I hope this helps.

Best, Inigo

loukesio commented 3 years ago

Dear Inigo,

Thank you for having a look at my documents and correcting them. I appreciate it a lot. Please feel free to add Pseudomonas fulorescens SBW25 in your list of RedCDS .rda extension files.

1. I'll make sure today that my inquiry files are publicly available. 2. I have an experimental setup that is quite similar to the Lenski experiment. I am preparing my time-series files atm with the mutants. If you want, I can update you with another comment on how did it go (?).

I highly appreciate the time you have spent helping me.

with best wishes, Loukas

im3sanger commented 3 years ago

Hi Loukas,

Thank you, that would be great. If you are happy to share your experience, I would be interested in hearing how dndscv performed on your data. In an experiment like Lenski's I think the model should work well. Horizontal gene transfer in natural variation, could be more problematic, as it could violate the underlying Gamma model of mutation rates across genes (dndsloc could be useful in this context).

Don't hesitate to contact me if you have any other issues. Feedback from users applying dndscv outside of humans is always useful as I have mostly used it on human data.

Best, Inigo

im3sanger / dndscv

How to use dndscv in bacteria species? #62