jodyphelan / ntmdb

3 stars 1 forks source link

Reference genomes for M. Abscessus? #1

Closed harrismia closed 2 years ago

harrismia commented 2 years ago

Hi Jody,

Which reference genomes do you recommend using for M. abscessus? I understand that each of the three subspecies: 

 M. abscessus subspecies abscessus, M. abscessus subspecies massiliense, M. abscessus subspecies  bolletii 

has a different reference genome.  
jodyphelan commented 2 years ago

I think initially to keep it simple we can use the reference genome for M. abscessus subspecies abscessus for all subspecies. They seem to be similar enough that the reads align to genome from all species.

harrismia commented 2 years ago

We are still experimenting with aligning reads to the different references. Using the subspecies abscesses reference for alignment of reads seems like a good way to get started.

For reference genomes, we are using the following: (accessions are NCBI).

I believe the M. abscesses subspecies abscesses is the same as the one that you mentioned.

When we run our samples through the NTM-Profiler speciation function, there were no m. abscesses subspecies bolletii called. Do you have kmers for bolletii included in the kmer list?

jodyphelan commented 2 years ago

Thanks for looking into this.

I couldn't find any annotations present for NZ_AP018436.1 or NZ_AP014547.1. Do you know if they have been released?

I had a look and found that some kmers were indeed missing. I'll release an update for that soon!

lorenziha commented 2 years ago

Hi Jody. The annotations for NZ_AP018436.1 or NZ_AP014547.1 assemblies can be found in the links below:

For AP018436.1: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/609/715/GCA_003609715.1_ASM360971v1/

For AP014547.1: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/497/265/GCA_000497265.2_ASM49726v2/

jodyphelan commented 2 years ago

Thanks @lorenziha!

I've had a look at the resistance genes for the different references and noticed some differences.

For the gene rrl I noticed that the annotations for CU458896.1 and NZ_AP018436.1 differ in the starting position by 1 nt.

CU458896.1

image

NZ_AP018436.1

image

Alignment

image

This means that all the gene positions for mutations will be different too. For example, sample ERR369334 (M. abscessus subsp. bolletii) has a mutation which depending on which reference we use will be called as n.2270A>T (using CU458896.1) or n.2269A>T (using NZ_AP014547.1). It seems most papers use n.2270A>T when talking about the mutation (e.g. in this article):

The most commonly identified rrl mutations were the canonical A2270N (E. coli A2058N) (35/51, 68.6%) and A2271N (E. coli A2059N) (8/51, 15.6%).

It is also interesting to note that the although the bolletii reference starts later, missing out the C, it does actually have the same sequence as the abscessus reference (including the missing C) so I am wondering if this is an annotation error? Just checking to avoid outputting the different positions for the different species and perhaps adding confusion when comparing between datasets.

lorenziha commented 2 years ago

Hi Jody,

Thanks for the information. I just did some additional research at GenBank and, based on a 23S rRNA entry curated by NCBI as part of their NCBI RefSeq Targeted Loci Project from Mycobacterium abscessus subsp. bolletii strain GO 06 (see link below), the start and end coords of CU458896.1 seem to be the correct ones. The annotation of the rrl gene in NZ_AP018436.1 was automatically predicted with INFERNAL:1.1.1, using the gene prediction method "cmsearch" and based on the following Rfam model "RFAM:RF02541" and therefore, it is most likely to be incorrect.

https://www.ncbi.nlm.nih.gov/nuccore/NR_076981.1

jodyphelan commented 2 years ago

Ah thanks for looking into it. I guess there are a few options:

1) Use as is with the wrong coordinates 2) Change the entry in the GFF file and store the correct one 3) Use CU458896.1 for all subspecies

3 is probably the easiest maintenance-wise but if people would like to use the bams in other downstream analysis then 2 might be a better option

Let me know your thoughts on this @lorenziha and @harrismia

jeffreybm commented 2 years ago

Hi Jody,

After talking with @lorenziha and @harrismia we agree that your initial suggestion of going with the abscessus CU458896.1 as reference for DST prediction is best (option 3).

When looking at the different genbank entries we found a few other issues with naming and coordinates for genes being different between eg. entry AP018436.1 vs the newer refseq NZ_AP018436.1 in addition to what you posted above. Too many little things to change or keep in mind.

I think the 'erm' gene story would add some additional difficulties if using a separate reference for each subspecies, where the subsp. massielense reference lacks a functional erm gene, causing inherent sensitivity to macrolides. Would have to add additional logic in to code to deal with cases where using massielense as reference compared to just using abscessus CU458896.1.

I was looking at the rrs gene for amikacin resistance, the coordinate n.1374A>G given in the nature article may be off by one. 1374 is a C, and 1375 is an A. The genbank and the GFF from ensembl you linked above both have the 16S gene starting at coord 1462398 so dont see any disagreement there.

Cheers

jodyphelan commented 2 years ago

Hi @jeffreybm

Ok thanks for all the feedback! I've got a working resistance profiling function now for M. abscessus and will be releasing in the next few days.