phac-nml / staramr

Scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases.
Apache License 2.0
111 stars 25 forks source link

different percent identity/alignment length for identical sequences on opposite strands #201

Open lerminin opened 9 months ago

lerminin commented 9 months ago

Hello,

I've come across an odd discrepancy - I have a plasmid with two copies of qacE gene (X68232) on opposite strands and I'm getting differing percent identity/alignment length to the reference for each copy despite the nucleotide sequences being identical. The output I get from StarAMR/Resfinder is the following:

Isolate ID Data Data Type Predicted Phenotype CGE Predicted Phenotype %Identity %Overlap HSP Length/Total Length Contig Start End Accession
plasmid qacE Resistance unknown[qacE_1_X68232] Benzylkonium Chloride, Ethidium Bromide, Chlorhexidine, Cetylpyridinium Chloride 100.0 84.68 282/333 p03A17005_A_consensus 33598 33879 X68232
plasmid qacE Resistance unknown[qacE_1_X68232] Benzylkonium Chloride, Ethidium Bromide, Chlorhexidine, Cetylpyridinium Chloride 99.65 85.59 285/333 p03A17005_A_consensus 60928 60644 X68232

Initially this suggested to me that these are two distinct sequences: one on the forward strand with 100 % identity over 282 nt, and one on the reverse strand with 99.65% identity over 285 nt. However, I aligned both qacE sequences to the X68232 reference to search for where the sequences differed and found the following:

Seems like the alignment is ending at 282 nt for the forward strand, but the next three bases in the reverse strand are included with a single mismatch over the 285 nt length. Even though the forward strand has the same sequence, the last three nts are being excluded. Given the nt sequences are identical on both strands, I would expect the ResFinder output to be the same for both strands. Is this an artifact of StarAMR or is this a BLAST issue?

I'm also noticing a similar pattern with blaOXA-10 (J03427) - getting 100 % alignment over 768/801 nt in the reference when blaOXA-10 is on the reverse strand, but 99.24 % alignment over 788/801 nt in the reference when blaOXA-10 is on the forward strand. Sequences between forward and reverse are 100% identical in both cases.

plasmid file: plasmid.fasta.txt version = 0.10.0 (same issue occurs in v0.9.1) resfinder_gene_drug_version = 072621 resfinder_db_date = Tue, 24 May 2022 06:51

apetkau commented 9 months ago

Thanks for the great write-up of this issue. The differences you see is interesting. I don't quite have time to look in-depth right now, but one thing you could try is to add the option --report-all-blast to StarAMR.

StarAMR runs blast against the ResFinder database, which can consist of many sequences for the same gene with minor variation. StarAMR then attempts to pick the "best" blast hit based on some logic related to percent identity and length. Logic for this is here:

https://github.com/phac-nml/staramr/blob/d8370b70b83f08711d53421d10d06a4cee369af5/staramr/blast/results/BlastResultsParser.py#L115-L122

Maybe something strange is going on with how the blast hit to report is being selected? The --report-all-blast includes all blast hits in the output table.

lerminin commented 9 months ago

Thanks for the quick reply. I tried running with the --report-all-blast flag but am only getting one hit per gene copy in the output file. Does this suggest its an underlying blast issue?

apetkau commented 9 months ago

It's possibly a blast issue. You can run with verbose turned on staramr --verbose search ... to see more information, including each blast hsp.

I did test out by adding qacE_1_X68232 to the beginning and ending of an example genome (and reverse complementing the 2nd sequence) and here are the records I get:

2023-09-29 10:52:06 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                           contig00001
pident                                                 100.0
length                                                   333
qstart                                                     1
qend                                                     333
sstart                                                     1
send                                                     333
slen                                                  641340
qlen                                                     333
sstrand                                                 plus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                                100.0
Name: 0, dtype: object
2023-09-29 10:52:06 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                           contig00001
pident                                                 100.0
length                                                   333
qstart                                                     1
qend                                                     333
sstart                                                641313
send                                                  640981
slen                                                  641340
qlen                                                     333
sstrand                                                minus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                                100.0
Name: 1, dtype: object

That is, they have identical percent identity and length.

Maybe there is some particular sequence around one of the copies of your qacE gene that is causing blast to report a slightly different hsp?

lerminin commented 8 months ago

I will take a look at the flanking sequences. This is my output from running the above plasmids with --verbose:

2023-10-10 11:48:58 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                 p03A17005_A_consensus
pident                                                 100.0
length                                                   282
qstart                                                     1
qend                                                     282
sstart                                                 33598
send                                                   33879
slen                                                   99166
qlen                                                     333
sstrand                                                 plus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                            84.684685
Name: 0, dtype: object
2023-10-10 11:48:58 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                 p03A17005_A_consensus
pident                                                99.649
length                                                   285
qstart                                                     1
qend                                                     285
sstart                                                 60928
send                                                   60644
slen                                                   99166
qlen                                                     333
sstrand                                                minus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                            85.585586
Name: 1, dtype: object

A pattern I've noticed is that this is only happening in cases (qacE, blaOXA-10, catB3) where there is a less than 100% length match to the reference sequence. Could that possibly be influencing this discrepancy? If I align my plasmid sequence against the reference 333 bp length of X68232 in the online NCBI blastn interface, I get two hits that have 100% identity over 282/333 nucleotides for both strands, which I would interpret as correct

apetkau commented 8 months ago

Thanks. We will have to investigate this further later on as we don't have the resources to look into this issue right now.

My suspicion though is differences in blast parameters or blast versions. You could try directly running the version of blast used by StarAMR to look at the output more closely.