Open lerminin opened 1 year ago
Thanks for the great write-up of this issue. The differences you see is interesting. I don't quite have time to look in-depth right now, but one thing you could try is to add the option --report-all-blast
to StarAMR.
StarAMR runs blast against the ResFinder database, which can consist of many sequences for the same gene with minor variation. StarAMR then attempts to pick the "best" blast hit based on some logic related to percent identity and length. Logic for this is here:
Maybe something strange is going on with how the blast hit to report is being selected? The --report-all-blast
includes all blast hits in the output table.
Thanks for the quick reply. I tried running with the --report-all-blast
flag but am only getting one hit per gene copy in the output file. Does this suggest its an underlying blast issue?
It's possibly a blast issue. You can run with verbose turned on staramr --verbose search ...
to see more information, including each blast hsp.
I did test out by adding qacE_1_X68232 to the beginning and ending of an example genome (and reverse complementing the 2nd sequence) and here are the records I get:
2023-09-29 10:52:06 DEBUG ResfinderHitHSP.__init__,25: record=qseqid qacE_1_X68232
sseqid contig00001
pident 100.0
length 333
qstart 1
qend 333
sstart 1
send 333
slen 641340
qlen 333
sstrand plus
sseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength 100.0
Name: 0, dtype: object
2023-09-29 10:52:06 DEBUG ResfinderHitHSP.__init__,25: record=qseqid qacE_1_X68232
sseqid contig00001
pident 100.0
length 333
qstart 1
qend 333
sstart 641313
send 640981
slen 641340
qlen 333
sstrand minus
sseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength 100.0
Name: 1, dtype: object
That is, they have identical percent identity and length.
Maybe there is some particular sequence around one of the copies of your qacE gene that is causing blast to report a slightly different hsp?
I will take a look at the flanking sequences. This is my output from running the above plasmids with --verbose
:
2023-10-10 11:48:58 DEBUG ResfinderHitHSP.__init__,25: record=qseqid qacE_1_X68232
sseqid p03A17005_A_consensus
pident 100.0
length 282
qstart 1
qend 282
sstart 33598
send 33879
slen 99166
qlen 333
sstrand plus
sseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength 84.684685
Name: 0, dtype: object
2023-10-10 11:48:58 DEBUG ResfinderHitHSP.__init__,25: record=qseqid qacE_1_X68232
sseqid p03A17005_A_consensus
pident 99.649
length 285
qstart 1
qend 285
sstart 60928
send 60644
slen 99166
qlen 333
sstrand minus
sseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength 85.585586
Name: 1, dtype: object
A pattern I've noticed is that this is only happening in cases (qacE, blaOXA-10, catB3) where there is a less than 100% length match to the reference sequence. Could that possibly be influencing this discrepancy? If I align my plasmid sequence against the reference 333 bp length of X68232 in the online NCBI blastn interface, I get two hits that have 100% identity over 282/333 nucleotides for both strands, which I would interpret as correct
Thanks. We will have to investigate this further later on as we don't have the resources to look into this issue right now.
My suspicion though is differences in blast parameters or blast versions. You could try directly running the version of blast used by StarAMR to look at the output more closely.
Hello,
I've come across an odd discrepancy - I have a plasmid with two copies of qacE gene (X68232) on opposite strands and I'm getting differing percent identity/alignment length to the reference for each copy despite the nucleotide sequences being identical. The output I get from StarAMR/Resfinder is the following:
Initially this suggested to me that these are two distinct sequences: one on the forward strand with 100 % identity over 282 nt, and one on the reverse strand with 99.65% identity over 285 nt. However, I aligned both qacE sequences to the X68232 reference to search for where the sequences differed and found the following:
Seems like the alignment is ending at 282 nt for the forward strand, but the next three bases in the reverse strand are included with a single mismatch over the 285 nt length. Even though the forward strand has the same sequence, the last three nts are being excluded. Given the nt sequences are identical on both strands, I would expect the ResFinder output to be the same for both strands. Is this an artifact of StarAMR or is this a BLAST issue?
I'm also noticing a similar pattern with blaOXA-10 (J03427) - getting 100 % alignment over 768/801 nt in the reference when blaOXA-10 is on the reverse strand, but 99.24 % alignment over 788/801 nt in the reference when blaOXA-10 is on the forward strand. Sequences between forward and reverse are 100% identical in both cases.
plasmid file: plasmid.fasta.txt version = 0.10.0 (same issue occurs in v0.9.1) resfinder_gene_drug_version = 072621 resfinder_db_date = Tue, 24 May 2022 06:51