phac-nml / staramr

Scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases.
Apache License 2.0

Dashes in updated pointfinder db cause crash #207

Open lerminin opened 3 months ago

lerminin commented 3 months ago

Hello,

I updated my databases in v0.10.0 with staramr db update --update-default. The updated PointFinder database contains some new entries with a different naming structure, which causes the pipeline to fail.

Command: staramr search -o e_coli.fasta -o results --pointfinder-organism escherichia_coli

Output:

conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/Bio/Application/__init__.py:40: BiopythonDeprecationWarning: The Bio.Application modules and modules relying on it have been deprecated.

Due to the on going maintenance burden of keeping command line application
wrappers up to date, we have decided to deprecate and eventually remove these
modules.

We instead now recommend building your command line and invoking it directly
with the subprocess module.
  warnings.warn(
2024-03-04 15:10:39 WARNING: Using non-default ResFinder/PointFinder. This may lead to differences in the detected AMR genes depending on how the database files are structured.
2024-03-04 15:10:39 INFO: No --plasmidfinder-database-type specified. Will search the entire PlasmidFinder database
2024-03-04 15:10:39 INFO: --output-dir set. All files will be output to [results_17]
2024-03-04 15:10:39 INFO: Will exclude ResFinder/PointFinder genes listed in [conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/databases/exclude/data/genes_to_exclude.tsv]. Use --no-exclude-genes to disable
2024-03-04 15:10:39 INFO: Will report complex mutations listed in [conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/databases/resistance/pointfinder/complex/data/complex_mutations.tsv]
2024-03-04 15:10:39 INFO: Making BLAST databases for input files
2024-03-04 15:10:39 INFO: Scheduling blasts and MLST for 17A19CPO005.fasta
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=catB3_2, accession=U13880
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=aac(6')-Ib-cr_1, accession=DQ303918
2024-03-04 15:10:47 WARNING: Multiple entries found for drug_class=all, gene=aac(6')-Ib-cr_1, accession=DQ303918
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=blaOXA-1_1, accession=HQ170510
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=blaCTX-M-15_1, accession=AY044436
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=blaCMY-42_1, accession=HM146927
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=qnrS1_1, accession=AB187515
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=blaOXA-181_1, accession=CM004561
2024-03-04 15:10:47 WARNING: No drug found for drug_class=all, gene=mph(A)_2, accession=U36578
2024-03-04 15:10:47 WARNING: Multiple entries found for drug_class=aminoglycoside, gene=aac(6')-Ib-cr_1, accession=DQ303918
2024-03-04 15:10:47 ERROR: invalid literal for int() with base 10: 'ampC-promoter-size-53'
Traceback (most recent call last):
  File "conda/staramr_v0.10.0_updateddb/bin/staramr", line 68, in <module>
    args.run_command(args)
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/subcommand/Search.py", line 480, in run
    results = self._generate_results(database_repos=database_repos,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/subcommand/Search.py", line 296, in _generate_results
    amr_detection.run_amr_detection(files,pid_threshold, plength_threshold_resfinder,
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/detection/AMRDetection.py", line 198, in run_amr_detection
    self._pointfinder_dataframe = self._create_pointfinder_dataframe(pointfinder_blast_map, pid_threshold,
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/detection/AMRDetectionResistance.py", line 62, in _create_pointfinder_dataframe
    return pointfinder_parser.parse_results()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/blast/results/BlastResultsParser.py", line 67, in parse_results
    self._handle_blast_hit(file, database_name, blast_out, results, hit_seq_records)
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/blast/results/BlastResultsParser.py", line 105, in _handle_blast_hit
    partitions.append(self._create_hit(in_file, database_name, blast_record))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/blast/results/pointfinder/BlastResultsParserPointfinder.py", line 54, in _create_hit
    return PointfinderHitHSPPromoter(file, blast_record, database_name)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/blast/results/pointfinder/nucleotide/PointfinderHitHSPPromoter.py", line 20, in __init__
    self._parse_database_name(database_name)
  File "conda/staramr_v0.10.0_updateddb/lib/python3.11/site-packages/staramr/blast/results/pointfinder/nucleotide/PointfinderHitHSPPromoter.py", line 118, in _parse_database_name
    size = int(size_string.replace('bp', ''))  # remove the 'bp' and convert to an int
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'ampC-promoter-size-53'

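From my understanding, this happens because _parse_database_name in PointfinderHitHSPPromoter.py splits the database name on underscores, while the new entry uses dashes. The split therefore returns the whole name as a single token, and int() fails on it: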

def _parse_database_name(self, database_name):
    """
    Parses the name of the database in order to obtain the promoter offset.
    The database name is expected to have the following format:

    [GENENAME]_promoter_size_[SIZE]bp

    example:

    embA_promoter_size_115bp
    """
    tokens = database_name.split("_")  # split the name into tokens
    size_string = tokens[len(tokens) - 1]  # get the last token
    size = int(size_string.replace('bp', ''))  # remove the 'bp' and convert to an int

    self.offset = size
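
For illustration, here is a minimal sketch of a parser that tolerates both separators. It is not staramr's code: parse_promoter_offset and the regular expression are hypothetical, and the sketch assumes the size is always the trailing number of the name, optionally followed by "bp".

```python
import re

# Hypothetical pattern (an assumption, not taken from staramr):
# "size", preceded and followed by "_" or "-", then digits and an optional "bp".
_PROMOTER_SIZE = re.compile(r"[_-]size[_-](\d+)(?:bp)?$")


def parse_promoter_offset(database_name: str) -> int:
    """Return the promoter size encoded at the end of a PointFinder entry name."""
    match = _PROMOTER_SIZE.search(database_name)
    if match is None:
        raise ValueError(f"Cannot find a promoter size in {database_name!r}")
    return int(match.group(1))


if __name__ == "__main__":
    # The first name is the documented format; the second is the one from the error above.
    for name in ("embA_promoter_size_115bp", "ampC-promoter-size-53"):
        print(name, parse_promoter_offset(name))
```

Matching the size with a pattern rather than taking the last "_"-separated token means an unexpected name raises an error that mentions the full entry name.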

I modified my PointFinder database files as a quick workaround (renaming ampC-promoter-size-53 to ampC_promoter_size_53), and it now runs fine for me. I'm opening this as an FYI, since other genes may have similar issues.
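
To look for other affected entries, something like the following rough sketch could be run over a list of entry names. The helper name is hypothetical, the list of names has to be collected from the database files by hand (their layout is not assumed here), and treating any name containing "promoter" as a promoter entry is an assumption.

```python
def names_breaking_underscore_parser(names):
    """Return promoter entry names that the underscore-based parsing would reject."""
    broken = []
    for name in names:
        # Assumption: promoter entries are the ones with "promoter" in their name.
        if "promoter" not in name:
            continue
        # Same steps as _parse_database_name: last "_" token, drop "bp", convert.
        last_token = name.split("_")[-1].replace("bp", "")
        try:
            int(last_token)
        except ValueError:
            broken.append(name)
    return broken


if __name__ == "__main__":
    example_names = ["embA_promoter_size_115bp", "ampC-promoter-size-53", "gyrA"]
    print(names_breaking_underscore_parser(example_names))
```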