phac-nml / staramr

Scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases.
Apache License 2.0

AttributeError: 'float' object has no attribute 'split' #175

Closed vappiah closed 1 year ago

vappiah commented 1 year ago

Dear Developers,

I installed staramr (0.9.1) on an Ubuntu 20.04 system using mamba. When I try to run staramr (I am following the tutorial on your GitHub page), I get the error message below. Please advise.

    No --plasmidfinder-database-type specified. Will search the entire PlasmidFinder database
    2023-04-24 21:56:59 INFO: --output-dir set. All files will be output to [output]
    2023-04-24 21:56:59 INFO: Will exclude ResFinder/PointFinder genes listed in [/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/databases/exclude/data/genes_to_exclude.tsv]. Use --no-exclude-genes to disable
    2023-04-24 21:56:59 INFO: Making BLAST databases for input files
    2023-04-24 21:57:00 INFO: Scheduling blasts and MLST for isolate1.fasta
    2023-04-24 21:57:00 INFO: Scheduling blasts and MLST for isolate2.fasta
    2023-04-24 21:57:20 ERROR: 'float' object has no attribute 'split'
    Traceback (most recent call last):
      File "/home/bioinfocoach/apps/mamba/envs/staramr/bin/staramr", line 68, in
        args.run_command(args)
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py", line 467, in run
        results = self._generate_results(database_repos=database_repos,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py", line 287, in _generate_results
        amr_detection.run_amr_detection(files,pid_threshold, plength_threshold_resfinder,
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/detection/AMRDetection.py", line 194, in run_amr_detection
        self._pointfinder_dataframe = self._create_pointfinder_dataframe(pointfinder_blast_map, pid_threshold,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/detection/AMRDetectionResistance.py", line 56, in _create_pointfinder_dataframe
        return pointfinder_parser.parse_results()
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/BlastResultsParser.py", line 67, in parse_results
        self._handle_blast_hit(file, database_name, blast_out, results, hit_seq_records)
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/BlastResultsParser.py", line 109, in _handle_blast_hit
        blast_results = self._get_result_rows(hit, database_name)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/pointfinder/BlastResultsParserPointfinder.py", line 98, in _get_result_rows
        results.append(self._get_result(hit, db_mutation))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/pointfinder/BlastResultsParserPointfinderResistance.py", line 55, in _get_result
        drug = self._arg_drug_table.get_drug(self._blast_database.get_organism(), hit.get_amr_gene_id(),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/databases/resistance/pointfinder/ARGDrugTablePointfinder.py", line 40, in get_drug
        return self._drug_string_to_correct_separators(drug.iloc[0])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/databases/resistance/ARGDrugTable.py", line 44, in _drug_string_to_correct_separators
        return ', '.join(drug.split(','))
        ^^^^^^^^^^
    AttributeError: 'float' object has no attribute 'split'

vappiah commented 1 year ago

I realized the problem was pandas. Version 2.0.1 was causing the error, but when I downgraded to 1.5.3 it worked. Only these warning messages were displayed:

    /home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py:544: FutureWarning: DataFrame.set_axis 'inplace' keyword is deprecated and will be removed in a future version. Use obj = obj.set_axis(..., copy=False) instead
      settings_dataframe.set_axis(['Value'], axis='columns', inplace=True)
    /home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py:200: FutureWarning: save is not part of the public API, usage can give unexpected results and will be removed in a future version
      writer.save()
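For readers hitting the same warnings: the first one asks for the result of set_axis to be reassigned instead of passing inplace=True. A minimal sketch of that pattern follows. The table contents are made up for illustration and are not staramr's actual settings.

```python
import pandas as pd

# Hypothetical stand-in for the settings table; the real one is built in Search.py.
settings_dataframe = pd.DataFrame(
    {0: ["value_a", "value_b"]},
    index=["setting_a", "setting_b"],
)

# Reassign the returned frame rather than using the deprecated inplace=True.
settings_dataframe = settings_dataframe.set_axis(["Value"], axis="columns")

print(list(settings_dataframe.columns))
```

The second warning concerns writer.save(); the public alternatives in pandas are ExcelWriter.close() or using the writer as a context manager.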

apetkau commented 1 year ago

Thanks for reporting this issue. We will have to fix this for our next release to make sure staramr is compatible with pandas >= 2.

emarinier commented 1 year ago

I'm not convinced that this particular error is directly related to the version of pandas. The original error is raised in a method of ARGDrugTable:

    def _drug_string_to_correct_separators(self, drug):
        """
        Converts a drug string (separated by commas) to use correct separators/spacing.
        :param drug: The drug string.
        :return: The drug string with correct separators/spacing.
        """
        return ', '.join(drug.split(','))

Basically, the method takes the drug (phenotype) string and replaces each "," with ", " (a comma followed by a space), if any are present. It fails in the original error message because drug isn't a string but a float, which suggests a problem with one of the entries in the ARG drug table file. Maybe one of the entries in the drug column is a float, or something went wrong when loading the table.

I do see some entries are None, so maybe there's something going on in different versions of Pandas when loading that up, so I'm going to add a bit of safety checking soon to hopefully prevent similar errors in the future.
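A minimal sketch of the failure and of one possible safety check, without pandas. The first function copies the body of the method quoted above. safe_drug_string is a hypothetical helper written for illustration; it is not the change that was merged into staramr.

```python
import math


def drug_string_to_correct_separators(drug):
    # Same body as the method quoted above: only works for strings.
    return ', '.join(drug.split(','))


def safe_drug_string(drug):
    """Hypothetical guarded variant: return None when the value is missing."""
    if drug is None or (isinstance(drug, float) and math.isnan(drug)):
        return None
    return ', '.join(str(drug).split(','))


# A float NaN (how a missing table entry can show up) reproduces the error.
try:
    drug_string_to_correct_separators(float("nan"))
except AttributeError as error:
    print(error)

print(safe_drug_string("ciprofloxacin,nalidixic acid"))
print(safe_drug_string(float("nan")))
```

Returning None for a missing entry is only one option; a caller could equally substitute a placeholder string.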

emarinier commented 1 year ago

It looks like the specific issue is that, by default in pandas<2, "None" (which appears in the ARG drug table) is loaded as a string object. However, by default in pandas>=2, "None" is loaded as a missing (NA) value, which pandas represents here as a float NaN, consistent with the 'float' in the error message.

This causes problems when trying to parse the strings in the function I mentioned previously, because one is a string and the other is not. I'm going to try to resolve the issue by always loading "None" entries as pd.NA values for this particular table.
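The loading behaviour can be made explicit rather than left to the pandas default. The sketch below uses a made-up two-row table with illustrative column names, not the real ARG drug table. It relies on the keep_default_na and na_values parameters of pandas.read_csv, so the result should not depend on which pandas version is installed.

```python
import io

import pandas as pd

# Hypothetical tab-separated table; the second row has "None" in the drug column.
table = "gene\tdrug\ngyrA\tciprofloxacin,nalidixic acid\nparE\tNone\n"

# Option 1: turn off default NA parsing so "None" stays a literal string.
as_text = pd.read_csv(io.StringIO(table), sep="\t", keep_default_na=False)

# Option 2: declare "None" as the only missing-value marker.
as_missing = pd.read_csv(
    io.StringIO(table), sep="\t", keep_default_na=False, na_values=["None"]
)

print(repr(as_text["drug"].iloc[1]))
print(pd.isna(as_missing["drug"].iloc[1]))
```

With the second option, downstream code needs to check for missing values (for example with pd.isna) before calling string methods on a drug entry.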

apetkau commented 1 year ago

Fixed in #194