phac-nml / staramr

Scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases.
Apache License 2.0

AttributeError: 'float' object has no attribute 'split' #175

Closed vappiah closed 10 months ago

vappiah commented 1 year ago

Dear Developers,

I installed staramr (0.9.1) on an Ubuntu 20.04 system using mamba. When I try to run staramr (I am following the tutorial on your GitHub page), I get the error message below. Please advise.

    No --plasmidfinder-database-type specified. Will search the entire PlasmidFinder database
    2023-04-24 21:56:59 INFO: --output-dir set. All files will be output to [output]
    2023-04-24 21:56:59 INFO: Will exclude ResFinder/PointFinder genes listed in [/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/databases/exclude/data/genes_to_exclude.tsv]. Use --no-exclude-genes to disable
    2023-04-24 21:56:59 INFO: Making BLAST databases for input files
    2023-04-24 21:57:00 INFO: Scheduling blasts and MLST for isolate1.fasta
    2023-04-24 21:57:00 INFO: Scheduling blasts and MLST for isolate2.fasta
    2023-04-24 21:57:20 ERROR: 'float' object has no attribute 'split'
    Traceback (most recent call last):
      File "/home/bioinfocoach/apps/mamba/envs/staramr/bin/staramr", line 68, in
        args.run_command(args)
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py", line 467, in run
        results = self._generate_results(database_repos=database_repos,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py", line 287, in _generate_results
        amr_detection.run_amr_detection(files,pid_threshold, plength_threshold_resfinder,
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/detection/AMRDetection.py", line 194, in run_amr_detection
        self._pointfinder_dataframe = self._create_pointfinder_dataframe(pointfinder_blast_map, pid_threshold,
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/detection/AMRDetectionResistance.py", line 56, in _create_pointfinder_dataframe
        return pointfinder_parser.parse_results()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/BlastResultsParser.py", line 67, in parse_results
        self._handle_blast_hit(file, database_name, blast_out, results, hit_seq_records)
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/BlastResultsParser.py", line 109, in _handle_blast_hit
        blast_results = self._get_result_rows(hit, database_name)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/pointfinder/BlastResultsParserPointfinder.py", line 98, in _get_result_rows
        results.append(self._get_result(hit, db_mutation))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/blast/results/pointfinder/BlastResultsParserPointfinderResistance.py", line 55, in _get_result
        drug = self._arg_drug_table.get_drug(self._blast_database.get_organism(), hit.get_amr_gene_id(),
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/databases/resistance/pointfinder/ARGDrugTablePointfinder.py", line 40, in get_drug
        return self._drug_string_to_correct_separators(drug.iloc[0])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/databases/resistance/ARGDrugTable.py", line 44, in _drug_string_to_correct_separators
        return ', '.join(drug.split(','))
                         ^^^^^^^^^^
    AttributeError: 'float' object has no attribute 'split'

vappiah commented 1 year ago

I realized the problem was pandas: version 2.0.1 was causing the error. When I downgraded to 1.5.3, it worked, with only a warning message displayed:

    /home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py:544: FutureWarning: DataFrame.set_axis 'inplace' keyword is deprecated and will be removed in a future version. Use obj = obj.set_axis(..., copy=False) instead
      settings_dataframe.set_axis(['Value'], axis='columns', inplace=True)
    /home/bioinfocoach/apps/mamba/envs/staramr/lib/python3.11/site-packages/staramr/subcommand/Search.py:200: FutureWarning: save is not part of the public API, usage can give unexpected results and will be removed in a future version
      writer.save()

apetkau commented 1 year ago

Thanks for reporting this issue. We will have to fix this for our next release to make sure it's compatible with pandas >= 2.

emarinier commented 10 months ago

I'm not convinced that this particular error is directly related to the version of pandas. The original error is raised in a method of ARGDrugTable:

    def _drug_string_to_correct_separators(self, drug):
        """
        Converts a drug string (separated by commas) to use correct separators/spacing.
        :param drug: The drug string.
        :return: The drug string with correct separators/spacing.
        """
        return ', '.join(drug.split(','))

Basically, the method takes the drug (phenotype) string and replaces each "," with ", " (a comma followed by a space), if any are present. It fails in the original error because drug isn't a string but a float, which suggests a problem with one of the entries in the ARG drug table file. Maybe one of the entries in the drug column is a float, or something went wrong when loading the table.

I do see some entries are None, so maybe there's something going on in different versions of Pandas when loading that up, so I'm going to add a bit of safety checking soon to hopefully prevent similar errors in the future.
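As a rough illustration of what such a safety check could look like (a sketch only, not the change that was actually made in staramr), the function below keeps the original comma handling but returns None for missing values instead of calling split on them. The standalone function name and the choice to return None are assumptions made for this example.

```python
import math


def drug_string_to_correct_separators(drug):
    """
    Hypothetical guarded version of the method quoted above.
    Returns None when the drug entry is missing instead of raising.
    """
    # Missing entries may arrive as None or as a float NaN (assumption).
    if drug is None:
        return None
    if isinstance(drug, float) and math.isnan(drug):
        return None
    # Anything else that is not a string is unexpected, so fail loudly.
    if not isinstance(drug, str):
        raise TypeError(f"Expected a drug string, got {type(drug).__name__}")
    # Same transformation as the original method.
    return ', '.join(drug.split(','))


print(drug_string_to_correct_separators("ciprofloxacin,nalidixic acid"))
print(drug_string_to_correct_separators(float("nan")))
```

Callers would then need to decide how a None result should appear in the final report.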

emarinier commented 10 months ago

It looks like the specific issue with this is that by default in pandas<2, "None" (which appears in the ARG drug table) is loaded as a string. However, by default in pandas>=2, "None" is treated as a missing value and loaded as NaN, which is a float. That matches the 'float' object in the original traceback.

This causes problems when trying to parse the strings in the function I mentioned previously, because one is a string and the other is not. I'm going to try to resolve the issue by always loading "None" entries as pd.NA values for this particular table.
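The loading behaviour can be made explicit, so that it no longer depends on the pandas default, using the na_values and keep_default_na parameters of read_csv. The sketch below is only an illustration: the two-row table, its column names, and the helper function names are made up for the example and are not taken from the real ARG drug table or from the staramr code.

```python
import io

import pandas as pd

# Made-up, tab-separated stand-in for the ARG drug table (illustrative only).
TABLE = "gene\tdrug\ngeneA\tciprofloxacin,nalidixic acid\ngeneB\tNone\n"


def load_none_as_missing(text):
    # Explicitly add "None" to the NA markers, whatever the pandas default is.
    return pd.read_csv(io.StringIO(text), sep="\t", na_values=["None"])


def load_none_as_text(text):
    # Disable the default NA markers so that "None" stays an ordinary string.
    return pd.read_csv(io.StringIO(text), sep="\t", keep_default_na=False)


missing = load_none_as_missing(TABLE)
as_text = load_none_as_text(TABLE)

print(pd.isna(missing["drug"].iloc[1]))
print(repr(as_text["drug"].iloc[1]))
```

With the first loader, any code that consumes the drug column has to check for missing values (for example with pd.isna) before calling string methods on an entry.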

apetkau commented 10 months ago

Fixed in #194