phac-nml / staramr

Scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases.
Apache License 2.0

add more species to pointfinder analysis list #144

Closed pimarin closed 2 years ago

pimarin commented 2 years ago

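Hi, I would like to increase the number of available species used by staramr from the pointfinder DB. I simply modified the list of species in staramr/blast/pointfinder/PointfinderBlastDatabase.py and compared the output with the pointfinder webservice; the results are identical for all tested species. Is there more to do as a first step to extend the analysis? Then, I would like to implement the automatic detection of new species in the database when it is updated, with a test for the available file format, and to add an option to build a database from raw files which could be translated into the pointfinder format to be analyzed.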

    @classmethod
    def get_available_organisms(cls):
        """
        A Class Method to get a list of organisms that are currently supported by staramr.
        :return: The list of organisms currently supported by staramr.
        """
        return ['campylobacter', 'enterococcus_faecalis', 'enterococcus_faecium', 'escherichia_coli',
                'helicobacter_pylori', 'klebsiella', 'mycobacterium_tuberculosis', 'neisseria_gonorrhoeae',
                'plasmodium_falciparum', 'staphylococcus_aureus', 'salmonella']

apetkau commented 2 years ago

Hello @pimarin ,

Thanks so much for this PR. I really appreciate it 😄

Which dataset did you use to test out on the resfinder/pointfinder web service? Is it something anybody can download?

I describe a bit about why I hadn't added support for other species in pointfinder here https://github.com/phac-nml/galaxy_tools/issues/218#issuecomment-1099261708

In general, though, it's because there were some mutations in promoter regions (with negative coordinates) and deletions, which I had never explicitly added support for in staramr (though I have always intended to): https://bitbucket.org/genomicepidemiology/pointfinder_db/src/8706a6363bb29e47e0e398c53043b037c24b99a7/e.coli/resistens-overview.txt#lines-63:68

I'm not sure if the test dataset you used would include mutations in these regions, which is why I am wondering where it came from.

Implement the automatic detection of new species in the database when updated, with test for available file format

This is a great idea :). Do you have a particular method in mind? I had a small issue about this (https://github.com/phac-nml/staramr/issues/84), and thought of trying to just re-use the results of the mlst software (which auto-detects an mlst scheme, which often corresponds to an organism). But there might be better ways than this.

Add an option to build a database from raw files which could be translated into the pointfinder format to be analyzed

Yes, this is another great idea. Another option would be to do a bit of refactoring/abstractions to provide support for these raw files to be directly loaded up in staramr (instead of converting them to the pointfinder directory structure). This could possibly be done by making abstractions of the classes in here https://github.com/phac-nml/staramr/tree/master/staramr/blast/pointfinder

apetkau commented 2 years ago

I apologize @pimarin since I think I misunderstood your first suggestion. You were referring to detecting when new species are added to the PointFinder database, whereas I thought you meant automated detection of which organism a particular genome belongs to, so you no longer have to set --pointfinder-organism when running staramr. I think this suggestion is good as well.
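As a rough illustration of detecting newly added species, here is a minimal sketch. It assumes that each organism lives in its own subdirectory of the PointFinder database and that the subdirectory contains a resistens-overview.txt file (as in the e.coli link above). The function names `find_database_organisms` and `find_new_organisms` are hypothetical and are not part of staramr.

```python
from pathlib import Path

# Assumption: every organism directory holds this file, as seen in the
# e.coli/resistens-overview.txt link earlier in the thread.
REQUIRED_FILE = "resistens-overview.txt"


def find_database_organisms(database_dir):
    """Return the sorted names of subdirectories that look like PointFinder organisms."""
    root = Path(database_dir)
    return sorted(
        entry.name
        for entry in root.iterdir()
        # The file check doubles as a crude "expected file format" test.
        if entry.is_dir() and (entry / REQUIRED_FILE).is_file()
    )


def find_new_organisms(database_dir, known_organisms):
    """Return organisms present in the database but missing from the known list."""
    known = set(known_organisms)
    return [name for name in find_database_organisms(database_dir) if name not in known]
```

Under these assumptions, the known list could be the one returned by get_available_organisms(), so anything returned by `find_new_organisms` would be a species that has appeared in the database but has not yet been validated.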

However, I think I have a better solution. I was mostly using the list returned by get_available_organisms() to make sure that organisms/species from the PointFinder database aren't selected until I have validated that they work in staramr. However, maybe this is a bit too strict. I am thinking of switching this over so that you can pass any acceptable value to --pointfinder-organism that exists in the PointFinder database, but if it's not in the get_available_organisms() list, you will get a warning that the results produced by staramr for this PointFinder organism haven't been validated.

I think this is a better solution as it will let people run staramr with any new PointFinder organisms that are available (but still provide some feedback about which organisms have been validated). I have made an issue for this: #147
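The warn-instead-of-reject behaviour described above could look roughly like the following sketch. The function name `check_pointfinder_organism` and its parameters are hypothetical, chosen only for illustration. It is not how staramr implements this, and the validated list is assumed to come from something like get_available_organisms().

```python
import logging

logger = logging.getLogger(__name__)


def check_pointfinder_organism(organism, database_organisms, validated_organisms):
    """
    Accept any organism found in the PointFinder database, warning if it is unvalidated.
    :param organism: The value passed to --pointfinder-organism.
    :param database_organisms: Organisms that exist in the PointFinder database.
    :param validated_organisms: Organisms whose results have been validated.
    :return: True if the organism is validated, False if accepted with a warning.
    """
    if organism not in database_organisms:
        # Still a hard error: the database has nothing to run against.
        raise ValueError(
            "Organism '{}' does not exist in the PointFinder database".format(organism)
        )

    if organism not in validated_organisms:
        logger.warning(
            "Results for PointFinder organism '%s' have not been validated", organism
        )
        return False

    return True
```

Returning a boolean rather than raising for unvalidated organisms keeps the run going while still letting the caller record that the results are unvalidated.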

I do still plan to implement support for indels in the future, since their absence is what keeps staramr from fully supporting other pointfinder organisms.

I hope this will still work for you. I am going to close this PR.