mdmparis / defense-finder

Systematic search of all known anti-phage systems.
GNU General Public License v3.0
75 stars 13 forks source link

[BUG] miscalculation of alignment coverage #68

Open jpjarnoux opened 2 months ago

jpjarnoux commented 2 months ago

Describe the bug The value of alignment coverage is miscalculated.

To Reproduce

lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145 [locus_tag=QDJ88_RS00725] [protein=hypothetical protein] [protein_id=WP_278902228.1] [location=166201..167580] [gbkey=CDS] MNNHPKFEEYKNFIVNNEAYKGLPYVRKKDCSVVWVATKQSEIGKKRIEWALNKAEKFDISKNDSSFYRKVMFIVHPTKE KVCQICGESMSLDYIYPNKSFQKFLSKNYDYEPNVFDDIYDVSNHFKEEGKSNLEIIKILNKSIKLKGNFNDLTIYIKSV IEACRNGEKKSLGPGAMSNFPDRYDGFHSYNRCCRQEFDKGRNPDNMKTYSKDRRAYENWSDGNIHAANKFMSSKYFTGL SADHIGPISLGFIHDPRFMQPMSSGDNSSKRDRLEKSDIKKLIELESRYNLLPVSWFSEKIWKTLKADFLIDNNINLEYY RNLMKININNYMSLLYIIITETGDLGKKFLEENFLKTKQEYFKYDYEFDEFGNIISKTFRNISDATKKEYSRSIRISFES VIDFSEKENRNIKSSFNTNIAKNIDSLVYEINNCAPNKSIVNHFYKIVSAMQDDLLKEA lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 [gene=dcm] [locus_tag=QDJ88_RS00730] [protein=DNA (cytosine-5-)-methyltransferase] [protein_id=WP_278902230.1] [location=complement(167588..170410)] [gbkey=CDS] MELYNDLTKNTESYLKTLAMDIKRQQGIYYTDLELTDRIMDKVLEKISAEEINDINFLEPCVGSGNFVISYIKNAALKFK LNSNEIRKLIKNIYVCDSDKSAIDLFIRNLKYFIKKEYKIDLTNDDLPNIGGSLIYNLQLREDNYIGIDKYFGDIKFDVI VTNPPYKGLRAERRHYKTDDEYFNGRNYYNKIKEKSKHLHPLSGALNANIYRYFVEEIYTKYSKPDSIISILVPSSILTD KSCYDLRNKIIDFNITDIIQIPENNKYIDASQAMTNIITDSNHINSSVSIEKLNNGRIENYLVPKKNIVDEAHNNSILIM SSEDYKLLNKLKSISKVKDHKFIVNMRGELDITNNKDSITSKKTEYPLIRGRDIDRYCEVKYEKIKDYATKEFVNNSSKQ RYVIKNRLACQQIVNMNKKTRIGFTLIKENTVLANSCNFIFIEDNDYGIDEYYALAILNSKYCDWYFKIFSSNNHVNNYE IDLLPFPIGNSVQIQKISSLAKEQVLEFSNLRDNQINKIVTEIIDNFFGVENKINLNSTQIDELANKGLKEKNKVLSKKG ILNDKQYTLSELDLEIIKSVPQGGNWKNIPDKTIEKSKRLMKIRETGGRTTLYGRIDYTNPSYTITTYFNRPGNGCYIHP TQDRVLTTREGARIQCFPDDYYFYGNQRDILNQIGNAVPPLMGYLIAKKIKENLNVKKSLDLFSGAGGLLYGFKMAGVEH VLANDIDRSACVTLKINNPEVNVLCDDVTNDYTKEIIIDTAIQNNVDIICGGPPCQGFSLAGFRKSDDPRNKLVLDFADI IKNVEPKVFVFENVVGLLSYNKGETFNEIKKMFLALGYKLHAETLDFSDYGVPQRRRRVIIIGVRNNINIEPSKLFPDKI TKNKKISVMETIGDLDTNINSCNMNSKFISLMKNEISYDYYLDSIKENCENEIGEQLSIF

Steps to reproduce the behavior:

  1. defense-finder run -o . -w 1 input_file

  2. The output is lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 143 459 RM_Type_IIType_II_REases 2.400e-11 40.200 0.5380.401 243 426 lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 143 459 RM_Type_IIType_II_REases 1.100e-16 57.700 0.4690.344 78 235 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 DISARM_2__drmMII 8.700e-58 192.700 0.563 0.230699 914 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 Druantia_IIDruM 2.800e-57 191.200 0.525 0.210699 895 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 1.700e-05 20.700 0.4100.169 541 699 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 1.400e-07 27.000 0.5490.251 13 248 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 6.200e-08 28.100 1.0460.422 11 407 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 2.900e-15 52.800 0.5050.186 697 871 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 5.600e-26 87.900 0.5490.196 700 883 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 7.700e-28 94.100 0.5000.187 699 874 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 4.500e-46 154.100 0.5380.227 698 910 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 2.900e-48 161.300 0.5980.229 699 913 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_IIType_II_MTases 2.900e-52 174.700 0.9210.647 90 697 lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 2.000e-61 204.900 0.6090.231 698 914

  3. The file I got is attached / contains ...

  4. I got the following error : ....

Expected behavior

  1. When we look at hit_profile_cov and hit_seq_cov, we have a very high and low value, so I don't know if it's express between 0 and 1 or 0 and 100.
  2. When I calculate (hit_end_match - hit_begin_match)/hit_sequence_length I did not find either hit_profile_cov or hit_seq_cov.

Maybe I musunderstand something.

Screenshots

If applicable, add screenshots to help explain your problem.

Please complete the following information :

OS:

Version of your the defenseFinder software and Models

mdmparis-defense-finder==1.2.2 macsyfinder==2.1.1 models 1.2.4

jpjarnoux commented 1 month ago

Hi ! Maybe I figured it out the problem. In MSF there is one file for one HMM profile, in DF there are multiple profile in the same file. When MSF/DF look at the profile coverage it use the first length of the first profile. So if the first profile length is short all the assignment of annotation are false. This issue is a big problem because it affect most of profile and systems like RM.

jpjarnoux commented 1 month ago

I suspect that the problem can also affect the GA cutoff, but I'm not sure about this one.