michaelkyu / PlasX

PlasX, a machine learning classifier for identifying plasmid sequences based on genetic architecture
GNU General Public License v3.0

some search_de_novo_families fail #4

Open smb20200615 opened 2 years ago

smb20200615 commented 2 years ago

Hi, some of my runs fail with this error:

Traceback (most recent call last):
  File "conda/envs/plasx/bin/plasx", line 8, in <module>
    sys.exit(run())
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/plasx_script.py", line 140, in run
    args.func(args)
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/plasx_script.py", line 38, in search
    annotate_de_novo_families(args.gene_calls,
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/mmseqs.py", line 1948, in annotate_de_novo_families
    hits = process_mmseqs_merge_search(mmseqs_source_db, target_db_dir, mmseqs_dir, ident_list,
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/mmseqs.py", line 1757, in process_mmseqs_merge_search
    hits = pd.concat([shallow_filter(utils.unpickle(search_results_pattern.format(ident=ident)).assign(cluster_identity=ident),
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/mmseqs.py", line 1757, in <listcomp>
    hits = pd.concat([shallow_filter(utils.unpickle(search_results_pattern.format(ident=ident)).assign(cluster_identity=ident),
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/mmseqs.py", line 1820, in shallow_filter
    hits['q_length'] = utils.int_loc(hits['qId'].values, q_len)
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/pd_utils.py", line 48, in int_loc
    assert np.all(isin_int(query, domain.index)) # Check that all of query is in the domain
  File "conda/envs/plasx/lib/python3.8/site-packages/plasx/pd_utils.py", line 63, in isin_int
    max_val = np.max(series)
  File "<__array_function__ internals>", line 5, in amax
  File "conda/envs/plasx/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "conda/envs/plasx/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

Command used: plasx search_de_novo_families -g genecall.txt -o denovoltxt --threads 1 --splits 32 --overwrite

Do you know what could be happening?
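
For readers following along: the last frames of the traceback show np.max being called on an array with no elements, which NumPy rejects because a maximum over nothing is undefined. The sketch below only reproduces that NumPy behaviour in isolation. The helper max_or_none is a hypothetical name used for illustration; it is not part of PlasX, and it is not necessarily how the bug was fixed there.

```python
import numpy as np


def max_or_none(values):
    """Return the maximum of `values`, or None when there are no elements.

    Hypothetical helper for illustration only; not PlasX code.
    """
    arr = np.asarray(values)
    if arr.size == 0:
        # Guard: np.max would raise ValueError on a zero-size array.
        return None
    return arr.max()


# Reproduce the error from the traceback with an empty integer array.
empty_ids = np.array([], dtype=np.int64)
try:
    np.max(empty_ids)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
print(max_or_none(empty_ids))
print(max_or_none([3, 7, 5]))
```

One plausible reading, stated as an assumption, is that one of the intermediate search results was empty for some runs, so the length lookup received no query IDs.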

michaelkyu commented 2 years ago

Hmm, I'm not sure what's happening. Would you mind sharing your input file genecall.txt (or a subset of it, e.g. with head -100 genecall.txt)? You can attach it here in a comment, or email me at mikeyu@ttic.edu.

smb20200615 commented 2 years ago

Thank you so much for the prompt reply!

gene_callers_id contig  start   stop    direction   partial call_type   source  version aa_sequence
0   c_000000000001  271 829 r   0   1   prodigal    v2.6.3  MGNTTYLKINSENDVDLQDILNDFINCFCKGYVEIKTKYKLLPIFKINFHKNNLPHLLGLHYTHKKVSAKKIIGRIAEGKITHESIKKHYEYSNIKDRLINYNFLHKCFIDKEIRLCVIVPKNSINPQKIDVAFIDDKNSQVMILGLRKSNNNDFYSPATMYVLGKNSSYRRMRRTHVISIEWKN
1   c_000000000001  1483    2938    r   0   1   prodigal    v2.6.3  MLMTKNQAEKWFDNSLGKQFNPDLFFGFQCYDYANMFFMLATGERLQGLYAYNIPFDNKARIEKYGQIIKNYDSFLPQKLDIVVFPSKYGGGAGHVEIVESANLNTFTSFGQNWNGKGWTNGVAQPGWGPETVTRHVHYYDDPMYFIRLNFPDKVSVGNKAKSVIKQATAKKQAVIKPKKIMLVAGHGYNDPGAVGNGTNERDFIRKYITPNIAKYLRHAGHEVALYGGSSQSQDMYQDTAYGVNVGNNKDYGLYWVKSQGYDIVLEIHLDAAGENASGGHVIISSQFNADTIDKSIQDVIKNNLGQIRGVTPRNDLLNVNVSAEININYRLSELGFITNKKDMDWIKKNYDLYSKLIAGAIHGKPIGGLVAGNAKTSAKNQKNPPVPVGYTLDKNNVPYKKEDGNYTVANVKGNNVRDGYSTNSRITGVLPNNATIKYDGAYCINGYRWITYIANSGQRRYIATGEVDKAGNRISSFGKFSTI
2   c_000000000001  2948    3251    r   0   1   prodigal    v2.6.3  MDAKVITRYIVLILALVNQFLANKGISPIPVDDETISSIILTVVALYTTYKDNPTSQEGKWANQKLKKYKAENKYRKATGQAPIKEVMTPTNMNDTNDLG
3   c_000000000001  3386    3686    r   0   1   prodigal    v2.6.3  MFGFTKRHEQDWRLTRLEENDKTMFEKFDRIEDSLRTQEKIYDKLDRNFEELRRDKEEDEKNKEKNAKNIRDIKMWILGLIGTILSTFVIALLKTIFGI
4   c_000000000001  3731    3896    r   0   1   prodigal    v2.6.3  MLKLISPTFEDIKTWYQLKEYSKEDIAWYVDMEVIDKEEYAIITGEKYPENLES
5   c_000000000001  3888    4278    r   0   1   prodigal    v2.6.3  MQILVNKRNEIISYAIIGGFEEGIDIENLPENFSQVFRPKAFKYSNGEIVFNEDYSEEKDDLHQQIDSEEQNTVASDDILRKMVASMQKQVVQSTKLSMQVNKQNALMAKQLVTLNKKLEEVKGETENA
6   c_000000000001  4277    5777    r   0   1   prodigal    v2.6.3  MDFTRRENYKLMSNLEKSVAINLENTAHYENISNLDITFRTGESDSSVLLFNIIKNNQPLLLSEENIKARIAIRGKGVMIVAPLEILDPFKGILKFQLPNDVIKRDGSYQAQVSVAELGNSDVVVVERTITFNVEKSLFSKVPSETKLHYIVEFQELEKTIMDRAKAMDEAIKNGEDYASLIEKAKEKGLSDIQIAKSSSIDELKQLANSRISDLENKAQAYSRTFDEQKRYMDEKHEAFKQSVNSGGLVTSGSTSNWQKAKITKDDGKIMQITGFDFNNPEQRIGDSTQFIYVSQAINYPRGASTNGTVEYLVVTSDYKRMTYRPNGTNKVFVKRKEVGSWSDWSELALNDYNTPFETVQNAQSKANTAESNAKLYTDDKFNKRYSVIFDGTANGVGSTLYLNESLDQFILLIFYGTFPGGDFTEFGNPFGGGKISLNPSNLPDNDGDGGGVYEFGLTKSSRTSLTISNDVYFDLGSRRGSGANANRGTINKIIGVRK
7   c_000000000001  5743    7654    r   0   1   prodigal    v2.6.3  MENLYLIKDLGALAGRDYRAKEIQNLQRIEQFALGLTTEFKLHQKAKTIQHFAEQIYYNGRSQAAVNKSLQSQINALVVAPRNNSANEIVQARVNVNGETFDTLKEHLDDWETKTQINKEETIRELNKTKQEILDIEYRFEPDKQEFLFVTELAPLTNAVMQSFWFDNRTGIVYMTQARNNGYMLSRLRPNGQFIDSSLIVGGGHGTHNGYRYIDDELWIYSFILNGNNENTLVRFKYTPNVEISYGKYGMQDVFTGHPEKPYITPVINEKENKILYRIERPRSQWELENSMNYIEIRSLDDVDKNIDKVLHKISIPMRLTNETQPMQGVTFDEKYLYWYTGDSNPNNRNYLTAFDLETGEEAYQVNADYGGTLDSFPGEFAEAEGLQIYYDKDSGKKALMLGVTVGGDGNRTHRIFMIGQRGILEILHSRGVPFIMSDTGGRVKPLPMKPDKLKNLGMLTEPGLYYLYTDHTVQIDDFPLPREWRDAGWFLEVKPPQTGGDVIQILTRNSYARNMMTFERVLSGRTGDISDWNYVPKNSGKWERVPSFITKMSDINIVGMSFYLTTDDTKRFTDFPTERKGVAGWNLYVEASNTGGFVHRLVRNSVTASCEILLKNYDSKTSSGPWTLHEGRIIS
8   c_000000000001  7669    7960    r   0   1   prodigal    v2.6.3  MATEEVKIKALLENDKQYFPATHWKAINGIPYAGSSDIDGLPQDGIISVDDKNKLDNLKIGEAGIIQNSIVQKSPNGKLWKITVDDSGKLGTVLFY
9   c_000000000001  7959    9543    r   0   1   prodigal    v2.6.3  MDYHDHLSVMDFNELICENLLDVDYGSFKEYYELNEARYITFTVYRTTHNSFVFDLLICENFIIYHGEKYTIKQTAPKVEGDKVFIEVTAYHIMYEFQNHSVESNKLDDDSSETGKTPEYSLDEYLRYGFANQKTSVKMTYKIIGDFKRKIPIDELGNKNGLEYCKEAVDLFGCIIYPNDTEICFYSPETFYQRSEKVIRYQYNTDTVSATVSTLELRTAIKVFGKKYTAEEKKNYNPIRTTDIKYSNGFIKEGTYRTATIGSKATINFDCKYGNETVRFTIKKGSQGGIYKLILDGKQIKQISCFAKSVQSETIDLIKNIDKGKHVLEMIFLGEDPKNRIDISSNKKAKPCMYVGTEKSTVLNLIADNSGRNQYKAIVDYVADSAKQFGIRYANTQTNEDIETQDKLLEFAKKQINDTPKTELDVNYIGYEKIEPRDSVFFVHELMGYNTELKVVKLDRSHPFVNAIDEVSFSNEIKDMVQIQQALNRRVIAQDNRYNYQANRINHLYTSTLNSPFETMDIGSVLI
10  c_000000000001  9551    10376   r   0   1   prodigal    v2.6.3  MQSFVKIIDGYKEEVITDFNQLIFLDARAESPNTNDNSVTINGVDGILPGAISFAPFSLVLRFGYDGIDVIDLNLFEHWFRSVFNRRHPYYVITSQMPGVKYAVNTANVTSNLKDGSSTEIEVSLNVYKGYSESVNWTDSEFLFDSNWMFENGIPLDFTPKYTHTSNQFTIWNGSTDTINPRFKHDLKILINLNGSGGFELVNYTTGDIFKYNKSIDKNTDFVLDGVYAYRDINRVGIDTNRGIITLAPGKNEFKIKGDVSDIKTTFKFPFIYR
11  c_000000000001  10375   14047   r   1   1   prodigal    v2.6.3  KNYLGSIGKSFKEKFSKDMKDGYKSLSDDDLLKVGVNKFKGFMQTMGTASKKASDTVKVLGKGVSKETEKALEKYVHYSEENNRIMEKVRLNSGQITEDKAKKLLKIEADLSNNLIAEIEKRNKKELEKTQELIDKYSAFDEQEKQNILTRTKEKNDLRIKKEQELNQKIKELKEKALSDGQISENERKEIEKLENQRRDITVKELSKTEKEQERILVRMQRNRNAYSIDEASKAIKEAEKARKARKKEVDKQYEDDVIAIKNNVNLSKSEKDKLLAIADQRHKDEVRKAKSKKDAVVDVVKKQNKDIDKEMDLSSGRVYKNTEKWWNGLKSWWSNFREDQKKKSDKYAKEQEETARRNRENIKKWFGNAWDGVKTKTGEAFSKMGRNANHFGGEMKKMWSGIKGIPSKLSSSWSSAKSSVGYHTKAIANSTGKWFGKAWQSVKSTTGSIYNQTKQKYSDASDKAWVHSKSIWRGTSKWFSNAYKSAKGWLTDMANKSRSKWDNISSTAWSNAKSVWKGTSKWFSNSYKSLKGWTGDMYSRAHDRFDAISSSAWSNAKSVFNGFRKWLSKTYDWIRDIGKDMGRAAADLGKNVANKAIGGLNSMIGGINKISKAITDKNLIKPIPTLSTGTLAGKGVATDNSGALTQPTFAVLNDRGSGNAPGGGVQEVIHRADGTFHAPQGRDVVVPLGVGDSVINANDTLKLQRMGVLPKFHGGTKKKKWMEQVTENLGKKAGDFGSKAKNTAHNIKKGAEEMVEAAGDKIKDGASWLGDKIGDVWDYVQHPGKLVNKVMSGLNINFGGGANATVKIAKGAYSLLKKKLVDKVKSWFEDFGGGGDGSYLFDHPIWQRFGSYTGGLNFNGGRHYGIDFGMPTGTNIYAVKGGIADKVWTDYGGGNSIQIKTGANEWNWYMHLSKQLVRQGQRIKAGQLIGKSGATGNFVRGAHLHFQLMQGSHPGNDTAKDPEKWLKSLKGSGVRSGSGVNKAASAWAGDIRRAAKRMGVNVTSGDVGNIISLIQHESGGNAGITQSSSLRDINVLQGNPAKGLLQYIPQTFRHYAVRGHNNIYSGYDQLLAFFNNRYWRSQFNPRGGWSPSGPRRYANGGLITKHQLAEVGEGDKQEMVIPLTRRKRAIQLTEQVMRIIGMDGKPNNITVNNDTSTVEKLLKQIVMLSDKGNKLTDALIQTVSSQENNLGSNDAIRGLEKILSKQSGHRANANNYMGGLTN

michaelkyu commented 2 years ago

The bug should be fixed now. Please update your repository with git pull and then reinstall with pip install /path/to/PlasX. Then, please rerun plasx search_de_novo_families... (you don't need to rerun plasx setup or other earlier steps).

This is the output I get. Is this what you see?

gene_callers_id contig  start   stop    direction       rev_compd       length  e_value accession
0       c_000000000001  271     829     r       True    558     0.0     mmseqs_5_34857857
3       c_000000000001  3386    3686    r       True    300     0.0     mmseqs_5_19291796
4       c_000000000001  3731    3896    r       True    165     1.289e-20       mmseqs_20_48600463
5       c_000000000001  3888    4278    r       True    390     2.193e-26       mmseqs_5_44369838
6       c_000000000001  4277    5777    r       True    1500    0.0     mmseqs_20_32489040
7       c_000000000001  5743    7654    r       True    1911    0.0     mmseqs_5_19647232
8       c_000000000001  7669    7960    r       True    291     0.0     mmseqs_5_19517056
9       c_000000000001  7959    9543    r       True    1584    0.0     mmseqs_5_20276345
9       c_000000000001  7959    9543    r       True    1584    0.0     mmseqs_5_34921539
10      c_000000000001  9551    10376   r       True    825     0.0     mmseqs_25_22345615
10      c_000000000001  9551    10376   r       True    825     0.0     mmseqs_20_22345615
10      c_000000000001  9551    10376   r       True    825     0.0     mmseqs_5_22651810
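
To answer "is this what you see?" without comparing rows by eye, a small script can diff the two tables. This is a rough sketch that assumes the output file is tab-separated with the header shown above; read_hits is a hypothetical helper, and the "observed" table is made-up stand-in data, not a real run.

```python
import csv
import io


def read_hits(handle):
    """Return the set of (gene_callers_id, accession) pairs in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(int(row["gene_callers_id"]), row["accession"]) for row in reader}


# Header and two rows copied from the table above (tab-separated by assumption).
lines = [
    "gene_callers_id\tcontig\tstart\tstop\tdirection\trev_compd\tlength\te_value\taccession",
    "0\tc_000000000001\t271\t829\tr\tTrue\t558\t0.0\tmmseqs_5_34857857",
    "3\tc_000000000001\t3386\t3686\tr\tTrue\t300\t0.0\tmmseqs_5_19291796",
]
expected = read_hits(io.StringIO("\n".join(lines) + "\n"))

# Stand-in for a local result that lacks the second row (illustrative only).
observed = read_hits(io.StringIO("\n".join(lines[:2]) + "\n"))

missing = expected - observed  # pairs in the reference table but not in yours
extra = observed - expected    # pairs in yours but not in the reference table

print("missing:", sorted(missing))
print("extra:", sorted(extra))
```

With real files, io.StringIO would be replaced by open(path, newline="") on each output file.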

smb20200615 commented 2 years ago

Thank you so much. Do I need to rerun everything that ran successfully too?

michaelkyu commented 2 years ago

No problem, and thanks for bringing my attention to this bug!

You don't need to rerun the earlier commands that were successful. You can directly run and continue from plasx search_de_novo_families...