rtviii / riboxyz

https://ribosome.xyz

Streamline access to `ncbi` taxonomy db. #61

Closed rtviii closed 5 months ago

rtviii commented 5 months ago

You put `check_same_thread=False` into ete3's `ncbiquery.py` once upon a time, and now every fifth process in the pool gets confused. The extent of the damage isn't clear (profiles seem to save correctly), but we gotta roll this back.

    # ncbiquery.py
    def _connect(self):
        self.db = sqlite3.connect(self.dbfile, check_same_thread=False)
    Traceback (most recent call last):
      File "/home/rtviii/dev/riboxyz/ribctl/etl/etl_pipeline.py", line 675, in process_structure
        protein_classifier.classify_chains()
      File "/home/rtviii/dev/riboxyz/ribctl/lib/libhmm.py", line 270, in classify_chains
        hmmscanner                             = HMMs(organism_taxid, self.candidate_classes, no_cache = True, max_seed_seqs = 5)
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/rtviii/dev/riboxyz/ribctl/lib/libhmm.py", line 175, in __init__
        loaded_results.append(future.result())
                              ^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
        return self.__get_result()
               ^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
        raise self._exception
      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/rtviii/dev/riboxyz/ribctl/lib/libhmm.py", line 163, in __load_seed_sequences
        return (candidate_class, [*fasta_phylogenetic_correction(candidate_class, tax_id, max_n_neighbors=max_seed_seqs)] )
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/rtviii/dev/riboxyz/ribctl/lib/libhmm.py", line 54, in fasta_phylogenetic_correction
        phylo_nbhd = phylogenetic_neighborhood(list(map(lambda x: str(x),ids)), str(organism_taxid), max_n_neighbors)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/rtviii/dev/riboxyz/ribctl/lib/libmsa.py", line 162, in phylogenetic_neighborhood
        tree               = ncbi.get_topology(list(set([*taxids_base, str(taxid_target)])))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/rtviii/dev/riboxyz/ribxzvenv/lib/python3.12/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 485, in get_topology
        self.annotate_tree(tree)
      File "/home/rtviii/dev/riboxyz/ribxzvenv/lib/python3.12/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 518, in annotate_tree
        tax2name = self.get_taxid_translator(taxids)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/rtviii/dev/riboxyz/ribxzvenv/lib/python3.12/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 270, in get_taxid_translator
        for tax, spname in result.fetchall():
            ^^^^^^^^^^^
    ValueError: not enough values to unpack (expected 2, got 0)
rtviii commented 5 months ago

Actually, let's see if we can figure out how to make the sqlite bit multithreaded/mmapped? https://www.sqlite.org/inmemorydb.html