openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

pyensembl index does not have the same effect as deleting all DB files and re-installing #82

Closed tavinathanson closed 7 years ago

tavinathanson commented 9 years ago

Namely, I got this when I tried to install release 75:

~/drive/work/repos/cancer/nejm $ pyensembl index --release 75
INFO:root:Cached file Homo_sapiens.GRCh37.75.gtf from URL ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Creating database: /Users/tavi/Library/Caches/ensembl/Homo_sapiens.GRCh37.75.db
Reading Dataframe from /Users/tavi/Library/Caches/ensembl/Homo_sapiens.GRCh37.75.gtf.expanded.csv
/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/pandas/io/parsers.py:1159: DtypeWarning: Columns (0,18) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
WARNING:root:Failed to create tables [nan, 'start_codon', 'Selenocysteine', 'UTR', 'exon', 'stop_codon', 'CDS', 'gene', 'transcript'] in database /Users/tavi/Library/Caches/ensembl/Homo_sapiens.GRCh37.75.db
Traceback (most recent call last):
  File "/Users/tavi/.virtualenvs/nejm/bin/pyensembl", line 9, in <module>
    load_entry_point('pyensembl==0.6.2', 'console_scripts', 'pyensembl')()
  File "/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/pyensembl/shell.py", line 57, in run
    ensembl.index()
  File "/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/pyensembl/ensembl_release.py", line 207, in index
    self.db.create(force=force)
  File "/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/pyensembl/database.py", line 540, in create
    self._create_database(force=force)
  File "/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/pyensembl/database.py", line 186, in _create_database
    version=DATABASE_SCHEMA_VERSION)
  File "/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/datacache/database_helpers.py", line 200, in db_from_dataframes
    version=version)
  File "/Users/tavi/.virtualenvs/nejm/lib/python2.7/site-packages/datacache/database_helpers.py", line 104, in _create_cached_db
    ", ".join(table_names))
TypeError: sequence item 0: expected string, float found

But when I deleted files and started over, it was fine.

index should basically be the same as starting over with the DB.
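
For reference, the `TypeError` at the bottom of that traceback is what `str.join` raises when the list contains a non-string, and the warning just above it shows a `nan` as the first table name. Below is a minimal sketch of that failure and one way to guard against it. It assumes the `nan` comes from a missing feature name in the parsed GTF data, which the thread does not confirm. The helper `joinable_table_names` is hypothetical and not part of datacache or pyensembl.

```python
import math


def joinable_table_names(names):
    """Drop missing (None/NaN) entries and coerce the rest to str.

    Hypothetical helper for illustration only.
    """
    cleaned = []
    for name in names:
        if name is None:
            continue
        if isinstance(name, float) and math.isnan(name):
            continue
        cleaned.append(str(name))
    return cleaned


# A shortened version of the list from the warning, with NaN in front.
table_names = [float("nan"), "start_codon", "exon", "gene", "transcript"]

join_error = None
try:
    ", ".join(table_names)
except TypeError as err:
    # str.join refuses non-string items, as in the traceback above.
    join_error = err

joined = ", ".join(joinable_table_names(table_names))
print(joined)
```

Filtering only hides the symptom. If the stale `.gtf.expanded.csv` really is the source of the `nan`, regenerating it (which is what deleting the files did) is likely the more direct fix.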

arahuja commented 9 years ago

Maybe a similar error:

ValueError                                Traceback (most recent call last)
<ipython-input-16-322f1ceac2e4> in <module>()
      2 
      3 ensembl = EnsemblRelease(75)
----> 4 EnsemblRelease(75).install()

/hpc/users/ahujaa01/anaconda/lib/python2.7/site-packages/pyensembl-0.6.7-py2.7.egg/pyensembl/ensembl_release.pyc in install(self)
    202         """
    203         self.download(force=False)
--> 204         self.index(force=False)
    205 
    206     def index(self, force=True):

/hpc/users/ahujaa01/anaconda/lib/python2.7/site-packages/pyensembl-0.6.7-py2.7.egg/pyensembl/ensembl_release.pyc in index(self, force)
    216         """
    217         self.db.create(force=force)
--> 218         self.transcript_sequences.index(force=force)
    219         self.protein_sequences.index(force=force)
    220 

/hpc/users/ahujaa01/anaconda/lib/python2.7/site-packages/pyensembl-0.6.7-py2.7.egg/pyensembl/sequence_data.pyc in index(self, force)
    200                 # below
    201                 try:
--> 202                     self._fasta_dictionary = self._create_or_open_fasta_db()
    203                     return
    204                 except ValueError as e:

/hpc/users/ahujaa01/anaconda/lib/python2.7/site-packages/pyensembl-0.6.7-py2.7.egg/pyensembl/sequence_data.pyc in _create_or_open_fasta_db(self)
    172             self.local_database_path,
    173             self.local_fasta_path,
--> 174             "fasta")
    175 
    176     def index(self, force=False):

/hpc/users/ahujaa01/anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.pyc in index_db(index_filename, filenames, format, alphabet, key_function)
    889     >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"]
    890     >>> def get_gi(name):
--> 891     ...     parts = name.split("|")
    892     ...     i = parts.index("gi")
    893     ...     assert i != -1

/hpc/users/ahujaa01/anaconda/lib/python2.7/site-packages/Bio/File.pyc in __init__(self, index_filename, filenames, proxy_factory, format, key_function, repr, max_open)
    487             self._length = int(count)
    488             if self._length == -1:
--> 489                 con.close()
    490                 raise ValueError("Unfinished/partial database")
    491             count, = con.execute(

ValueError: Index file has different filenames

@tavinathanson how/where can I delete all the relevant files?

iskandr commented 9 years ago

@arahuja You can nuke all the cached data by calling EnsemblRelease().clear_cache().

Did this error happen after a PyEnsembl upgrade?

I think the culprit is that I'm relying on BioPython to create the FASTA database (instead of going through datacache), though I'm not totally sure what it means by "Index file has different filenames".

arahuja commented 9 years ago

@iskandr Yeah, I seem to have ended up in a very messy environment by trying to switch between Python 3 and 2. I found the files in ~/.cache and removed them.
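
For anyone who wants to do the same manual cleanup from Python, here is a rough sketch using only the standard library. The function `remove_cached_files` and its parameters are hypothetical and not part of pyensembl. The cache directory also varies by platform and version, so take the path from your own log output rather than assuming one.

```python
from pathlib import Path


def remove_cached_files(cache_dir, name_prefix):
    """Delete regular files in cache_dir whose names start with name_prefix.

    Hypothetical helper for illustration only.
    Returns the sorted names of the files that were removed.
    """
    cache_dir = Path(cache_dir)
    removed = []
    if not cache_dir.is_dir():
        # Nothing to clean up if the directory does not exist.
        return removed
    for path in cache_dir.iterdir():
        if path.is_file() and path.name.startswith(name_prefix):
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Calling it with the directory from the first log in this thread and the prefix `"Homo_sapiens.GRCh37.75"` would remove the `.db` and `.gtf.expanded.csv` files for that release, along with any other file sharing that prefix, and leave other releases alone.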

iskandr commented 7 years ago

Has this come up in more recent versions of PyEnsembl? Seems like an old bug.

arahuja commented 7 years ago

I haven't seen it recently, but I also haven't had to re-index or re-install genome builds in a while.

iskandr commented 7 years ago

Since I have no idea how to recreate this issue and it might not happen any more, closing for now.