repotrial / nedrexdb

A repository containing the code for building and hosting NeDRexDB
GNU General Public License v3.0

Biopython errors in parsing UniProt #11

Open james-skelton opened 2 years ago

james-skelton commented 2 years ago

Biopython is unable to parse the latest version of UniProt. This is a known issue and is due to a change in the feature (FT) lines of the UniProt file.

See issue #4021 from the biopython GitHub repository: https://github.com/biopython/biopython/issues/4021

james-skelton commented 2 years ago

The implications of this are that:

  1. We will need to revert to a previous version of UniProt for now (downloading now).
  2. Before we can integrate newer versions of UniProt, we will need to either wait for Biopython to be updated or change the UniProt format we integrate (e.g., switch to the XML version).
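
For reference, a rough sketch of what streaming the XML version might look like using only the standard library. This is not the NeDRexDB parser: `iter_entries` and `local_name` are hypothetical names, and the element names used (`entry`, `accession`, `name`, `organism`, and `dbReference` with `type="NCBI Taxonomy"`) reflect my understanding of the UniProt XML layout and should be checked against the actual schema before relying on them.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code does not hard-code the namespace URI."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(source) -> Iterator[Dict[str, object]]:
    """Stream entries from a UniProt-style XML file (path or binary file object)."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "entry":
            continue
        accessions = []
        entry_name = None
        taxid = None
        for child in elem:  # direct children of <entry> only
            tag = local_name(child.tag)
            if tag == "accession":
                accessions.append(child.text)
            elif tag == "name":
                entry_name = child.text
            elif tag == "organism":
                # Assumed layout: <organism><dbReference type="NCBI Taxonomy" id="..."/></organism>
                for ref in child:
                    if local_name(ref.tag) == "dbReference" and ref.get("type") == "NCBI Taxonomy":
                        taxid = ref.get("id")
        yield {"accessions": accessions, "name": entry_name, "taxid": taxid}
        root.clear()  # drop finished entries so memory stays flat on very large files
```

Because entries are discarded as soon as they are yielded, this should scale to files the size of TrEMBL, and filtering on `taxid == "9606"` would give the human subset directly.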

james-skelton commented 2 years ago

Previous versions of UniProt don't seem to have the separate files for humans, so I've been trying to generate these myself.
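
A minimal sketch of how the human-only files could be generated without going through Biopython at all, by streaming the flat file and keeping records whose `OX` line carries the human taxonomy ID. This is not necessarily the approach used here: `iter_records` and `filter_by_taxid` are hypothetical helper names, and the sketch assumes each record ends with a `//` line and has an `OX   NCBI_TaxID=...` line, as in the UniProt flat-file format.

```python
import re
from typing import Iterable, Iterator, List

# Matches the taxonomy line, e.g. "OX   NCBI_TaxID=9606;" (evidence tags may follow the ID).
OX_PATTERN = re.compile(r"^OX\s+NCBI_TaxID=(\d+)")


def iter_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each record as a list of lines, ending with its '//' terminator.

    A trailing record with no terminator (e.g. a truncated file) is not yielded.
    """
    record: List[str] = []
    for line in lines:
        record.append(line)
        if line.rstrip() == "//":
            yield record
            record = []


def filter_by_taxid(lines: Iterable[str], taxid: str = "9606") -> Iterator[str]:
    """Yield the lines of every record whose OX line matches `taxid` (9606 = human)."""
    for record in iter_records(lines):
        for line in record:
            match = OX_PATTERN.match(line)
            if match:
                if match.group(1) == taxid:
                    yield from record
                break  # only the first OX line of a record is considered
```

With a gzipped release file, the input could be `gzip.open(path, "rt")` and the output written with `writelines(filter_by_taxid(handle))`, so the full file never has to fit in memory.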

Unfortunately, parsing the TrEMBL file for the 2022_02 release of UniProt hits an unexpected EOF after ~133,000,000 records. I'm not sure why, and it's quite slow to debug. I'm considering downloading the UniProt XML file to use for the UniProt parser instead; however, this will involve rewriting the parsing logic, as the properties parsed and their names differ.
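
One way to narrow down the EOF without paying the cost of full parsing is to scan the file for record terminators and see whether the stream itself is cut short. The sketch below is only a diagnostic idea, not a confirmed explanation of the failure: `scan_flat_file` and `ScanResult` are hypothetical names, and it assumes the download is a gzip-compressed flat file in which every record ends with a `//` line.

```python
import gzip
import zlib
from typing import NamedTuple, Optional


class ScanResult(NamedTuple):
    complete_records: int   # number of '//' terminators seen
    ends_cleanly: bool      # True if the last non-empty line was '//' and no error occurred
    error: Optional[str]    # message if the compressed stream is truncated or corrupt


def scan_flat_file(path: str) -> ScanResult:
    """Count complete records in a gzipped flat file and report how the stream ends."""
    complete = 0
    last_line = ""
    error = None
    try:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                stripped = line.rstrip()
                if stripped:
                    last_line = stripped
                if stripped == "//":
                    complete += 1
    except (EOFError, OSError, zlib.error) as exc:
        # gzip raises EOFError when the compressed data stops before the end-of-stream marker.
        error = f"{type(exc).__name__}: {exc}"
    return ScanResult(complete, last_line == "//" and error is None, error)
```

If `error` is set, the download itself is probably incomplete or corrupt and re-fetching it (and comparing checksums, where available) would be the first thing to try. If the scan finishes cleanly, the problem is more likely in how a particular record is being parsed.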