rvhonorato / cazy-parser

A way to extract specific information from CAZy
GNU General Public License v3.0

`create_cazy_db` fails #4

Closed lonsbio closed 5 years ago

lonsbio commented 7 years ago

Unable to create the database on Python 2.7.13. Output (excluding the BeautifulSoup warning) is as follows:

>> Gathering species codes for species with full genomes
>> Glycoside-Hydrolases
>> 145 families found on http://www.cazy.org/Glycoside-Hydrolases.html
> GH1

followed by this error:

first_page_idx = int(page_index_list[0]['href'].split('PRINC=')[-1].split('#')[0]) # be careful with this
ValueError: invalid literal for int() with base 10: 'GH1_archaea.html?debut_TAXO=100'

Has the pagination code changed for the expression to fail?

rvhonorato commented 7 years ago

Yes, it looks like the pagination changed a bit. I did a quick fix using regular expressions (#5) and it should work fine now. Thanks for opening this issue.
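
For readers following along, here is a minimal sketch of what a regex-based extraction of the page offset could look like. It is not the actual patch from #5. The function name `extract_page_offset` is hypothetical, and the assumption that the offset follows either `PRINC=` or `debut_TAXO=` is inferred only from the traceback above.

```python
import re

# Assumption: the offset appears after either 'PRINC=' (the parameter the
# original split() expected) or 'debut_TAXO=' (the one seen in the traceback).
OFFSET_RE = re.compile(r'(?:PRINC|debut_TAXO)=(\d+)')


def extract_page_offset(href):
    """Return the integer pagination offset found in href, or None if absent."""
    match = OFFSET_RE.search(href)
    if match is None:
        return None
    return int(match.group(1))


if __name__ == '__main__':
    # The href that made the original int(...) call fail.
    print(extract_page_offset('GH1_archaea.html?debut_TAXO=100'))
```

Returning `None` instead of raising lets the caller decide what to do with a family page that has no pagination links.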

lonsbio commented 7 years ago

Thanks! I tried my own patch overnight (not as elegant) and it seemed to work too.

Also, I'm not sure if this is a recent issue or incidental. My DB download file seems to have newlines surrounding the organism field:

domain  protein_name    family  tag organism_code   ec  genbank uniprot subfamily   organism    pdb
     Ahos_0285  GH1     invalid     AEE93176.1          
Acidianus hospitalis W1
      

Fixing it doesn't seem to affect the extract script, but it does make the csv (tsv) file readable. Is the wrapping intentional?
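
In case it helps, this is roughly the kind of cleanup I mean: a minimal sketch that collapses stray whitespace and newlines in each scraped field before the row is written. The helper names `clean_field` and `format_row` are hypothetical and are not taken from the cazy-parser code.

```python
def clean_field(text):
    """Collapse runs of whitespace (including newlines) and trim both ends."""
    return ' '.join(text.split())


def format_row(fields):
    """Join cleaned fields into a single tab-separated line."""
    return '\t'.join(clean_field(f) for f in fields)


if __name__ == '__main__':
    # The organism value as it appears in the downloaded file,
    # wrapped in newlines and trailing spaces.
    raw_organism = '\nAcidianus hospitalis W1\n      '
    print(repr(clean_field(raw_organism)))
    print(format_row(['Ahos_0285', 'GH1', raw_organism]))
```

Applying this when the row is built keeps each record on one physical line, so the file opens cleanly in spreadsheet tools.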