nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Issues downloading databases (Pfam and interpro) #720

Open ferninfm opened 2 years ago

ferninfm commented 2 years ago

Hi,

Sorry that I do not have time to do this properly, think it through, and push useful changes to the code. There are some issues in setupDB.py that prevent it from working properly when downloading the Pfam and InterPro databases. Because I was in a hurry, I used a temporary workaround by editing setupDB.py. The workaround is not generally useful, but even after quite some time I did not manage to solve the problems properly.

A useful edit:

1) Change line 470 from for x in elem.getchildren(): (getchildren() was deprecated and is removed in Python 3.9) to for x in list(elem): https://github.com/nextgenusfs/funannotate/blob/6d098c265e83b63c42918b486b9324ba4beb3b87/funannotate/setupDB.py#L470
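A minimal sketch of that replacement on a toy document. The attribute names (dbname, entry_count, version, file_date) come from the function quoted further down; the sample XML, the dbinfo element name, and the helper name release_info are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Toy XML; the real interpro.xml is much larger and its layout is assumed here.
SAMPLE = """<interprodb>
  <release>
    <dbinfo dbname="INTERPRO" entry_count="3" version="1.0" file_date="10-MAR-22"/>
    <dbinfo dbname="PFAM" entry_count="2" version="2.0" file_date="01-JAN-22"/>
  </release>
</interprodb>"""


def release_info(xml_text, dbname="INTERPRO"):
    """Return the attribute dict of the matching child of <release>, or None."""
    root = ET.fromstring(xml_text)
    for release in root.iter("release"):
        # list(release) replaces release.getchildren(), which no longer
        # exists in Python 3.9; iterating the element directly also works.
        for child in list(release):
            if child.attrib.get("dbname") == dbname:
                return child.attrib
    return None


info = release_info(SAMPLE)
print(info["version"], info["file_date"])
```

Using attrib.get instead of attrib['dbname'] avoids a KeyError if a child element lacks the attribute.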

Some problems I could not solve properly:

2) The system used to download the databases does not work with ftp://ftp.server/file URLs. The address is consistently changed to http://ftp.server/file no matter what is written in DBURL in resources.py. It must have to do with DBURL.get and not wget. I worked around it by hard-coding the URL string.

3) The date format in the InterPro data file is different this time around: the month is capitalized. So it is not %d-%b-%Y (10-Mar-22) but %d-%^b-%Y (10-MAR-22). I tried changing both instances of %b to %^b, but this did not work.

https://github.com/nextgenusfs/funannotate/blob/6d098c265e83b63c42918b486b9324ba4beb3b87/funannotate/setupDB.py#L477 https://github.com/nextgenusfs/funannotate/blob/6d098c265e83b63c42918b486b9324ba4beb3b87/funannotate/setupDB.py#L480

As before, I edited setupDB.py to hard-code a string value (10-Mar-22), which did the job. Now I can annotate.
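For the date problem in item 3, a more tolerant parser might avoid the hard-coded value. Python's strptime does not accept the glibc-style %^b flag, it matches %b case-insensitively, and %Y requires a four-digit year, so a value like 10-MAR-22 needs %y instead. The sketch below tries several formats in turn. The helper name parse_release_date and the list of candidate formats are assumptions, not funannotate code, and month-name matching assumes an English locale.

```python
import datetime

# Candidate formats, tried in order (an assumed list, extend as needed).
CANDIDATE_FORMATS = ("%d-%b-%Y", "%d-%b-%y")


def parse_release_date(raw, formats=CANDIDATE_FORMATS):
    """Return raw as an ISO date string (YYYY-MM-DD), trying each format in turn."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError("unrecognized release date: {!r}".format(raw))


print(parse_release_date("10-MAR-22"))
print(parse_release_date("10-Mar-2022"))
```

Raising when no format matches keeps an unexpected date from being silently stored in the database info.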

Cheers, Thanks for everything

def interproDB(info, force=False, args={}):
    iprXML = os.path.join(FUNDB, 'interpro.xml')
    iprTSV = os.path.join(FUNDB, 'interpro.tsv')
    if os.path.isfile(iprXML) and args.update and not force:
        if check4newDB('interpro', info):
            force = True
    if not os.path.isfile(iprXML) or force:
        lib.log.info('Downloading InterProScan Mapping file')
        for x in [iprXML, iprTSV, iprXML+'.gz']:
            if os.path.isfile(x):
                os.remove(x)
        if args.wget:
            wget('ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz', iprXML+'.gz')
            wget('ftp://ftp.ebi.ac.uk/pub/databases/interpro/entry.list', iprTSV)
        else:
            download('ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz', iprXML+'.gz')
            download('ftp://ftp.ebi.ac.uk/pub/databases/interpro/entry.list', iprTSV)
        md5 = calcmd5(iprXML+'.gz')
        subprocess.call(['gunzip', '-f', 'interpro.xml.gz'],
                        cwd=os.path.join(FUNDB))
        num_records = ''
        version = ''
        iprdate = ''
        for event, elem in cElementTree.iterparse(iprXML):
            if elem.tag == 'release':
                for x in list(elem):
                    if x.attrib['dbname'] == 'INTERPRO':
                        num_records = int(x.attrib['entry_count'])
                        version = x.attrib['version']
                        iprdate = x.attrib['file_date']
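        try:
            # Workaround: hard-coded date string. This assignment cannot raise
            # ValueError, so the except branch below never runs.
            iprdate = "10-Mar-22"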
        except ValueError:
            iprdate = datetime.datetime.strptime(
                iprdate, "%d-%b-%Y").strftime("%Y-%m-%d")
        info['interpro'] = ('xml', iprXML, version, iprdate, num_records, md5)
    type, name, version, date, records, checksum = info.get('interpro')
    lib.log.info('InterProScan XML: version={:} date={:} records={:,}'.format(
        version, date, records))

xvazquezc commented 2 years ago

The installation instructions specify Python >=3.6,<3.9, so 3.9 is not supported. I got the InterPro database from the same date, and it downloaded without issues just a few weeks ago.

ferninfm commented 2 years ago

1) I also thought InterPro was being upgraded, but it wasn't. There is no error checking around subprocess.call. The download is slow compared to running wget in the terminal and finishes with a broken file. Then gunzip does not happen, but the old XML remains. As long as the old interpro.xml is present, the update process continues. It is only a problem when using -force: it simply does not update.
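One way to make a broken download visible instead of silently keeping the old XML is to decompress in Python and only replace the target file on success. This is a sketch under that assumption, not the way setupDB.py works; the helper name gunzip_checked is hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_checked(gz_path, out_path):
    """Decompress gz_path to out_path; leave out_path untouched on failure."""
    out_dir = os.path.dirname(out_path) or "."
    tmp_fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    try:
        with os.fdopen(tmp_fd, "wb") as dst, gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, dst)
    except Exception:
        # A truncated or corrupt archive raises here; drop the partial
        # output and re-raise so the caller sees the failure.
        os.remove(tmp_path)
        raise
    # Swap in the new file only after the whole archive was read.
    os.replace(tmp_path, out_path)
    return out_path
```

With the shell approach, subprocess.check_call(['gunzip', '-f', ...]) instead of subprocess.call would at least raise CalledProcessError on a non-zero exit status.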

The GitHub version of funannotate has compatibility issues with Python 3.8, which is why I use a fresh conda environment with Python 3.9.

Irrespective of this, why does .get change the specified string from ftp:// to http://? That seems undesirable. I modified the resources file to assess the behaviour.
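On the .get question: a dictionary lookup returns the stored string unchanged, so any ftp:// to http:// rewrite would have to happen somewhere after the lookup. A quick check along these lines can confirm where the scheme changes. The DBURL mapping below is a hypothetical stand-in for the one in resources.py, and the key name and helper scheme_of are illustrative.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for DBURL in resources.py; the key name is illustrative.
DBURL = {
    "interpro": "ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz",
}


def scheme_of(url):
    """Return the URL scheme, e.g. 'ftp' or 'http'."""
    return urlsplit(url).scheme


url = DBURL.get("interpro")
# dict.get hands back the very same string object that was stored.
print(url is DBURL["interpro"], scheme_of(url))
```

Logging scheme_of(url) right after the lookup and again just before the download call would show which step changes the address.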

xvazquezc commented 2 years ago

Now that you mention it, I had issues with 3.8 too, but I ended up installing 3.7.10 and everything works (in case you want to try). Can't comment on the other stuff though.

ferninfm commented 2 years ago

It is OK. As I said, I cheated to get the problem solved and it simply worked.