leapicard / DGINN


Local BLAST db possible? #13

Open Shellfishgene opened 2 years ago

Shellfishgene commented 2 years ago

Hi!

The paper suggests that local custom BLAST databases can be used, and the code seems to support them; however, the comment in the example parameters file talks about future versions. When I try it with a custom database I get an UnboundLocalError: local variable 'nbSeq' referenced before assignment in this line in BlastFunc.py, which makes sense, as that variable does not seem to be set when using local BLAST. I could maybe fix that, but I'm wondering if there will be problems later depending on the formatting of the sequence IDs in my database?
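For illustration, the kind of one-line guard I have in mind would be something like this (queryFile is just a placeholder, not the actual variable name in BlastFunc.py):

# Hypothetical: make sure nbSeq is set even when the remote-NCBI branch that normally sets it is skipped
nbSeq = sum(1 for line in open(queryFile) if line.startswith(">"))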

The background is that we have 5 species we would like to run DGINN on, but they are not in a public NCBI database yet.

leapicard commented 2 years ago

Hello,

I would like to include the possibility of working with local databases but haven't had the occasion yet; the main difficulty is that I don't have the kind of data needed to develop and test against. We would also need to include the necessary files to provide the taxonomic information for each sequence so that the downstream steps for duplication events run properly (a simple file associating each sequence identifier with the full species name would probably be enough), and the species in question would have to be properly included in the species tree provided for the tree reconciliation. I'd be happy to collaborate on this, since you appear to have the kind of database needed for development. Do you only want to interrogate your local database? Five species would probably not provide enough data for robust positive selection analyses; do you intend to include other species from publicly available databases?
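For illustration, the identifier-to-species file I mentioned above could be as simple as a two-column tab-separated file plus a small parser along these lines (the layout and the helper name are assumptions, not an existing DGINN format):

import csv

# Hypothetical mapping file, one "identifier<TAB>full species name" pair per line, e.g.
#   gene0001    Homo sapiens
#   gene0002    Pan troglodytes
def loadSeq2Species(mappingFile):
    """Read the identifier-to-species file into a dict for the downstream steps."""
    dSeq2Sp = {}
    with open(mappingFile, newline="") as f:
        for seqId, species in csv.reader(f, delimiter="\t"):
            dSeq2Sp[seqId] = species
    return dSeq2Sp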

Shellfishgene commented 2 years ago

I already saw that the sequence retrieval part for local databases still needs to be implemented. I'll have a look in more detail after Christmas! The lazier way might be to require the species name to be part of the sequence IDs, so no extra file would be needed, but it makes building the BLAST DB a bit more work. We'll add some more species to our dataset; we haven't figured out exactly which ones yet. Happy holidays!
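PS: as a rough sketch, the renaming plus database building I have in mind would look something like this (file names and the helper are placeholders; makeblastdb is part of the BLAST+ suite):

import subprocess

def renameAndBuildDb(inFasta, speciesTag, outFasta, dbName):
    """Prefix every FASTA header with a species tag, then build a nucleotide BLAST db."""
    with open(inFasta) as fin, open(outFasta, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                geneId = line[1:].strip().split()[0]
                fout.write(">%s_%s\n" % (speciesTag, geneId))
            else:
                fout.write(line)
    subprocess.run(["makeblastdb", "-in", outFasta, "-dbtype", "nucl",
                    "-parse_seqids", "-out", dbName], check=True)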

Shellfishgene commented 2 years ago

I cobbled something together to make it work, but haven't tested it much yet. I created my BLAST DB by renaming all my cDNAs to SpeCie_geneID and then added the following to FastaResFunc.py:

import logging
import os

def localDl(lBlastRes, queryName, datadir, blastDB):
    """Fetch the hit sequences from a local BLAST database with blastdbcmd."""
    # queryName is kept for symmetry with the remote download path but is not used here.
    logger = logging.getLogger("main.accessions")

    # Write the hit identifiers to a batch file for blastdbcmd.
    batchFile = os.path.join(datadir, "batch.txt")
    with open(batchFile, 'w') as f:
        for item in lBlastRes:
            f.write("%s\n" % item)

    # Retrieve "<accession> <sequence>" for every entry in the batch file.
    cmd = "blastdbcmd -entry_batch %s -db %s -outfmt '%%a %%s'" % (batchFile, blastDB)
    logger.debug(cmd)
    output = os.popen(cmd).read()

    # Split each line on the first space only: accession -> sequence.
    dId2Seq = dict(x.split(" ", 1) for x in filter(None, output.split("\n")))

    logger.info("Remote option off, used blastdbcmd to get sequences from the local BLAST db.")

    return dId2Seq

Plus some other minor changes. It's not very nice and needs more adaptation so that the remote option still works in the same version; for example, the parseBlast routine no longer works for remote searches in my version. Also, I don't really know Python...
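For reference, the way I call it is roughly this (paths and IDs are placeholders):

# lBlastRes holds the subject IDs parsed from the local BLAST output
lBlastRes = ["SpeCie_gene0001", "SpeCie_gene0002"]
dId2Seq = localDl(lBlastRes, "myQuery", "results/", "/path/to/my_blast_db")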

fuesseler commented 6 months ago

Hello! Chiming in here, as my cluster does not allow remote access to websites when running jobs, so I am forced to find an alternative solution for step 1 (also, the species I am focusing on are not on NCBI yet). Has there been any progress on being able to specify a local database for BLAST?

Currently I am considering just running the local BLAST separately and then providing the results in NCBI tabulated format (tsv). Would it be possible for you to upload the corresponding file from the "example" (and maybe also the intermediate files for the later entry steps), so I can cross-check whether my formatting is correct?
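For concreteness, the separate local BLAST run I have in mind would be roughly this (paths are placeholders; whether the default outfmt-6 columns match what DGINN expects is exactly what I would like to cross-check):

import subprocess

# Default NCBI tabular output (-outfmt 6): qseqid sseqid pident length mismatch
# gapopen qstart qend sstart send evalue bitscore
subprocess.run(["blastn", "-query", "myGene_cds.fasta", "-db", "my_local_db",
                "-outfmt", "6", "-out", "myGene_blastres.tsv"], check=True)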

Alternatively, I was thinking of a work-around: calling orthologs from my genome annotations using proteinortho6 and then entering the DGINN pipeline at step 3 with the CDS from those orthologs. Do you think it would be problematic that the orthologous group determination is then, so to speak, done twice, since DGINN does it again in step 5? This would probably be my preferred option rather than starting from BLAST, but I am unsure whether it's problematic.

Best regards, F

leapicard commented 6 months ago

Hello,

Sorry, no, using local databases has not been implemented. Unless someone wants to take the lead on that, I don't think we will have the time to devote to it ourselves.

Either of the options you have proposed should work, though the recent migration to Snakemake still needs to be tested for entry at downstream steps, so there might be some hiccups.

For the proteinortho6 option: if you know all your sequences to be actual orthologs, you don't need to perform the duplication step in DGINN and can simply turn it off with a False flag.

Lea
