nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
301 stars 82 forks source link

Excessive file generation #853

Open drabe004 opened 1 year ago

drabe004 commented 1 year ago

Hi! I recently ran funannotate compare with 14 genomes and it generated ~3.1mil files (and then was killed because our group hit the 15 mil file limit for our cluster allocation).

I'd like to a) be able to finish this run b) do a run with my whole dataset (126 genomes)

the latter will certainly generate more files than our 15 mil limit.

I'm wondering if there is any way we can get around this issue and if funannotate can be altered in any way to either auto-delete files or if there is a part of the pipeline we can write to S3 storage?

Any advice would be tremendously helpful!

Thanks so much in advance for your time!

nextgenusfs commented 1 year ago

Hi @drabe004. Sorry to hear you are getting in trouble on your cluster! The most genomes I've ever run through compare is probably like 12. The script was really written when I initially wrote funannotate to analyze about 6 genomes. It could probably be re-written to do more processing in memory instead of writing to file.

What part of compare are you most interested in? The html output with that many genomes won't be useful and will hardly be readable I would think.

drabe004 commented 1 year ago

Hi Jon!

Thanks for getting back to me! I was looking to use it more like BUSCO, to compare the quality of all the genomes we are using in an analysis. 14 of those are our own, and then the rest are on public databases. I was wondering if it gave more detail than BUSCO, but I wasn't sure from the tutorial description. We are running it on Beeond which allows us to write to scratch. Not sure its worth the effort though (sounds like BUSCO might be a better option?)

On Thu, Feb 2, 2023 at 6:11 PM Jon Palmer @.***> wrote:

Hi @drabe004 https://github.com/drabe004. Sorry to hear you are getting in trouble on your cluster! The most genomes I've ever run through compare is probably like 12. The script was really written when I initially wrote funannotate to analyze about 6 genomes. It could probably be re-written to do more processing in memory instead of writing to file.

What part of compare are you most interested in? The html output with that many genomes won't be useful and will hardly be readable I would think.

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/853#issuecomment-1414536401, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEQMAB7EAS3DIGNLVTIXGLWVRELHANCNFSM6AAAAAATZSH664 . You are receiving this because you were mentioned.Message ID: @.***>

-- Danielle H Drabeck PhD Postdoctoral Fellow

Department of Ecology, Evolution, and BehaviorUniversity of Minnesota

@. @. Pronouns: She/Her/Hers

drabe004 commented 8 months ago

Hi Jon!

I thought I'd reach out again as I'm having some issues with the funnannotate file formatting.

It gives me a nice GBK file with all the things in it. but when I run utils the parts files it gives are not in NCBI format and there seems to be no CDS file.

What I really need is to be able to take my gbk files and generate NCBI formatted files for CDS (and eventually also protein transcripts etc). I'm trying to write my own python script to do this and... its not going wonderfully well... I'm wondering if there is something you might know about or have that would help me in this endeavor.

Thanks so much for your help!

(Script attempts attached ) -- my script only extracted 279 CDS which... is wrong. (yikes).

~Danielle

On Fri, Feb 3, 2023 at 12:19 PM Danielle Drabeck @.***> wrote:

Hi Jon!

Thanks for getting back to me! I was looking to use it more like BUSCO, to compare the quality of all the genomes we are using in an analysis. 14 of those are our own, and then the rest are on public databases. I was wondering if it gave more detail than BUSCO, but I wasn't sure from the tutorial description. We are running it on Beeond which allows us to write to scratch. Not sure its worth the effort though (sounds like BUSCO might be a better option?)

On Thu, Feb 2, 2023 at 6:11 PM Jon Palmer @.***> wrote:

Hi @drabe004 https://github.com/drabe004. Sorry to hear you are getting in trouble on your cluster! The most genomes I've ever run through compare is probably like 12. The script was really written when I initially wrote funannotate to analyze about 6 genomes. It could probably be re-written to do more processing in memory instead of writing to file.

What part of compare are you most interested in? The html output with that many genomes won't be useful and will hardly be readable I would think.

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/853#issuecomment-1414536401, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEQMAB7EAS3DIGNLVTIXGLWVRELHANCNFSM6AAAAAATZSH664 . You are receiving this because you were mentioned.Message ID: @.***>

-- Danielle H Drabeck PhD Postdoctoral Fellow

Department of Ecology, Evolution, and BehaviorUniversity of Minnesota

@. @. Pronouns: She/Her/Hers

-- Danielle H Drabeck PhD Postdoctoral Fellow

Department of Ecology, Evolution, and BehaviorUniversity of Minnesota

@. @. Pronouns: She/Her/Hers

nextgenusfs commented 8 months ago

Hi @drabe004 . What do you mean by "NCBI formatted files for CDS"?

drabe004 commented 8 months ago

Yes , for example if I want the CDS from the refseq FTP I would

Wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.6_GRCz11/GCF_000002035.6_GRCz11_cds_from_genomic.fna.gz

And that file (and all CDS files from genbank) are formatted in the same way. (My script just grabbed one feature). But seems likely there is a standard gbk-> CDS script somewhere.

On Tue, Nov 7, 2023 at 3:39 PM Jon Palmer @.***> wrote:

Hi @drabe004 https://github.com/drabe004 . What do you mean by "NCBI formatted files for CDS"?

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/853#issuecomment-1800207397, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEQMADBM56V6FHGZ36EBU3YDKTATAVCNFSM6AAAAAATZSH666VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBQGIYDOMZZG4 . You are receiving this because you were mentioned.Message ID: @.***>

nextgenusfs commented 8 months ago

Genbank (NCBI) format is highly problematic for alternative transcripts -- there is actually nothing in the format to link an mRNA features to a CDS feature.

but that isn't really an issue here -- in the output directory of funannotate there should be (by default) protein, transcripts, cds-transcript FASTA files for the annotation. The headers are formatted like >[transcript id] [gene id], so you could simply pull out the sequences you are interested in with some existing tools, ie seqkit https://bioinf.shenwei.me/seqkit/, something like this to get all the sequences from the "GRCz11" gene:

$ seqkit grep -r -p GRCz11 cds-transcripts.fasta
drabe004 commented 8 months ago

Hmmm yes I see those files there but the headers are all formatted as FUN000001 And then there is a key file (something like fun-gene-names) that tells me which Funnanoate gene name corresponds to gene name, protein name etc. (Sorry I’m paraphrasing because I’m away from my computer) I suppose I can write a script to replace the headers from that key file.

But… Am I missing something?

On Tue, Nov 7, 2023 at 4:48 PM Jon Palmer @.***> wrote:

Genbank (NCBI) format is highly problematic for alternative transcripts -- there is actually nothing in the format to link an mRNA features to a CDS feature.

but that isn't really an issue here -- in the output directory of funannotate there should be (by default) protein, transcripts, cds-transcript FASTA files for the annotation. The headers are formatted like >[transcript id] [gene id], so you could simply pull out the sequences you are interested in with some existing tools, ie seqkit https://bioinf.shenwei.me/seqkit/, something like this to get all the sequences from the "GRCz11" gene:

$ seqkit grep -r -p GRCz11 cds-transcripts.fasta

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/853#issuecomment-1800321589, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEQMADOLHEDQRTRLJFYZ2TYDK3DBAVCNFSM6AAAAAATZSH666VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBQGMZDCNJYHE . You are receiving this because you were mentioned.Message ID: @.***>

nextgenusfs commented 8 months ago

Oh, yeah you are right... However, I'd caution against ever using common names and not locus_tags as its a recipe for disaster. locus_tags are unique while common names inferred by homology are inherently problematic.

The annotation is also in the GFF3 output, let me think a little bit about how to implement a search like functionality. All the GFF3 parsing and converting is now in a separate package GFFtk https://github.com/nextgenusfs/gfftk. Perhaps it would be possible to implement some sort of search-like filtering scheme to pull out relevant genes.

drabe004 commented 8 months ago

OK, yes, this is the issue I'm coming up against.

I will say that the NCBI CDS format retains both the genetag and the product, protein, etc... in the header, so it does allow you to grep whatever you like for each sequence, and keep the individual locus tag if needed.

This is the format ">lcl|VCAZ01000103.1_cds_TSS60346.1_1 [locus_tag=Baya_12048] [db_xref=UniProtKB/Swiss-Prot:O15068] [protein=Guanine nucleotide exchange factor DBS] [protein_id=TSS60346.1] [location=join(98180..98273,108239..108307,110609..110723,111040..111130,113133..113252,113550..113903,114876..114997,115288..115402,115488..115607,117258..117440,117544..117715,118851..118927,119020..119102,120551..120617,123361..123444,123554..123646,124755..124847,125041..125166,125381..125515,129150..129242,129337..129415,129497..129560,131699..131769,133527..133627,133715..133827,134030..134153)] [gbkey=CDS] ATGGGCCCAGGCTGTCCTCTCCAGCCTCAGAGAGACATGGAAGGCTATTACAGCCTTCTGCAGGCTGGCACTGTGCTGGA GAACACACTGCAGCAGGTGTCTATTCCTGTGAGTATGAAAGAAGTCGCCGGCTTTATCGAGAAACAAGTGGCTTATTTAT CAGGTGGCCGTGGCGAGGATTCCAGTGTTGTCATTACTCTTCCAGAGTGCTCAGATTTCAGTGACATCCCAGAGGAAGCG CTGGCTAAAGTCCTGACGTATCTCACACTGATTCCACGAGCAAGACAACCTGGAGTCAAATTTATCATTATTCTGGACCG A....."

It might help to say what I'm trying to do is to make a custom blast database for each genome, so that I can blast each sequence from a set of alignments back to the original genomes CDS/transcripts ---- so for each alignment I have blast output that I can use to assign headers to each sequence in each alignment (these are from Orthofinder--orthosnap). IN other words, its a way of turning protein alignments into nucleotide alignments (that each have annotations from the genome of origin).

I was exploring using the GFF3 file but this didn't seem to have a join list for all the exons in a CDS whereas the genbank file does... so I thought I'd run into issues where I'd have exon 1-5 separated out into 5 CDS sequences rather than one joined sequence.

I'll do some exploring of https://github.com/nextgenusfs/gfftk and see if there is something in there that comes close to what I'm looking for! Certainly let me know if you can think of a solution! Thanks so much!

~Danielle

On Tue, Nov 7, 2023 at 5:10 PM Jon Palmer @.***> wrote:

Oh, yeah you are right... However, I'd caution against ever using common names and not locus_tags as its a recipe for disaster. locus_tags are unique while common names inferred by homology are inherently problematic.

The annotation is also in the GFF3 output, let me think a little bit about how to implement a search like functionality. All the GFF3 parsing and converting is now in a separate package GFFtk https://github.com/nextgenusfs/gfftk. Perhaps it would be possible to implement some sort of search-like filtering scheme to pull out relevant genes.

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/853#issuecomment-1800348149, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEQMABH2H7ETKOTTAOGQ2DYDK5XHAVCNFSM6AAAAAATZSH666VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBQGM2DQMJUHE . You are receiving this because you were mentioned.Message ID: @.***>

-- Danielle H Drabeck PhD Postdoctoral Fellow

Department of Ecology, Evolution, and BehaviorUniversity of Minnesota

@. @. Pronouns: She/Her/Hers

nextgenusfs commented 8 months ago

It's reasonable to add more information to the FASTA header, I'll likely implement this in GFFtk as that is what the "new" funannotate2 uses in the backend.

drabe004 commented 8 months ago

Great!

Looking at the -h for gfftk

would this look something like

gfftk filename.gff3 --convert "fasta"

1) how do I tell it what kind of fasta I want and what headers to retain?

usage: gfftk [-h] [--version] {consensus,convert,sort,sanitize,rename,stats,compare} ...

GFFtk: tool kit for GFF3 genome annotation manipulation

Commands: consensus EvidenceModeler-like tool to generate consensus gene predictions. convert convert GFF3/tbl format into another format [output gff3, gtf, tbl, gbff, fasta]. sort sort GFF3 file properly [maintain feature order: gene, mrna, exon, cds]. sanitize sanitize GFF3 file, load GFF3 and output cleaned up GFF3 output. rename rename gene models in GFF3 annotation file. stats parse annotation GFF3/tbl and output summary statistics. compare compare two GFF3 annotations of a genome.

Help: -h, --help Show this help message and exit --version show program's version number and exit

On Wed, Nov 8, 2023 at 10:11 AM Jon Palmer @.***> wrote:

It's reasonable to add more information to the FASTA header, I'll likely implement this in GFFtk as that is what the "new" funannotate2 uses in the backend.

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/853#issuecomment-1802210632, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEQMAE2NFBI7WNVGTSI67LYDOVMPAVCNFSM6AAAAAATZSH666VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBSGIYTANRTGI . You are receiving this because you were mentioned.Message ID: @.***>

-- Danielle H Drabeck PhD Postdoctoral Fellow

Department of Ecology, Evolution, and BehaviorUniversity of Minnesota

@. @. Pronouns: She/Her/Hers