oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Question regarding summary stats generation #312

Open fischer-hub opened 1 month ago

fischer-hub commented 1 month ago

Hi! Thanks for the nice work!

I'm wondering how exactly the summary statistics in the id.txt output file are calculated. I ran Bakta on some paratuberculosis assemblies. The summary txt file in the Bakta output reports, for example, ~100 hypotheticals, but when I count the features in the .gff3 annotation output that are flagged as hypotheticals in the attribute field, I get 4-5 times as many. Is there some additional filter condition in the calculation of the reported statistics that I'm missing? I see the same issue with pseudogenes, but the discrepancy is smaller there.

Thanks in advance!

oschwengers commented 4 weeks ago

Interesting! At first glance, I have no idea what might cause this. Could you provide the log/summary/gff3 files so I can take a deeper look?

fischer-hub commented 3 weeks ago

Sure!

I don't know if I can share the log file since the sequencing data is not public yet, but maybe you can already make sense of this without it. I also had to rename the gff3 file so GitHub would let me upload it :)

08MA0619.txt 08MA0619.gff3.txt

Now e.g. the summary.txt says there are 111 hypotheticals but when I run:

cat 08MA0619/08MA0619.gff3.txt | grep 'hypo' | wc -l

that returns 513 lines with hits in the annotation file.

For pseudogenes I get 9 from the summary and 14 from grep; for CDSs the counts almost match (off by one). I think the discrepancy is largest for the hypothetical features!
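Since grep counts matching lines rather than features, one way to narrow down where the extra hits come from might be to break the matches down by feature type. Below is a minimal Python sketch. It assumes only the standard nine-column GFF3 layout and a `product` attribute on the relevant lines; the function name `tally_matches` is made up for illustration and is not part of Bakta.

```python
import collections
import urllib.parse


def tally_matches(gff3_lines, needle="hypothetical"):
    """Count GFF3 feature lines containing `needle`, grouped by feature type.

    Returns two Counters keyed by the type column (column 3):
    - by_type: lines where `needle` occurs anywhere (what grep would count)
    - by_product: lines where `needle` occurs in the `product` attribute
    Assumption: attributes in column 9 are `key=value` pairs separated by ';'.
    """
    by_type = collections.Counter()
    by_product = collections.Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or needle not in line:
            continue
        ftype = cols[2]
        by_type[ftype] += 1
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # GFF3 percent-encodes reserved characters in attribute values
        product = urllib.parse.unquote(attrs.get("product", ""))
        if needle in product:
            by_product[ftype] += 1
    return by_type, by_product
```

Calling `tally_matches(open("08MA0619/08MA0619.gff3"))` and printing both counters would show whether the extra grep hits sit on non-CDS lines or in attributes other than `product`. The match is case-sensitive, like plain grep.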

Thanks for looking into this!

oschwengers commented 2 weeks ago

Unfortunately, based on these two files, I have no clue what's going on. Could you provide either the log file (after executing Bakta with the --debug flag set) and the JSON file, or the genome sequence, so that I can try to reproduce this myself? If you like, you could also send me the files via email, if that's more comfortable for you given the unpublished data.

fischer-hub commented 2 weeks ago

I sent you the sequence file via email to the address from your github profile! Hope this helps!

oschwengers commented 2 weeks ago

Thanks, I received your genome and tested it with Bakta v1.9.3 and DB v5.1. However, I cannot reproduce your results / this issue. Bakta reports 170 proteins marked as hypothetical protein, which is exactly the number I get via grep:

Bakta v1.9.3
Options and arguments:
    input: /home/oliver/tmp/bakta-gh-312/08MA0619.fa
    db: /home/oliver/tmp/databases/bakta/db-v5.1, version 5.1, full
...
annotation summary:
    tRNAs: 47
    tmRNAs: 1
    rRNAs: 3
    ncRNAs: 17
    ncRNA regions: 15
    CRISPR arrays: 0
    CDSs: 4510
        hypotheticals: 170
...

and

$ grep -c hypothetical test/test.gff3 
170

Any idea?

fischer-hub commented 6 days ago

I still get different Bakta summary stats for this sample:

cat 08MA0619/08MA0619.txt 
Sequence(s):
Length: 4790403
Count: 125
GC: 69.3
N50: 92459
N ratio: 0.0
coding density: 90.9

Annotation:
tRNAs: 47
tmRNAs: 1
rRNAs: 3
ncRNAs: 17
ncRNA regions: 15
CRISPR arrays: 0
CDSs: 4510
pseudogenes: 9
hypotheticals: 111
signal peptides: 0
sORFs: 1
gaps: 0
oriCs: 2
oriVs: 0
oriTs: 0

Bakta:
Software: v1.9.3
Database: v5.1, full
DOI: 10.1099/mgen.0.000685
URL: github.com/oschwengers/bakta

but with grep:

$ grep -c hypothetical 08MA0619/08MA0619.gff3 
512

Even though this is the same Bakta and DB version, right? Is this maybe a grep issue? Which grep version did you use? This is so confusing.

$ grep --version
grep (GNU grep) 3.4

Although the different Bakta summary count could maybe be caused by me using the --proteins option with the attached protein file? But that still doesn't explain the grep issue, hmm.
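To check whether the extra protein file is what changes the summary, the two summary files (one run with it, one without) could be compared key by key. This is a rough Python sketch that assumes the summary consists of `key: value` lines under section headings, as in the output pasted above; `parse_summary` and `diff_summaries` are hypothetical helper names, not Bakta functions.

```python
def parse_summary(text):
    """Parse `key: value` lines into a dict, skipping bare section headings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value:  # headings like "Annotation:" have no value
            stats[key.strip()] = value
    return stats


def diff_summaries(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Reading both files with `open(path).read()`, passing each through `parse_summary`, and printing `diff_summaries(run_a, run_b)` would list only the counts that differ between the two runs.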

GCA_025665475_1_genomic.gbff.txt