Open fischer-hub opened 1 month ago
Interesting! At first glance, I have no idea what might cause this. Could you provide the log, summary, and GFF3 files so that I can take a deeper look?
Sure!
I don't know if I can share the log file since the sequencing data is not public yet, but maybe you can already make sense of this without it. I also had to rename the gff3 file so that GitHub would let me upload it :)
08MA0619.txt
08MA0619.gff3.txt
Now e.g. the summary.txt says there are 111 hypotheticals but when I run:
cat 08MA0619/08MA0619.gff3.txt | grep 'hypo' | wc -l
that returns 513 lines with hits in the annotation file.
For pseudogenes I get 9 from the summary and 14 from grep; for CDS the counts almost match (off by one). The discrepancy is largest for the hypothetical features, I think!
Thanks for looking into this!
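One way to see where the extra grep hits come from is to break the matching lines down by GFF3 feature type (column 3). Note that grep counts every matching line in the file, which can include non-CDS rows, comment lines, and an embedded FASTA section if one is present. The following is only a minimal sketch: the function name hits_by_type is made up for illustration, and it assumes a standard tab-separated, nine-column GFF3 file.

```shell
# Break down lines that contain a fixed string by GFF3 feature type (column 3).
# Usage: hits_by_type PATTERN FILE
# hits_by_type is a hypothetical helper, not part of Bakta or grep.
hits_by_type() {
  pattern="$1"
  file="$2"
  awk -F '\t' -v pat="$pattern" '
    /^##FASTA/ { exit }                        # stop at an embedded FASTA section, if any
    /^#/ { next }                              # skip comment and directive lines
    NF >= 9 && index($0, pat) > 0 { n[$3]++ }  # count matching feature rows per type
    END { for (t in n) printf "%s\t%d\n", t, n[t] }
  ' "$file" | sort
}
```

For example, `hits_by_type hypo 08MA0619/08MA0619.gff3.txt` prints one line per feature type with the number of matching rows, which should show whether the hits are all CDS rows or are spread over other feature types as well.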
Unfortunately, based on these two files, I have no clue what's going on. Could you either provide the log file (after executing Bakta with the --debug flag set) and the JSON file, or the genome sequence so that I can try to reproduce this myself? If you like, you could also send me the files via email, if that's more comfortable for you in terms of "unpublished data".
I sent you the sequence file via email to the address from your github profile! Hope this helps!
Thanks, I received your genome and tested it with Bakta v1.9.3 and DB v5.1. However, I cannot reproduce this issue. Bakta reports 170 proteins marked as hypothetical protein, which is exactly the number of such proteins I get via grep:
Bakta v1.9.3
Options and arguments:
input: /home/oliver/tmp/bakta-gh-312/08MA0619.fa
db: /home/oliver/tmp/databases/bakta/db-v5.1, version 5.1, full
...
annotation summary:
tRNAs: 47
tmRNAs: 1
rRNAs: 3
ncRNAs: 17
ncRNA regions: 15
CRISPR arrays: 0
CDSs: 4510
hypotheticals: 170
...
and
$ grep -c hypothetical test/test.gff3
170
Any idea?
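Since grep -c counts any line containing the word, a stricter cross-check is to count only CDS rows whose product attribute is exactly "hypothetical protein". This is a sketch under assumptions: the function name count_hypothetical_cds is made up, and it assumes that hypothetical CDSs carry a product=hypothetical protein attribute in column 9. It is not necessarily how Bakta computes the summary value.

```shell
# Count CDS rows whose product attribute equals "hypothetical protein".
# Usage: count_hypothetical_cds FILE
# Assumption: the attribute is written as product=hypothetical protein in column 9.
count_hypothetical_cds() {
  awk -F '\t' '
    /^##FASTA/ { exit }          # stop at an embedded FASTA section, if any
    /^#/ { next }                # skip comment and directive lines
    $3 == "CDS" {
      n_attr = split($9, attrs, ";")
      for (i = 1; i <= n_attr; i++) {
        if (attrs[i] == "product=hypothetical protein") { count++; break }
      }
    }
    END { print count + 0 }      # print 0 when nothing matched
  ' "$1"
}
```

Running `count_hypothetical_cds test/test.gff3` next to `grep -c hypothetical test/test.gff3` shows whether the two numbers agree for a given file. If they differ, the extra grep hits come from rows that are not CDS or that mention the word in a different attribute.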
I still get different Bakta summary stats for this sample:
cat 08MA0619/08MA0619.txt
Sequence(s):
Length: 4790403
Count: 125
GC: 69.3
N50: 92459
N ratio: 0.0
coding density: 90.9
Annotation:
tRNAs: 47
tmRNAs: 1
rRNAs: 3
ncRNAs: 17
ncRNA regions: 15
CRISPR arrays: 0
CDSs: 4510
pseudogenes: 9
hypotheticals: 111
signal peptides: 0
sORFs: 1
gaps: 0
oriCs: 2
oriVs: 0
oriTs: 0
Bakta:
Software: v1.9.3
Database: v5.1, full
DOI: 10.1099/mgen.0.000685
URL: github.com/oschwengers/bakta
but with grep:
$ grep -c hypothetical 08MA0619/08MA0619.gff3
512
And this is with the same Bakta and DB version, right? Is this maybe a grep issue? Which grep version did you use? This is so confusing.
$ grep --version
grep (GNU grep) 3.4
Although the different Bakta summary count could maybe be caused by me using the --protein flag with the attached protein file? But that still doesn't explain the grep issue, hmm.
Hi! Thanks for the nice work!
I'm wondering how exactly the summary statistics are calculated for the id.txt output file. I ran Bakta on some paratuberculosis assemblies, and when I check the summary txt file in the Bakta output I get, for example, ~100 hypotheticals, but when I count the features in the .gff3 annotation output that are flagged as hypotheticals in the attribute field, I get 4-5 times more features. I'm wondering if there is some additional filter condition in the calculation of the statistics that Bakta reports that I'm missing? I also have this issue with pseudogenes, but the discrepancy is not as high there. Thanks in advance!