kbaseapps / GenomeFileUtil

MIT License
0 stars 16 forks source link

Replace indexed based lookup with dictionary lookup #199

Closed scanon closed 2 years ago

scanon commented 2 years ago

This is to address a performance issue that was seen when uploading large metagenomes. The FastaGFF uploader using an index call to get the contig_length. This winds up being slow when there are 10M plus contigs. Fortunately there was already a dict of the same information. I'm not sure why that wasn't used in the first place. Testing shows this gives linear performance even for very large metagenomes.

codecov[bot] commented 2 years ago

Codecov Report

Merging #199 (bee2c55) into master (0b0b4f2) will decrease coverage by 6.33%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #199      +/-   ##
==========================================
- Coverage   74.41%   68.08%   -6.34%     
==========================================
  Files          11       11              
  Lines        2877     2886       +9     
==========================================
- Hits         2141     1965     -176     
- Misses        736      921     +185     
Impacted Files Coverage Δ
lib/GenomeFileUtil/core/FastaGFFToGenome.py 88.54% <100.00%> (-0.80%) :arrow_down:
lib/GenomeFileUtil/core/GenomeToGFF.py 29.69% <0.00%> (-46.67%) :arrow_down:
lib/GenomeFileUtil/GenomeFileUtilImpl.py 49.47% <0.00%> (-23.35%) :arrow_down:
lib/GenomeFileUtil/core/GenomeToGenbank.py 89.44% <0.00%> (-6.12%) :arrow_down:
lib/GenomeFileUtil/core/GenbankToGenome.py 89.19% <0.00%> (-4.29%) :arrow_down:
lib/GenomeFileUtil/core/GenomeInterface.py 79.44% <0.00%> (-0.35%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0b0b4f2...bee2c55. Read the comment docs.

Tianhao-Gu commented 2 years ago

👍 LGTM