rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0

Use 1 million sequences rather than 100.000 to find any overrepresented ones. #23

Closed rhpvorderman closed 1 year ago

rhpvorderman commented 1 year ago


The previous implementation used a separate hash array and entry array. Each occupied slot in the hash array corresponded to an occupied slot at the same index in the entry array.

This PR introduces an entry_index array containing uint32_t indexes. Each occupied hash corresponds to an entry_index in the entry_index array, and that entry_index in turn points to an entry in the entry array. This allows the entry array to simply fill up from index 0 to MAX_UNIQUE_SEQUENCES - 1. A similar dictionary scheme is used in CPython. This preserves insertion order and takes up significantly less memory.
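
To illustrate the layout described above, here is a minimal Python model of the scheme. It is only a sketch and not the code from this PR: the class name `CompactCounter`, the constant `HASH_TABLE_SIZE`, the `EMPTY` sentinel, the linear probing strategy, and the tiny sizes are all assumptions chosen for the demo. It also leaves out the separate hash array and compares keys directly.

```python
from array import array

# All values below are hypothetical and kept tiny for demonstration purposes.
MAX_UNIQUE_SEQUENCES = 4   # capacity of the dense entry array
HASH_TABLE_SIZE = 8        # power of two, larger than MAX_UNIQUE_SEQUENCES
EMPTY = 0xFFFFFFFF         # sentinel: this entry_index slot is unused


class CompactCounter:
    """Counts keys using a sparse index array plus a dense entry array."""

    def __init__(self):
        # Sparse part: one small unsigned index per hash slot
        # (assumes the platform's unsigned int is at least 32 bits wide).
        self.entry_index = array("I", [EMPTY] * HASH_TABLE_SIZE)
        # Dense part: [key, count] pairs, filled from index 0 upwards,
        # so iteration over it follows insertion order.
        self.entries = []

    def _find_slot(self, key):
        # Linear probing. Terminates because at most MAX_UNIQUE_SEQUENCES
        # of the HASH_TABLE_SIZE slots can ever be occupied.
        slot = hash(key) & (HASH_TABLE_SIZE - 1)
        while True:
            idx = self.entry_index[slot]
            if idx == EMPTY or self.entries[idx][0] == key:
                return slot
            slot = (slot + 1) & (HASH_TABLE_SIZE - 1)

    def add(self, key):
        """Count one occurrence; return False if the key is new and the table is full."""
        slot = self._find_slot(key)
        idx = self.entry_index[slot]
        if idx != EMPTY:
            self.entries[idx][1] += 1
            return True
        if len(self.entries) >= MAX_UNIQUE_SEQUENCES:
            return False
        # The next free entry is always at the end of the dense array.
        self.entry_index[slot] = len(self.entries)
        self.entries.append([key, 1])
        return True
```

In this model the wide entries exist only for sequences that were actually inserted, while the larger, sparsely filled table holds nothing but small integers. Reading `entries` from start to end returns the sequences in the order they were first seen.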