Use 1 million sequences rather than 100.000 to find any overrepresented ones.

Checklist

[ ] Pull request details were added to CHANGELOG.rst
[ ] Documentation was updated (if needed)

The previous implementation used a separate hash array and entry array. Each occupied hash would correspond with an occupied entry array at the same spot.

This PR introduces a entry_index array containing uint32_t indexes. Each occupied hash corresponds with an entry_index in the entry_index array. The entry_index then corresponds with an entry in the entry array. This allows the entry_array to simple fill up from index 0 to MAX_UNIQUE_SEQUENCES -1. A similar dictionary scheme is used in CPython. This preserves order of insertion as well as taking up significantly less memory.

rhpvorderman / sequali

Use 1 million sequences rather than 100.000 to find any overrepresented ones. #23

Checklist