The main idea is to provide a list of read names and to iterate over the file to find the matching reads. Example use cases:
Extract read names from a region of a mapped BAM file and extract all the unmapped reads contained there.
Extract read and mates from a region of a mapped BAM in two steps.
Grep a concrete set of reads generating weird results in an analysis (or debug code)
Requirements are the following (mostly based on performance):
Efficient contains for the read name dictionary: this is the most important feature for the tool to be fast. Lookup for each read will be performed, and both the names and the number of reads might be large. For example the Aho-Corasick algorithm can be used, which has implementations in java that are available in maven central (e.g., https://github.com/robert-bor/aho-corasick or https://github.com/hankcs/AhoCorasickDoubleArrayTrie).
Efficient storage of read name "dictionary": the memory usage should be low even if millions of read pairs are added. This fits with the idea of using the Aho-Corasick algorithm, which is based on tries. As an example of trie library in java available in maven central is https://github.com/takawitter/trie4j. Maybe this is enough without the Aho-Corasick algorithm.
Speed up by providing a number of maximum matches: for example, if the read is found already once, remove from the dictionary for faster lookups. This should be an option for the user.
Option to invert search, such that reads not in the set are output.
As an optional feature, which might not be included in the first implementation, it will be nice to have a way to filter out reads with a ReadFilter - e.g., only first of pair/second of pair.
A putative problem might arise from converting the read names internally to the standard format: e.g., user provides "readName/1" instead of "readName". We should be clear that we do not handle such cases (a future improvement is to normalize provided names or switch-off the conversion for this tool).
The main idea is to provide a list of read names and to iterate over the file to find the matching reads. Example use cases:
Requirements are the following (mostly based on performance):
ReadFilter
- e.g., only first of pair/second of pair.