Adds the ability to handle promoters (regions with both promoter nucleotide information and non-promoter codon information).
Adds handling of insertions and deletions in nucleotide and codon sequences.
Most of the Pointfinder species should be handled after this update, but we should double-check them.
Some notable "junky" stuff:
Added logic to handle all of the comments (lines starting with "#") in the database files. Some of the older #gene_id column header processing has been replaced with gene_id.
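As a rough illustration of the comment handling described above, a minimal sketch might filter out lines starting with "#" before parsing the table. The function name, the column names other than gene_id, and the example rows are assumptions made for illustration, not the actual staramr code or Pointfinder data:

```python
import csv
import io


def read_database_table(handle):
    """Return one dict per data row, skipping lines that start with '#'."""
    # Drop comment lines first so the header row is the plain "gene_id" line.
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


# Hypothetical file contents; real database files have more columns.
example = io.StringIO(
    "# hypothetical comment line\n"
    "gene_id\tcodon_pos\tref_nuc\n"
    "gyrA\t83\tTCC\n"
    "# another comment\n"
    "gyrA\t87\tGAC\n"
)
rows = read_database_table(example)
```

Filtering on the start of the line, rather than on any "#" character, keeps values that happen to contain "#" intact.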
Codon insertions appear to be off-by-one (+1) in the Pointfinder database. The code tries to report the position accurately while accounting for the Pointfinder database being off-by-one. There are special cases throughout the code to handle this.
Because of the above problem, an extra column is added to the pointfinder.tsv output. It re-adjusts the mutation back to the incorrect (off-by-one) Pointfinder database co-ordinate so that users can see that it matches the Pointfinder database (even though it's wrong in that database).
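The idea behind the extra column can be sketched as follows. This is only a sketch: the function name and the `offset` parameter are hypothetical, and the direction of the adjustment (database position = reported position + 1) is an assumption read from the "(+1)" note above, not something confirmed against the database:

```python
def pointfinder_database_position(reported_position, is_codon_insertion, offset=1):
    """Map an accurately reported position back to the database co-ordinate.

    Assumption: only codon insertions are shifted, and they are shifted by +1.
    """
    if is_codon_insertion:
        return reported_position + offset
    return reported_position


# Hypothetical example values.
insertion_db_position = pointfinder_database_position(83, is_codon_insertion=True)
point_mutation_db_position = pointfinder_database_position(83, is_codon_insertion=False)
```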
Promoters are handled as a special case, where all mutations in the "negative" part are considered 0-based nucleotide co-ordinate mutations and all mutations in the "positive" part are considered 1-based codon co-ordinate mutations. Promoters are identified by having the word promoter in their name. This assumes such names will only be used for promoters. The length of the negative component is derived from the FASTA record name.
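A minimal sketch of the promoter co-ordinate split described above. The record-name pattern (`promoter_size_<N>bp`), the function names, and the convention of expressing the negative part as an offset from the promoter/coding boundary are all assumptions made for illustration:

```python
import re


def promoter_length(record_name):
    """Extract the promoter length from a FASTA record name.

    Assumes a name such as 'ampC_promoter_size_53bp' (hypothetical format).
    """
    match = re.search(r"promoter_size_(\d+)bp", record_name)
    if match is None:
        raise ValueError(f"no promoter length found in {record_name!r}")
    return int(match.group(1))


def locate(index, length):
    """Classify a 0-based index into the full promoter + coding sequence.

    Returns ('nucleotide', negative offset) inside the promoter, or
    ('codon', 1-based codon number) inside the coding part.
    """
    if index < length:
        return ("nucleotide", index - length)
    return ("codon", (index - length) // 3 + 1)


length = promoter_length("ampC_promoter_size_53bp")
```

With this sketch, `locate(52, length)` falls in the promoter and `locate(53, length)` is the first nucleotide of codon 1.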
Codon deletions where there are no reference nucleotides listed need to be handled as a special case. For some reason, the Pointfinder database uses 0-based nucleotide co-ordinates for these codon deletions instead of the 1-based codon co-ordinates used by other codon mutation entries (deletions, point mutations). The Pointfinder database table is modified after being loaded to accommodate this.
Related to the above, these mutations are listed as a nucleotide deletion whose length is a multiple of 3, instead of showing the codon deleted. Since staramr needs to compare mutations consistently, they are converted from nucleotides to codons, since the "directionality" of the conversion favours going from nucleotides to codons.
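The nucleotide-to-codon conversion for these deletions can be sketched like this, assuming the deletion starts on a codon boundary. The function name and return shape are hypothetical, not the staramr implementation:

```python
def nucleotide_deletion_to_codons(nucleotide_start, nucleotide_length):
    """Convert a 0-based nucleotide deletion into 1-based codon co-ordinates.

    Returns (first deleted codon, number of codons deleted).
    Assumes the deletion starts on a codon boundary.
    """
    if nucleotide_length % 3 != 0:
        raise ValueError("deletion length must be a multiple of 3")
    if nucleotide_start % 3 != 0:
        raise ValueError("deletion must start on a codon boundary")
    return (nucleotide_start // 3 + 1, nucleotide_length // 3)


first_codon, codon_count = nucleotide_deletion_to_codons(6, 6)
```

Going in this direction needs only integer division, whereas going from codons back to nucleotides would require choosing which nucleotide positions to report.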
If there's a deletion in the INPUT sequence, then Pointfinder expects 1) the REFERENCE sequence to be listed as having the deletion (even though there's no deletion in the reference); and 2) the INPUT sequence to be listed as the deleted reference bases (rather than '---').