jeffdaily / parasail

Pairwise Sequence Alignment Library

support for alphabets with more than 256 symbols #65

Open hernot opened 5 years ago

hernot commented 5 years ago

Make parasail aware of alphabets with more than 256 symbols, e.g. UTF-8 strings and others. This might require representing the scoring table as a sparse matrix, with an additional lookup table that transcribes the input alphabet into one with no gaps between the used symbols, thereby minimizing the size of the scoring matrix or table.

For example, when using only capital ASCII letters, the lookup table could recode the alphabet to the symbols 0-25 and thus reduce the scoring matrix to at most 26x26 entries.
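A minimal Python sketch of this lookup-table idea, not parasail's actual implementation: `build_lookup` and `build_scoring_matrix` are hypothetical helper names, and the match/mismatch scores are arbitrary placeholders.

```python
def build_lookup(symbols):
    """Map each distinct symbol to a dense index 0..n-1 (hypothetical helper)."""
    return {sym: idx for idx, sym in enumerate(sorted(set(symbols)))}


def build_scoring_matrix(size, match=1, mismatch=-1):
    """Dense size x size table: `match` on the diagonal, `mismatch` elsewhere."""
    return [[match if i == j else mismatch for j in range(size)]
            for i in range(size)]


# Capital ASCII letters only: 26 symbols recoded to indices 0..25.
lookup = build_lookup("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
matrix = build_scoring_matrix(len(lookup))

# Score a pair of symbols by going through the lookup table first.
score_same = matrix[lookup["E"]][lookup["E"]]
score_diff = matrix[lookup["E"]][lookup["S"]]
```

The matrix size depends only on the number of distinct symbols in use, not on where those symbols sit in the original encoding.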

jeffdaily commented 5 years ago

Would you mind providing a use case for this request? Is this for generic, non-biological sequences?

hernot commented 5 years ago

We have a set of about 6000 signals generated by a whole fleet of maintenance machines, which we are currently transferring into the cloud. Depending on the firmware version of the machine control system and the path by which the data is exported, the labels for the same signals may vary: in the language used to form them, in the words and abbreviations embedded in them, in the improper use of other characters in place of plain white space, and in upper and lower case. These variants look the same to a human reader but are difficult for the importer of our data cloud to identify.

We would need support for at least a UTF-8 type alphabet, or a gap-less subset thereof, in order to give our customers hard numbers on how big the problem is, how much effort it would take, and thus how much it would cost to match and transparently map the already imported data to the finally established, standardised signal labels used by the latest firmware versions, instead of just tossing the old data and starting over with the data generated by a modern firmware.

A rather simple example would be

V-1.0) 'Engine 1 Speed kmh'
V-1.6) 'engine 1 speed [kmh]'
V-2.) 'Engine 1 Speed'

The first two differ essentially only in case (plus the brackets around the unit), which matters to the import system but not to a human reader of the header.

And these are just rather simple and obvious examples. The language may differ as well, especially if the only means to export the data is a freely configurable data logger system, which is the case for very old machines still in use by the customer. In that case the service technician who activates and configures the data export is free to define the label used to export the data, which may also be in Japanese.

hernot commented 5 years ago

Related to this, it would also help to add a parasail::CASESENSITIVE flag to the parasail_create_matrix method. The flag would change the mapper such that lower and upper case letters are no longer mapped onto each other, so that strings which differ only in casing, like the ones cited previously, are counted as edits, in contrast to the default where they are considered equal. This would already help a lot.

jeffdaily commented 5 years ago

I think I can definitely support a case-insensitive alphabet. However, I don't think I could easily support alphabets with more than 256 characters. All parasail interfaces assume the sequences are arrays of unsigned bytes. That interface only supports 256 characters.
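To illustrate the byte-array limitation, here is a small Python sketch. The Japanese label is made up for illustration, loosely modelled on the 'Engine 1 Speed' example above. Once a string is encoded as UTF-8, every byte is below 256, but a single character may span several bytes, so a byte-wise aligner would see one character as several symbols.

```python
# Hypothetical Japanese signal label, invented for illustration.
label = "エンジン 1 速度"

encoded = label.encode("utf-8")

num_chars = len(label)    # number of Unicode code points
num_bytes = len(encoded)  # number of unsigned bytes a byte-based API would see

# Each kana or kanji character here occupies 3 bytes in UTF-8,
# while the ASCII space and digit occupy 1 byte each.
bytes_per_char = [len(ch.encode("utf-8")) for ch in label]
```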

hernot commented 5 years ago

Hi

Inline response (answers below the quoted text). Message sent on Wednesday, 23.01.2019, 22:50 +0000 by Jeff Daily:

I think I can definitely support a case-insensitive alphabet.  However, I don't think I could easily support alphabets with more than 256 characters.  All parasail interfaces assume the sequences are arrays of unsigned bytes.  That interface only supports 256 characters.

It is clear to me that this cannot be solved within a short time, so please take it as a feature request for the future. Second, most languages, especially the ones based on Latin, Greek and the like, encode their words with fewer than 128 symbols. Further, as long as it is not necessary to cross-correlate strings from different languages, it would be possible to use the mapping table, which parasail uses anyway to ensure case insensitivity, to remap the symbols from their natural position within the encoding, which can be above ordinal number 256, into the symbol range [0,256). As a side effect, all alphabets, including subsets of Latin-1 and ASCII, could be condensed into the smallest possible set of symbols. The only thing to change, besides extending the lookup table, would be to ensure that the mapping table is applied to both strings to map them from the original encoding to parasail's internal condensed unsigned 8-bit encoding. I assume this would be the biggest code change, bigger than the change to the mapping table itself.

A further side effect would be that parasail could support any set of unique symbols, independent of encoding, as long as the total number of symbols does not exceed 256.

I cannot promise that I will have time to hack together a Python example of what I mean, but I will try to find some.

Best Xristoph

hernot commented 5 years ago

By the way, I played around with remapping alphabets with fewer than 256 symbols to symbols from the Latin-1 encoding, such that pairs of upper and lower case characters in the original alphabet are mapped to a pair of upper and lower case letters from Latin-1. While doing so, an idea crossed my mind which might simplify the problem of supporting case-sensitive alphabets or encodings with more than 256 symbols.

As long as the number of symbols used by the effective alphabet is less than 256, the alphabet, and thus the strings, can be recoded to fit into the range between 1 and 255 (0 is the string terminator in C) before the alphabet and later the strings are passed to parasail. The issue of case sensitivity, or of upper and lower case pairs no longer lining up with the upper and lower case pairs in Latin-1, could be solved as follows:
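The recoding step can be sketched in Python as follows. This is only an illustration under assumptions: `build_recoding` and `recode` are hypothetical helper names, not parasail functions, and the Japanese label is invented. The one property that matters is that both strings go through the same table.

```python
def build_recoding(*strings):
    """Assign each distinct character a code in 1..255 (0 is reserved for C's NUL)."""
    symbols = sorted(set("".join(strings)))
    if len(symbols) > 255:
        raise ValueError("effective alphabet exceeds 255 symbols")
    return {sym: code for code, sym in enumerate(symbols, start=1)}


def recode(text, table):
    """Translate a string into bytes using the shared table."""
    return bytes(table[ch] for ch in text)


a = "Engine 1 Speed kmh"
b = "エンジン 1 速度"  # hypothetical label, invented for illustration

# One table for both strings, so equal characters get equal byte codes.
table = build_recoding(a, b)
a_bytes = recode(a, table)
b_bytes = recode(b, table)
```

After recoding, each character is exactly one byte regardless of its original encoding, so the byte sequences (together with an alphabet built from the same codes) fit a byte-oriented interface.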

Instead of calling the toupper() and tolower() functions from libc directly, introduce function pointers that are called in their place, for example 'alphabet_upper' and 'alphabet_lower', with the same signatures as the libc functions. By default both would be initialized to point to the libc versions. Via a method such as parasail_set_casing_handlers(upper,lower), or parasail_set_casing_handlers(alphabet,upper,lower) if it is preferred to associate them with the alphabet (user matrix), users of the parasail library could provide their own replacements that properly handle the alignment of upper and lower case letters, or that simply return the passed id if matching should be case sensitive.

Please let me know what you think and if you have any questions.
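A rough Python analogue of the pluggable-handler proposal, as a sketch only: `make_mapper` and `identity` are hypothetical names, and Python's `str.upper` / `str.lower` stand in for the libc defaults. It does not reflect how parasail builds its mapper internally.

```python
def make_mapper(alphabet, upper=str.upper, lower=str.lower):
    """Build a symbol -> index table using caller-supplied casing handlers.

    Both case variants of a symbol (as produced by the handlers) share one index.
    """
    mapper = {}
    for idx, sym in enumerate(alphabet):
        mapper[upper(sym)] = idx
        mapper[lower(sym)] = idx
    return mapper


def identity(sym):
    """Casing handler that returns the symbol unchanged (case-sensitive matching)."""
    return sym


# Default handlers: 'E' and 'e' collapse onto the same index.
insensitive = make_mapper("Ee")

# Identity handlers: 'E' and 'e' stay distinct symbols.
sensitive = make_mapper("Ee", upper=identity, lower=identity)
```

Passing the handlers per alphabet, as in the three-argument variant proposed above, avoids global state and lets differently cased alphabets coexist.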