"Hybrid" Experiment: Calculate the sequence of pitches based on segmentation and feed the results to the transformer generation loop

Intro

Starting with a disclaimer: I'm not an AI developer, so this whole idea might not pan out as expected.

The transformer operates as a sequence-to-sequence model: it takes an encoded image and the sequence of previous symbols (or tokens) and predicts the next one. Through training, it learns to anticipate the subsequent symbol based on the image and preceding symbols. This mechanism enables the transformer to grasp concepts such as the typical sequence of elements in musical staves, like the common occurrence of a clef followed by a time signature. Nonetheless, it can occasionally generate erroneous predictions, such as suggesting a note not present in the image, due to learned associations from its training data (referred to as hallucinations).

It remains uncertain whether we encounter this issue. At times, we notice the transformer making significantly inaccurate predictions, performing less effectively compared to models like https://github.com/BreezeWhite/oemer, which relies solely on segmentation, classification, and meticulously designed code for semantic interpretation.

Experiment

I'm considering an experiment that involves pre-processing the input data before feeding it into the transformer model. Specifically, we can use an approach like the one implemented in https://github.com/BreezeWhite/oemer to determine the pitches on the staff (or all symbols) in advance.

Once we have these pre-determined results, we can then compare each symbol generated by the transformer with the corresponding pre-processed result. By establishing a method to assess the likelihood of each prediction, we can make informed decisions about whether to accept the transformer's prediction.

If we determine that the transformer's prediction is less likely or inaccurate compared to the pre-processed result, we have two potential courses of action:

a) Rerun the transformer at the current step with a higher temperature setting to introduce more randomness into the sampling process, potentially yielding a different, more accurate prediction.

b) Disregard the specific prediction generated by the transformer and instead accept the result calculated through pre-processing. We would then continue with the transformer loop using this accepted result.

These strategies aim to optimize the reliability and accuracy of the transformer's predictions while maintaining flexibility in the decision-making process. However, further experimentation and validation would be necessary to assess the effectiveness of these approaches in practice.

Code pointers

https://github.com/liebharc/homr/blob/main/homr/staff_parsing_tromr.py#L27 this already calculates the pitches
https://github.com/liebharc/homr/blob/main/homr/transformer/decoder.py#L162 the decoder loop where we could use the information
https://github.com/liebharc/homr/pull/4 introduced confident values and center of attention to homr

liebharc / homr