As revealed in #9, the two Falcon models use different attention layers: 40b has no `multi_query` config option and appears to be multi-query by default. Investigate this further to see whether the two models can share a single attention layer.
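One way the two variants could share a single attention layer is to parameterize the number of key/value heads: one shared KV head gives multi-query attention, while matching the query-head count gives classic multi-head attention. The sketch below is a minimal NumPy illustration of that idea; the function name, shapes, and parameters are hypothetical and not the repo's actual code.

```python
import numpy as np

def gqa_attention(q, k, v, n_heads, n_kv_heads):
    """Grouped-query attention sketch (hypothetical, illustrative shapes):
    n_kv_heads == 1       -> multi-query attention
    n_kv_heads == n_heads -> standard multi-head attention
    q: (seq, n_heads, head_dim); k, v: (seq, n_kv_heads, head_dim)."""
    seq, _, head_dim = q.shape
    group = n_heads // n_kv_heads          # query heads sharing one KV head
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=1)        # (seq, n_heads, head_dim)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)
```

With this shape, both models would flow through the same code path, differing only in the `n_kv_heads` value read from their configs.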