schneebergerlab / msyd

MIT License

Feature Request: Consistent access to genome ranges #5

Open mnshgl0110 opened 1 year ago

mnshgl0110 commented 1 year ago

Currently, the pansyn object treats the reference genome differently than the other genomes. Consequently, different steps are required to access the coordinates for the reference and for the other genomes.

print(df.iloc[0][0])
Pansyn(Range(chm13, NC_060946.1, NaN, 10588391, 12801843), {'mat': Range(mat, CM039032.1, NaN, 3783261, 5889298), 'pat': Range(pat, CM039055.1, NaN, 1, 1723416)})

Is there any specific reason for this? If not, wouldn't it be better to treat all genomes similarly and allow a consistent parsing scheme? I guess this would also be useful when we start identifying crossyn between query genomes only.
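
To illustrate the kind of consistent access meant here, below is a minimal sketch. The `Range` and `Pansyn` dataclasses are simplified stand-ins, not the actual msyd classes; the field names (`org`, `chr`, `haplotype`, `ranges_dict`) and the helpers `get_range` and `all_ranges` are hypothetical, and `None` stands in for the `NaN` haplotype shown in the output above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Range:
    # Simplified stand-in for a genomic range (assumed field names).
    org: str
    chr: str
    haplotype: Optional[str]
    start: int
    end: int


@dataclass
class Pansyn:
    # Stand-in: one special reference Range plus a dict of query Ranges.
    ref: Optional[Range]
    ranges_dict: Dict[str, Range] = field(default_factory=dict)


def get_range(pansyn: Pansyn, org: str) -> Range:
    """Return the Range for `org`, whether it is the reference or a query genome."""
    if pansyn.ref is not None and pansyn.ref.org == org:
        return pansyn.ref
    return pansyn.ranges_dict[org]


def all_ranges(pansyn: Pansyn) -> Dict[str, Range]:
    """Return one dict covering every genome, reference included."""
    merged = dict(pansyn.ranges_dict)
    if pansyn.ref is not None:
        merged[pansyn.ref.org] = pansyn.ref
    return merged


ps = Pansyn(
    Range("chm13", "NC_060946.1", None, 10588391, 12801843),
    {
        "mat": Range("mat", "CM039032.1", None, 3783261, 5889298),
        "pat": Range("pat", "CM039055.1", None, 1, 1723416),
    },
)

# The same call works for the reference and for the query genomes.
for org in ("chm13", "mat", "pat"):
    r = get_range(ps, org)
    print(org, r.chr, r.start, r.end)
```

With something like this, downstream code would not need to know which genome happens to be the reference.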

lrauschning commented 1 year ago

Hi Manish, the reason for this is that the reference genome has a special role in the synteny intersection algorithm, as it's the genome the regions are "joined" on. The way I thought of handling reference-free multisynteny calling would be to leave the reference field as None. That will not work with the intersection algorithm as it is implemented at the moment, since it is inherently reference-based, but it would be possible to adapt it. I've been thinking of writing an associated function that shifts the reference Range into the main dict and leaves the reference field empty, to seamlessly transition between reference-based and non-reference-based identification. A reverse function that takes an organism from the dict and shifts it to the ref field could then be used for adapting to reference-free synteny intersection, perhaps imputing the CIGAR strings for each entry in the Cigars dict. Do you think this approach makes sense?
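
Roughly, the two functions could look like the sketch below. This uses simplified stand-in dataclasses rather than the real msyd objects; the names `ref_to_dict` and `dict_to_ref` and the field names are hypothetical, `None` stands in for the `NaN` haplotype, and the CIGAR handling (including any imputation) is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Range:
    # Simplified stand-in for a genomic range (assumed field names).
    org: str
    chr: str
    haplotype: Optional[str]
    start: int
    end: int


@dataclass
class Pansyn:
    # Stand-in: `ref` may be None in the reference-free case.
    ref: Optional[Range]
    ranges_dict: Dict[str, Range] = field(default_factory=dict)


def ref_to_dict(pansyn: Pansyn) -> Pansyn:
    """Shift the reference Range into the dict and leave the ref field as None."""
    if pansyn.ref is None:
        return pansyn  # already reference-free
    ranges = dict(pansyn.ranges_dict)
    ranges[pansyn.ref.org] = pansyn.ref
    return Pansyn(None, ranges)


def dict_to_ref(pansyn: Pansyn, org: str) -> Pansyn:
    """Take `org` out of the dict and make it the reference."""
    if pansyn.ref is not None:
        raise ValueError("ref field is already occupied; call ref_to_dict first")
    ranges = dict(pansyn.ranges_dict)
    new_ref = ranges.pop(org)  # raises KeyError if `org` is not present
    return Pansyn(new_ref, ranges)


ps = Pansyn(
    Range("chm13", "NC_060946.1", None, 10588391, 12801843),
    {
        "mat": Range("mat", "CM039032.1", None, 3783261, 5889298),
        "pat": Range("pat", "CM039055.1", None, 1, 1723416),
    },
)

free = ref_to_dict(ps)             # reference-free form: ref is None
on_mat = dict_to_ref(free, "mat")  # re-anchored on a query genome
back = dict_to_ref(free, "chm13")  # round trip back to the original
```

Both functions return new objects instead of mutating in place, which keeps the round trip easy to check; an in-place variant would work just as well.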