legumeinfo / gcv

Federating genomes with love (and synteny derived from functional annotations)
https://gcv.legumeinfo.org/
Apache License 2.0

pan-gene/gene family identifiers can't contain dashes #1022

Closed ekcannon closed 7 months ago

ekcannon commented 7 months ago

The maize pan-genes have names like pan-zea.v2.pan00001, which breaks the GCV. It was necessary to change them to PanZea.v2.pan00001. While the change is fairly easy, it would be best to use the same names everywhere.

adf-ncgr commented 7 months ago

Hi @ekcannon, I might need a little more info on this. As far as I know, there's no reason the identifiers can't contain dashes, unless they cause some odd searching issues because of the way the redisearch system tokenizes things. Can you give some more detail on what broke when you used dashes?
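To illustrate the tokenization concern: RediSearch's default tokenizer for text fields treats punctuation such as `-` and `.` as term separators, so a hyphenated identifier may be indexed as several separate terms. The Python sketch below only imitates that splitting with a regular expression. It does not call RediSearch, and the `tokenize` helper and its separator set are illustrative assumptions rather than anything taken from the GCV code. Whether this affects GCV at all is the open question here.

```python
import re

# Separator characters for this sketch, modelled on the default punctuation
# list described in the RediSearch documentation (an assumption, not GCV code).
SEPARATORS = r"""[\s,.<>{}\[\]"':;!@#$%^&*()\-+=~]+"""


def tokenize(text):
    """Split text into lowercase terms on whitespace and punctuation."""
    return [term.lower() for term in re.split(SEPARATORS, text) if term]


# The hyphenated name yields one more term than the renamed one.
print(tokenize("pan-zea.v2.pan00001"))
print(tokenize("PanZea.v2.pan00001"))
```

If the identifiers are matched as exact strings rather than through full-text search, this splitting would not come into play.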

john-portwood commented 7 months ago

Hi Andrew,

The pan-zea.v2.pan00001 names in the GFA files were causing the micro synteny cluster to show no matches when searching for a gene model. The screenshot below shows an example of this behavior from a similar issue that I ran into when initially setting up GCV last summer (regrettably, I don't have a screenshot of the exact issue, but I can produce one if requested):

image

After I removed the hyphen from the pan-gene identifiers in the GFA files, the expected behavior returned, as shown in this screenshot:

image

Hope this helps, if you have any additional questions let me know.

Thanks! John

adf-ncgr commented 7 months ago

Thanks @john-portwood, that at least gives me a better sense of what to look for when I try to reproduce the behavior. Based on the screenshot, it looks like this shouldn't involve the redisearch matching at all, so I'm now very puzzled. It should be easy, though, to add some dashes to the GFA file in our test data and see what happens there. Thanks again for the extra info!

adf-ncgr commented 7 months ago

OK, I just tried to reproduce this according to my understanding of what you've described, but it seems to work just fine when there are dashes in the gene family identifiers, as shown in the screenshot below (where I simply substituted dashes for underscores in the GFA file from our test data). So I think it must have been some other issue you were having previously. I'm going to close this, since it seems the IDs can contain dashes without causing obvious problems, but feel free to reopen it or start a new issue if you can't make things work with the naming scheme you prefer.

image
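
For anyone who wants to repeat the substitution described above, here is a minimal sketch. It assumes the GFA file is tab-separated with the gene family identifier in the second column and comment lines starting with `#`; check that layout against your own files. The `dash_family_ids` helper and the sample identifiers are hypothetical and not taken from the GCV test data.

```python
def dash_family_ids(lines, family_col=1):
    """Yield lines with '_' replaced by '-' in one tab-separated column.

    family_col is the zero-based index of the family-id column
    (an assumption about the file layout).
    """
    for line in lines:
        # Pass comment and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if family_col < len(fields):
            fields[family_col] = fields[family_col].replace("_", "-")
        yield "\t".join(fields) + "\n"


# Hypothetical sample rows; only the family column is rewritten.
sample = [
    "#gene\tfamily\n",
    "gene_0001\tfam_v1.fam00001\n",
]
print("".join(dash_family_ids(sample)), end="")
```

To apply it to a real file, pass an open file handle as `lines` and write the yielded lines to a new file, leaving the original untouched.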