Duplicate alleles in JSON export due to differences in expression

jseager7 commented 1 year ago

When creating an allele that only differs by its expression from an existing allele, Canto still creates a new allele object, even though it could theoretically reuse the existing allele.

For example, if I create two wild type alleles, one with Overexpression and another with WT level expression…

…then two alleles end up being created in the JSON export.

"alleles" : {
  "Q00909:8c528101a7771fa8-1" : {
      "allele_type" : "wild_type",
      "gene" : "Fusarium graminearum Q00909",
      "name" : "TRI5+",
      "primary_identifier" : "Q00909:8c528101a7771fa8-1",
      "synonyms" : []
  },
  "Q00909:8c528101a7771fa8-2" : {
      "allele_type" : "wild_type",
      "gene" : "Fusarium graminearum Q00909",
      "name" : "TRI5+",
      "primary_identifier" : "Q00909:8c528101a7771fa8-2",
      "synonyms" : []
  }
}

This strangeness is due to the expression being a property of the parent genotype object, rather than the allele object itself.

"8c528101a7771fa8-genotype-18" : {
    "loci" : [
      [
          {
            "expression" : "Overexpression",
            "id" : "Q00909:8c528101a7771fa8-1"
          }
      ]
    ],
    "organism_strain" : "1104-14",
    "organism_taxonid" : 5518
},

I can't think of any problems with reusing the existing allele, though I realise that the code to figure out which allele to reuse could be a bit complicated.

While looking through the issue tracker I found the script fix-duplicate-allele.pl which might handle this case, though I haven't tested it yet.

For context, I noticed this problem due to writing a script to migrate data from older versions of PHI-base to the Canto JSON export format. One of the checks I wrote for the export checked for duplicated feature objects. The check involved simply concatenating all the object property values as strings and storing those strings in a lookup table. The duplicate alleles got caught when I merged the PHI-Canto curation sessions into this generated export.
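A minimal sketch of that kind of duplicate check, assuming the alleles are held as a dict keyed by allele ID, as in the export excerpt above. The function name `find_duplicate_alleles` and the choice to skip `primary_identifier` are assumptions for illustration, not the actual pipeline code. The identifier has to be left out, because it is the only property that differs between the two alleles.

```python
import json


def find_duplicate_alleles(alleles, ignore=("primary_identifier",)):
    """Return (first_id, duplicate_id) pairs for alleles whose properties match."""
    seen = {}        # fingerprint string -> first allele ID seen with it
    duplicates = []
    for allele_id, allele in alleles.items():
        # Concatenate the property values in a stable key order. The ID is
        # skipped (an assumption) because it is unique by construction.
        fingerprint = "|".join(
            json.dumps(allele[key], sort_keys=True)
            for key in sorted(allele)
            if key not in ignore
        )
        if fingerprint in seen:
            duplicates.append((seen[fingerprint], allele_id))
        else:
            seen[fingerprint] = allele_id
    return duplicates
```

Run against the two `TRI5+` alleles shown earlier, this would report the `-2` allele as a duplicate of the `-1` allele.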

I can just ignore these cases in my pipeline by checking the expression of the parent genotype object, so I'm not being held up by this problem as of yet.

kimrutherford commented 1 year ago

Hi James. Sorry about this problem. It's happening because very early on in Canto's development it was easier to store the expression as part of the allele. So if an allele has two different expression levels in different genotypes then there will be two rows in the allele table for that allele: there's an "expression" column in the allele table.

As I say a lot, that seemed like a good idea at the time. That representation matches the display but I think if I implemented it again I would separate the allele and the expression.

The export code moves the expression to the genotypes, as in your example, which leaves duplicate alleles with unique IDs. We've never had to think about this problem for PomBase because we de-duplicate all alleles when loading into our main database. We read allele information from multiple sources so de-duplication would be necessary for PomBase even if Canto didn't produce duplicates.

I think it would be too much work to separate the allele and expression in the Canto database because too much code would need to change. We could possibly do some extra processing in the JSON export code to remove the duplicates but we don't have time to do that at the moment.
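As a rough illustration of what such post-processing could look like on the consumer side, the sketch below collapses alleles that differ only by ID and rewrites the genotype loci to point at the surviving allele, leaving each locus's `expression` untouched. It is not Canto code: the function names are hypothetical, and it assumes the alleles and genotypes are available as two dicts shaped like the excerpts in the first comment.

```python
import json


def allele_key(allele):
    # Everything except the ID, serialised with sorted keys so that
    # property order does not matter.
    rest = {k: v for k, v in allele.items() if k != "primary_identifier"}
    return json.dumps(rest, sort_keys=True)


def dedupe_export(alleles, genotypes):
    """Return (kept_alleles, rewritten_genotypes) without mutating the inputs."""
    canonical = {}   # allele_key -> ID of the first allele with that key
    remap = {}       # every allele ID -> ID of the allele that is kept
    kept = {}
    for allele_id, allele in alleles.items():
        kept_id = canonical.setdefault(allele_key(allele), allele_id)
        remap[allele_id] = kept_id
        if kept_id == allele_id:
            kept[allele_id] = allele

    rewritten = {}
    for genotype_id, genotype in genotypes.items():
        new_loci = [
            # Copy each locus entry, changing only the allele ID it refers to.
            [{**entry, "id": remap.get(entry["id"], entry["id"])} for entry in locus]
            for locus in genotype["loci"]
        ]
        rewritten[genotype_id] = {**genotype, "loci": new_loci}
    return kept, rewritten
```

Because the expression already lives on the genotype, two genotypes can share one allele ID and still carry different expression values. The sketch does not try to merge genotypes that become identical after the rewrite.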

kimrutherford commented 1 year ago

> While looking through the issue tracker I found the script fix-duplicate-allele.pl which might handle this case, though I haven't tested it yet.

I think that was to fix a different problem. In that case there were duplicate rows in the allele table even in cases where the expression was unset, like for deletion alleles.

jseager7 commented 1 year ago

@kimrutherford Thanks for the information.

I don't think it's necessary to remove the duplicate alleles from the JSON before exporting, since I can just post-process the JSON export myself to remove the duplicates. Given that the alleles exist in the database, it arguably makes sense to export them and let data consumers choose what to do with them.

Could you let me know what logic you use for removing duplicate alleles for PomBase? Do you simply check for all alleles with key–value pairs that are identical to some other allele?

kimrutherford commented 1 year ago

> Could you let me know what logic you use for removing duplicate alleles for PomBase? Do you simply check for all alleles with key–value pairs that are identical to some other allele?

Yep, more or less. Every time a new allele is loaded, we query the database to see if there is an existing allele with the same gene, allele name, type and description. If there's an existing allele we use that rather than the new allele.

If we use an existing allele but the new allele has synonyms, we store any that aren't already in the database, attaching them to the existing allele. Also comments and notes from the new allele get attached to the existing one.
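The matching and synonym-merging steps described above could be sketched as follows. This is an in-memory stand-in for the database lookup, not the PomBase loader: `load_allele` and the `store` dict are hypothetical, and the handling of comments and notes is left out.

```python
def load_allele(store, new_allele):
    """Return the allele to use: an existing match if there is one, else the new one.

    `store` maps (gene, name, allele_type, description) to the stored allele.
    """
    key = (
        new_allele["gene"],
        new_allele["name"],
        new_allele["allele_type"],
        new_allele.get("description"),
    )
    existing = store.get(key)
    if existing is None:
        # No match: store a copy of the new allele with its own synonym list.
        stored = dict(new_allele, synonyms=list(new_allele.get("synonyms", [])))
        store[key] = stored
        return stored

    # Match found: keep the existing allele, adding only unseen synonyms.
    for synonym in new_allele.get("synonyms", []):
        if synonym not in existing["synonyms"]:
            existing["synonyms"].append(synonym)
    return existing
```

Calling `load_allele` for every allele in an export and using the returned allele in place of the one passed in gives the "use the existing allele" behaviour, with synonyms accumulating on the first copy.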

There is a bunch of PomBase-specific extra logic in our allele merging process. For example, if an existing allele has no description and we try to load an allele with a matching gene, allele name and type, we set the description of the existing allele to the description of the new allele, then use the existing allele. Mostly this won't be a problem for alleles just from Canto.
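A small sketch of the description back-fill rule mentioned above, again using a plain list in place of the database. The function name `find_or_add` is hypothetical, and only the one rule stated here is implemented. What should happen when the new allele is the one lacking a description is not covered in this thread, so the sketch simply adds it as a new allele.

```python
CORE_FIELDS = ("gene", "name", "allele_type")


def find_or_add(alleles, new_allele):
    """Return the matching stored allele, back-filling its description if empty."""
    for existing in alleles:
        if any(existing.get(f) != new_allele.get(f) for f in CORE_FIELDS):
            continue  # gene, name or type differs: not a candidate
        if existing.get("description") == new_allele.get("description"):
            return existing  # exact match on all four fields
        if not existing.get("description"):
            # The existing allele has no description: adopt the new one.
            existing["description"] = new_allele.get("description")
            return existing
    # No usable match: store a copy of the new allele.
    added = dict(new_allele)
    alleles.append(added)
    return added
```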

I'm happy to chat about this if you'd like to know more.

jseager7 commented 1 year ago

Thanks, that's helpful. I hadn't considered merging the allele synonyms so I'll try to implement that in my pipeline for the future (not sure how many synonyms we've curated at the moment, if any).

I'm happy for this issue to be closed as not planned if it's too much work.

pombase / canto

Duplicate alleles in JSON export due to differences in expression #2747