intermine / pombemine


(BJ) allele data #20

Closed ValWood closed 2 years ago

ValWood commented 2 years ago

Allele data only returns the allele description, but this is in the "allele >DB-identifier" column

The data for alleles is:

- name (symbol): e.g. wee1-50, cdc2-A21, cdc2-Y15F
- type: deletion, amino acid substitution, amino acid deletion, etc.
- description: e.g. D134K, 1-22 (these only make sense in the context of the 'type')

I think for the unique identifier we were going to use the allele name. This should be unique, although some alleles are currently 'un-named' in the legacy pre-PomBase curation data. I think we decided simply not to import these into pombemine (it is only 140 annotations, and we have prioritised fixing them).
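As a rough illustration of the record shape and the "skip un-named alleles" rule described above, here is a minimal sketch. The `Allele` class, its field names, and `index_by_name` are hypothetical and are not taken from the pombemine loader; it also assumes an un-named allele shows up with an empty or missing name, which may not match the real export.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Allele:
    """Hypothetical record with the three fields listed in the issue."""
    name: Optional[str]   # allele symbol, e.g. "wee1-50"; assumed empty/None if un-named
    allele_type: str      # e.g. "amino acid substitution"
    description: str      # e.g. "D134K"; only meaningful together with the type


def index_by_name(alleles: Iterable[Allele]) -> Dict[str, Allele]:
    """Index alleles by name, skipping un-named ones and rejecting duplicates."""
    indexed: Dict[str, Allele] = {}
    for allele in alleles:
        if not allele.name or not allele.name.strip():
            continue  # un-named legacy allele: not imported
        key = allele.name.strip()
        if key in indexed:
            raise ValueError(f"allele name is not unique: {key}")
        indexed[key] = allele
    return indexed
```

Raising on a duplicate name is a design choice for the sketch: if the name is meant to be the unique identifier, a collision is better surfaced than silently overwritten.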

danielabutano commented 2 years ago

@ValWood I think we decided to use the symbol (as the identifier) and, if it's not present, to use the description in parentheses. Before running a query, please remember to select the fields in the left panel.
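For reference, the fallback rule described here could be sketched roughly as follows. `allele_identifier` is a hypothetical helper, not the actual loader code, and it assumes a missing symbol arrives as `None` or an empty string.

```python
from typing import Optional


def allele_identifier(symbol: Optional[str], description: Optional[str]) -> Optional[str]:
    """Use the symbol if present; otherwise the description in parentheses."""
    if symbol and symbol.strip():
        return symbol.strip()
    if description and description.strip():
        # Assumption: the fallback is the bare description wrapped in parentheses.
        return f"({description.strip()})"
    return None  # nothing usable as an identifier
```

With this rule, `allele_identifier("fcp1-D170A", "D170A")` gives the symbol unchanged, while an allele with no symbol and description `D134K` gets the identifier `(D134K)`.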

image

ValWood commented 2 years ago

Something is a bit mixed up because most of the entries in the "allele symbol" column are not allele symbols:

None of these are symbols (these are all allele descriptions)

Screenshot 2022-01-21 at 17 56 18

ValWood commented 2 years ago

I wonder if we are exporting some things in the allele symbol field that are not symbols @kimrutherford

ValWood commented 2 years ago

Actually something is more mixed up here.

Looking at the first row in @danielabutano's screenshot, it says

allele description G521T, etc. and allele type "nucleotide mutation", but this is an "amino acid substitution"

ValWood commented 2 years ago

> @ValWood I think we decided to use the symbol (as identifier) and if it's not present, use the description between parentheses.

I thought we considered that at first, but later decided just not to import this small number, since a) the number of lost annotations would be so low (it is less than 0.1% of our total phenotype annotations), b) they are probably already covered correctly elsewhere anyway, and c) we are prioritising fixing these old annotations.

They look a bit strange because all of our symbol identifiers include the gene name (and unfortunately these odd-bods are the ones that are immediately visible at the top of the table).

~My current feeling, from what I can see from the top 250 rows of the table is that maybe the allele ~synonym~ symbol is not being loaded?~

ValWood commented 2 years ago

> My current feeling, from what I can see from the top 250 rows of the table is that maybe the allele ~synonym~ symbol is not being loaded?

This was incorrect, but the allele description is being appended to the allele ~synonym~ symbol: oxa1-I107A(I107A) oxa1-I107S(I107S) oxa1-L111A(L111A) oxa1-L111W(L111W)
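One way to spot and undo this pattern might be a small regular expression that splits a trailing parenthesised description off the symbol. This is only a sketch: `split_appended_description` is a hypothetical helper, and it assumes the description is always appended directly to the symbol with no space, as in the four examples above.

```python
import re
from typing import Optional, Tuple

# "<symbol>(<description>)": symbol has no spaces or parentheses,
# and the parenthesised part runs to the end of the string.
_APPENDED = re.compile(r"^(?P<symbol>[^()\s]+)\((?P<description>[^()]+)\)$")


def split_appended_description(value: str) -> Tuple[str, Optional[str]]:
    """Return (symbol, description); description is None if nothing was appended."""
    text = value.strip()
    match = _APPENDED.match(text)
    if match is None:
        return text, None
    return match.group("symbol"), match.group("description")
```

For example, `split_appended_description("oxa1-I107A(I107A)")` separates the value into `oxa1-I107A` and `I107A`, while a plain symbol such as `wee1-50` is returned unchanged with `None` as the description.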

rachellyne commented 2 years ago

@val the data are correct according to the file we have. We can probably decide to leave the odd ones (where we have appended the description) out on the next build. We can look in more detail when we next meet.

ValWood commented 2 years ago

Hmm, you are right. I have a log file with a list of 285 to fix, but these are not included. Clearly there are more than we anticipated. I will get them added to the logs and fixed!

@kimrutherford I'll open a ticket about this.

ValWood commented 2 years ago

Actually it is a bit of a mixed bag.

For fcp1 only one allele is incorrectly named. https://www.pombase.org/gene/SPAC19B12.05c

1-486 (487-723 Δaa) which is at the top of the list:

Screenshot 2022-01-24 at 12 19 48

The rest of the fcp1 alleles are correctly described. However, for all of these the allele symbol is missing its prefix "fcp1-".

Screenshot 2022-01-24 at 12 21 57

All PomBase allele symbols should begin with the 3-letter code (those which don't have this prefix are incorrectly recorded; I am fixing them). So it is correct that you are currently displaying fcp1-1-486 as **** (because that was incorrectly recorded). I just fixed it.

But it is incorrect to display fcp1-D170A as D170A.

The allele symbol is fcp1-D170A; the allele description is D170A. The descriptions are standardized depending on allele type.

Some (but not all) alleles are named exactly as their description (more recently, people have stopped giving alleles names like wee1-50 and instead just use names like wee1-G850E). I think this is where the confusion between allele symbol and allele description arises.
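A quick check for the prefix rule described above might look like the sketch below. `has_gene_prefix` and `missing_prefix` are hypothetical names, and the sketch assumes each record is available as a (gene name, allele symbol) pair and that the prefix is the gene name followed by a hyphen, as in fcp1-D170A.

```python
from typing import Iterable, List, Tuple


def has_gene_prefix(symbol: str, gene_name: str) -> bool:
    """True if the symbol is '<gene_name>-' followed by at least one character."""
    prefix = f"{gene_name}-"
    return symbol.startswith(prefix) and len(symbol) > len(prefix)


def missing_prefix(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (gene_name, symbol) pairs whose symbol lacks the gene prefix."""
    return [(gene, symbol) for gene, symbol in records if not has_gene_prefix(symbol, gene)]
```

Run over the displayed table, this would flag a row showing `D170A` for fcp1 but accept `fcp1-D170A`, which could help separate display problems from symbols that were recorded incorrectly at the source.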

ValWood commented 2 years ago

I just had a thought. I have been fixing the alleles that did not have the correct prefixes over the past couple of months, so maybe this is why there are differences. What date is the data you are currently loading from?

rachellyne commented 2 years ago

November!