microbialphenotypes / OMPwiki

Bug reports, feature requests, etc. for OMPwiki
microbialphenotypes.org

Nichols phenotype browser: ECK accession ids that lack gene names #14

Open dsiegele opened 7 years ago

dsiegele commented 7 years ago

ECK2859 is ygeQ
ECK2647 is ypjC
ECK1128 is ymfH
ECK1933 is yedM
ECK4219 is yzfA
ECK4265 is yjgW
ECK1453 is yncM
ECK3675 is glvC
ECK2647 is ypjC
ECK0369 is yaiU
ECK2856 is ygeN
ECK2132 is yohH
ECK2675 is ygaY
ECK2859 is ygeQ
ECK0679 is ybfH
ECK1132 is ymfT

sandyl27 commented 6 years ago

Hey Debbie, where did you find these errors?

dsiegele commented 6 years ago

Sandy,

These are strains used by Nichols et al. that had only ECK identifiers, but not gene names. I looked up the gene names in EcoCyc based on the ECK_IDs.

If you enter any of these ECK_IDs into the strain box of the data browser, it returns fitness scores for the various conditions tested, but the rows are identified only by the ECK_ID. For example, go to this page http://ecoliwiki.net/tools/chemgen/?qtype=s_growth&item1=ECK2859 and click submit.

For most other strains, if you enter either the gene name or the ECK_ID into the strain box (for example, either arcA or ECK4393), the rows that are returned are identified by both the ECK_ID and the gene name.
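A minimal sketch of how the example query URL above could be built for any strain, so each ECK_ID can be checked quickly. The `qtype` and `item1` parameters are taken from the URL quoted above; the function name `strain_query_url` is hypothetical, and the sketch only builds the link (it does not fetch or submit the page).

```python
from urllib.parse import urlencode

# Base address of the EcoliWiki data browser, as quoted in this thread.
BASE_URL = "http://ecoliwiki.net/tools/chemgen/"


def strain_query_url(strain_id: str) -> str:
    """Return the growth-data query URL for one strain (ECK_ID or gene name)."""
    params = {"qtype": "s_growth", "item1": strain_id}
    return f"{BASE_URL}?{urlencode(params)}"


print(strain_query_url("ECK2859"))
# http://ecoliwiki.net/tools/chemgen/?qtype=s_growth&item1=ECK2859
```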

dsiegele commented 6 years ago

I came across some additional problems:

1) I found more strains in the data browser that don't have gene names: ECK4426 and ECK3474. There are probably more; I will look through the list and see what I find.

2) I found a strain that has the wrong gene name. ECK2858 should be named ygeP, but is named ECK2858-ygeQ' in the data browser. ECK2859, which is one of the strains that doesn't have a gene name, should be named ygeQ. This mistake isn't in the list of strains in the Nichols paper (Table S2, column 1).

3) The strain list and the data browser have different numbers of strains. If you enter a condition, such as novobiocin-12, you get back information for 3,979 entries, whereas if you click on the box 'List strains,' you get a list that contains only 3,967 entries.

I am going to compare the 3,967 strains with the strain list from Nichols_TableS2 and double check all the gene names with what is in EcoCyc.

dsiegele commented 6 years ago

1) The difference in the number of strains is due to the 12 rows that appear to be duplicated in TableS2. The duplicated rows are:
ECK0295-YKGO
ECK1323-YMJC'
ECK1544-GNSB
ECK1556-HOKD
ECK1824-MGRB
ECK2613-SMPA
ECK3357-YHFL
ECK3531-DPPA
ECK4410-YDGU
ECK4415-YPFM
ECK4416-RYFB
ECK2593-A-YFIO* - Truncation

2) How were the duplicates handled in the data browser? I searched for one of the duplicated strains, ECK1323, and the condition novobiocin. There were two entries for each condition. So the data browser has the data for each of the duplicates, whereas the strain list has only 1 listing of each strain.
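The duplicate check described above could be sketched roughly as follows. This assumes the first column of TableS2 has already been read into a plain list of row labels; the sample list is a toy illustration (the non-duplicated label is made up for the example), and the function names are hypothetical.

```python
from collections import Counter


def duplicated_labels(labels):
    """Return the labels that occur more than once, sorted alphabetically."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


def count_summary(labels):
    """Return (total rows, unique labels), e.g. data browser vs. strain list."""
    return len(labels), len(set(labels))


# Toy sample: two labels appear twice, one appears once.
rows = [
    "ECK4393-ARCA",   # illustrative non-duplicated row
    "ECK1323-YMJC'",
    "ECK1544-GNSB",
    "ECK1323-YMJC'",
    "ECK1544-GNSB",
]

print(duplicated_labels(rows))  # ["ECK1323-YMJC'", 'ECK1544-GNSB']
print(count_summary(rows))      # (5, 3)
```

If each of the 12 duplicated labels appears exactly twice, the same arithmetic gives 3,979 - 12 = 3,967, which matches the two counts reported above.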

dsiegele commented 6 years ago

There are 41 rows in the data browser that are missing the gene name that goes with the ECK_ID. I will get the gene names for these from EcoCyc.

ECK0012 ECK0017 ECK0266 ECK0320 ECK0359 ECK0367 ECK0369 ECK0503 ECK0619 ECK0679
ECK1128 ECK1132 ECK1159 ECK1160 ECK1453 ECK1933 ECK1990
ECK2132 ECK2331 ECK2636 ECK2637 ECK2647 ECK2650 ECK2651 ECK2652 ECK2675 ECK2854 ECK2856 ECK2859 ECK2994
ECK3474 ECK3672 ECK3675 ECK3769 ECK3802
ECK4097 ECK4219 ECK4265 ECK4330 ECK4334 ECK4426
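One way the fix could look, as a rough sketch: keep a lookup from ECK_ID to gene name and use it to build the row label. The four pairs below are a subset of the names I listed in my first comment; the `ECK_ID-gene` label format is assumed from how other rows appear in the data browser, and the function names are hypothetical, not the data browser's actual code.

```python
# Subset of the ECK_ID -> gene name pairs listed earlier in this thread.
GENE_NAMES = {
    "ECK2859": "ygeQ",
    "ECK2647": "ypjC",
    "ECK1128": "ymfH",
    "ECK3675": "glvC",
}


def row_label(eck_id, gene_names=GENE_NAMES):
    """Return 'ECK_ID-gene' when a gene name is known, else the bare ECK_ID."""
    gene = gene_names.get(eck_id)
    return f"{eck_id}-{gene}" if gene else eck_id


def still_missing(eck_ids, gene_names=GENE_NAMES):
    """Return the ECK_IDs that have no gene name in the lookup yet."""
    return [eck_id for eck_id in eck_ids if eck_id not in gene_names]


ids = ["ECK2859", "ECK3675", "ECK4426"]
print([row_label(i) for i in ids])  # ['ECK2859-ygeQ', 'ECK3675-glvC', 'ECK4426']
print(still_missing(ids))           # ['ECK4426']
```

Running `still_missing` over the full list of 41 would show which IDs still need to be looked up in EcoCyc.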

sandyl27 commented 6 years ago

Oh, so is this on EcoliWiki or OMP?

dsiegele commented 6 years ago

I searched for one of the duplicated strains, ECK1323, and the condition novobiocin. There were two entries for each condition. This explains the difference in the number of entries when you search for a condition and the number of strains in the strain list.

dsiegele commented 6 years ago

I didn't find the data browser on OMP until today. Last night, I saw that the link to the data browser on the main page of OMP goes to the data browser at EcoliWiki. When I searched for the string "Nichols," I found only papers and the link on the main page. It occurred to me this afternoon that the string search might not have searched the Special Pages, so I went to the list of Special Pages and found the link to the Nichols data browser that is on OMP.

On both versions of the Nichols data browser, the ECK_IDs I listed above need gene names added.

The OMP Nichols data browser is missing the boxes that link to the List of strains and the List of Conditions.

On the OMP version of the Nichols data browser, if I select the radio button 'Growth data (Strain/condition)' and enter a specific condition, such as Novobiocin-12, I get a list of the fitness scores for all 3,979 strain rows from the Nichols paper, which indicates to me that the strain list includes the 12 strains that were done in duplicate. However, if I select the radio button 'Growth data (Strain/condition),' enter ECK1323, which is one of the strains done in duplicate, and a condition, such as Novobiocin, I get fitness scores for only one of the 2 duplicates. In contrast, if I do the same thing on the EcoliWiki data browser, I get both sets of fitness scores.

I can see arguments for including both sets of data for each of these 12 strains or for only including one of the strains in the data browser. Whatever we decide, we should try to make the number of strains consistent. We could give the duplicate samples different names to keep them separate. Maybe ECK1323 (row 586) and ECK2859-ygeQ (row 3299)???
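If we go with separate names, a rough sketch of the renaming is below. It assumes each row is available as a (row number, label) pair; the row numbers in the example are toy values, not the real TableS2 row numbers, and the function name `disambiguate` is hypothetical.

```python
from collections import Counter


def disambiguate(rows):
    """Append ' (row N)' to labels that occur more than once; leave others as-is."""
    counts = Counter(label for _, label in rows)
    return [
        f"{label} (row {row})" if counts[label] > 1 else label
        for row, label in rows
    ]


# Toy rows: (row number, label). Only ECK1323 is duplicated here.
rows = [(1, "ECK4393"), (2, "ECK1323"), (3, "ECK1323")]
print(disambiguate(rows))
# ['ECK4393', 'ECK1323 (row 2)', 'ECK1323 (row 3)']
```

Only the duplicated strains get the suffix, so the names of all the other strains stay unchanged in both the strain list and the data browser.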