waldronlab / BugSigDB

A microbial signatures database
https://bugsigdb.org

Ontology export columns are incomplete #92

Closed lgeistlinger closed 2 years ago

lgeistlinger commented 3 years ago

@tosfos I'd like to filter signatures exported from BugSigDB by review status for #80. This currently does not seem possible, as the only column in the export CSV files that relates to review status is Revision editor.

However, the field Revision editor has value "WikiWorks743" for reviewed content (see eg https://bugsigdb.org/Study_255) as well as for content that still needs to be reviewed (see eg https://bugsigdb.org/Study_400). Can this be changed so that this column is empty or NA in the export for content that still needs to be reviewed?

tosfos commented 3 years ago

That sounds good. Should we also add code for the wiki to ignore the WikiWorks... users for these fields? Usually our edits are not very scientifically useful so we shouldn't be getting credit :).

lwaldron commented 3 years ago

Yeah that would make sense to hide the Wikiworks74 etc curations :)

lgeistlinger commented 3 years ago

There are some more considerations to this though. From the discussion arising in #93, it becomes clear that we would actually like the exported column "Revision editor" to list the person who marked the content as reviewed as @ftzohra22 points out in https://github.com/waldronlab/BugSigDB/issues/93#issuecomment-880132766.

lwaldron commented 3 years ago

It seems "Revision editor" is correct, but "Reviewer" could be another column in the export.

lgeistlinger commented 3 years ago

Agreed.

lwaldron commented 3 years ago

Adding a priority label to this as we will make our first versioned data release (https://github.com/waldronlab/BugSigDBExports/issues/4) as soon as it is resolved.

lgeistlinger commented 3 years ago

Moving @lwaldron's comment from #93 here:

> I also noted that the "Complete/Incomplete" status would be worth exporting too. So the two items here are to include in the export
>
> 1. "Reviewed by" (missing or who marked study/experiment/signature as reviewed), and
> 2. Complete or Incomplete (or TRUE/FALSE)

lgeistlinger commented 3 years ago

I'd like to add two more columns in the export:

  1. The EFO ID of the corresponding condition investigated (such as EFO:0001075 for "ovarian carcinoma")
  2. The UBERON ID of the corresponding body site sampled (such as UBERON:0000341 for "throat")

as it arises from https://github.com/waldronlab/BugSigDB/issues/55#issuecomment-877253482. This will facilitate ontology-based queries downstream of the export in eg bugsigdbr and/or BugSigDBStats.

tosfos commented 3 years ago

Sorry. I'm not clear on how to proceed with the WikiWorks user removal. Right now Curator and Revision Editor can list a WikiWorks user. Do we want to change either one or both? And if we are removing the WikiWorks user, is it OK if we end up with these fields as blank?

tosfos commented 3 years ago

> It seems "Revision editor" is correct, but "Reviewer" could be another column in the export.

Do we also want the "Reviewer" field to be prominently displayed next to the Revision Editor field (on the site itself)? Or is OK that you need to hover over the (!) icon in order to see it?

lwaldron commented 3 years ago

> Sorry. I'm not clear on how to proceed with the WikiWorks user removal. Right now Curator and Revision Editor can list a WikiWorks user. Do we want to change either one or both? And if we are removing the WikiWorks user, is it OK if we end up with these fields as blank?

I don't feel strongly about this, but blank seems as informative as a WikiWorks user, so fine with me. I would want to search for those pages where both curator and revision editor are blank, and make sure they are at least reviewed. It doesn't seem worthwhile to dig out the pre-wiki curators of these pages again.

> Do we also want the "Reviewer" field to be prominently displayed next to the Revision Editor field (on the site itself)? Or is OK that you need to hover over the (!) icon in order to see it?

Again I don't feel strongly about this. Both seem OK, although displaying the Reviewer may be a bit better to recognize reviewing as an important contribution. FYI in case this helps inform a decision: we currently have many unreviewed pages, and although we will catch up some, reviewing is almost as much work as the original curation and it seems likely that curation will always outpace reviewing.

tosfos commented 3 years ago

> Complete or Incomplete (or TRUE/FALSE)

This part is now "Complete". The requested column can be found in any CSV.

tosfos commented 3 years ago

> It seems "Revision editor" is correct, but "Reviewer" could be another column in the export.

Done.

tosfos commented 3 years ago

> I'd like to add two more columns in the export:
>
> 1. The EFO ID of the corresponding condition investigated (such as [EFO:0001075](http://www.ebi.ac.uk/efo/EFO_0001075) for "ovarian carcinoma")
> 2. The UBERON ID of the corresponding body site sampled (such as [UBERON:0000341](http://purl.obolibrary.org/obo/UBERON_0000341) for "throat")

Please see if this looks OK in the new CSVs.

lgeistlinger commented 3 years ago

As noted in #94, it is problematic for downstream applications that these changes to the export files also cause the link locations to change. It would be preferable to have stable links, such as eg https://bugsigdb.org/export/experiments.csv, that always point to the latest version of these bulk export files.

lgeistlinger commented 3 years ago

Columns State (for completion status) and Reviewer (for review status) look great.
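For anyone working directly on the exported files, here is a minimal Python sketch of the filtering that these two columns make possible. The column names State and Reviewer come from this thread; the value "Complete", the convention that unreviewed rows have a blank Reviewer, the Study column, and the sample rows are all assumptions made up for illustration.

```python
import csv
import io


def filter_signatures(csv_text, require_reviewed=True, require_complete=True):
    """Return export rows that pass the review / completion filters."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: a blank Reviewer cell means "not yet reviewed".
        reviewed = bool((row.get("Reviewer") or "").strip())
        # Assumption: the State column holds "Complete" or "Incomplete".
        complete = (row.get("State") or "").strip().lower() == "complete"
        if require_reviewed and not reviewed:
            continue
        if require_complete and not complete:
            continue
        kept.append(row)
    return kept


# Made-up rows for illustration only; not real BugSigDB content.
SAMPLE = """Study,State,Reviewer
Study A,Complete,ExampleReviewer
Study B,Complete,
Study C,Incomplete,ExampleReviewer
"""

reviewed_and_complete = filter_signatures(SAMPLE)
print([row["Study"] for row in reviewed_and_complete])
```

Passing `require_complete=False` keeps every reviewed row regardless of completion status.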

lgeistlinger commented 3 years ago

EFO / UBERON ID: It looks like https://github.com/waldronlab/BugSigDB/issues/55#issuecomment-877253482 precedes this, as these columns are currently mostly blank. We are missing eg the corresponding EFO ID for terms like "adenoma". This is true for the export files but also for the pages themselves (eg https://bugsigdb.org/Study_1/Experiment_1), where links to the EFO IDs / EFO pages are missing.

Also, IDs such as EFO:0001075 and UBERON:0000341 are currently abbreviated as 1075 and 341 in the export columns, but I think it would be preferable to have them exported fully spelled out.

tosfos commented 3 years ago

I figured out what's happening. The leading zeros are being exported, as you can see in a text editor: image

The spreadsheet application is interpreting these as numbers and removing the leading zeros, but it's not the application's fault. Semantic MediaWiki should be surrounding these fields with quotes, since they are stored internally as a text data type. This is likely a bug with Semantic MediaWiki. We'll research this and see what we can do.

tosfos commented 3 years ago

My previous comment was incorrect. Semantic MediaWiki is doing everything right. The SMW CSV exporter simply uses the PHP native fputcsv to create its CSV files. And the CSV spec appears to say that numbers don't need to be wrapped in quotes. So the spreadsheet application is what is at fault since technically all CSV fields should be interpreted as text, but I guess it's doing its best to figure out what a user would expect.

If we want to work around this we would need to modify the extension to use our own custom CSV encoder, which would be a bit of a project and probably not a good idea.
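The leading-zero behaviour described above can be reproduced without a spreadsheet. The sketch below is only an illustration: the column name "EFO ID" and the sample row are made up, and `int()` stands in for the type guessing a spreadsheet application performs. Python's `csv` module returns every field as a string, so the zeros survive until something converts the value to a number.

```python
import csv
import io

# Made-up export fragment: the ID is unquoted, as a plain CSV writer emits it.
RAW = "Condition,EFO ID\novarian carcinoma,0001075\n"

rows = list(csv.DictReader(io.StringIO(RAW)))

as_text = rows[0]["EFO ID"]  # the csv module always yields strings
as_number = int(as_text)     # stand-in for a spreadsheet's numeric guess

print(as_text)    # 0001075
print(as_number)  # 1075
```

Readers who consume the export programmatically can therefore keep the IDs intact simply by treating these columns as text.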

lgeistlinger commented 3 years ago

Right, the IDs are indeed exported with leading zeros (my bad), but would it be possible to export them with leading "EFO:" for condition and leading "UBERON:" for body site? Not a problem if not, as we can also add them downstream in our R application. It would just be more convenient and more straightforward to use for everyone who doesn't use the R application but rather works on the exported files themselves.

lwaldron commented 3 years ago

"EFO:" and "UBERON:" are also not a bad idea to include if it's not difficult because there are also other potentially relevant ontologies. We could in theory mix ontologies and use downstream tools to map them, and even if we never do that, it's more communicative to someone downloading the file who isn't familiar with its contents.

lgeistlinger commented 3 years ago

It's a great point. As we already observed (https://github.com/waldronlab/BugSigDB/issues/55#issuecomment-801474939), EFO is actually an umbrella ontology: not all EFO IDs start with "EFO:"; others start with "CHEBI:", "Orphanet:", "HP:", "MONDO:", and so on. In numbers, only 11,338 out of 27,175 terms in the EFO start with "EFO:". We thus indeed need the prefixes as available from the Term ID <-> Term name mappings.
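To illustrate why the prefix has to travel with the ID, here is a small Python sketch that splits identifiers of the form PREFIX:LOCAL and tallies them by source ontology. The helper name `curie_prefix` is hypothetical; the example IDs are ones that appear in this thread, plus a bare "1075" to show what is lost when the prefix is dropped.

```python
from collections import Counter


def curie_prefix(term_id):
    """Return the ontology prefix of a 'PREFIX:LOCAL' identifier, or None."""
    prefix, sep, local = term_id.partition(":")
    return prefix if sep and prefix and local else None


ids = ["EFO:0001075", "MONDO:0005140", "Orphanet:398934", "UBERON:0000341", "1075"]

# Bare numbers map to None: their source ontology can no longer be determined.
counts = Counter(curie_prefix(term_id) for term_id in ids)
print(counts)
```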

tosfos commented 3 years ago

We combined the two columns as requested: image

Please let me know if this looks OK now.

lwaldron commented 3 years ago

Looks good! My only question is why so many EFO IDs are missing, since all the condition terms were taken from the EFO?

--

Levi Waldron

Associate Professor

Department of Epidemiology and Biostatistics

CUNY Graduate School of Public Health and Health Policy

Institute for Implementation Science in Population Health

55 W 125th St, New York NY 10035

https://waldronlab.io

Join the microbiome Virtual International Forum: https://microbiome-vif.org

ftzohra22 commented 3 years ago

I also noticed these on CSV exports (my filters were 'streptococcus' 'homo sapiens' and 'feces') :

  1. Some conditions and EFO missing as @lwaldron mentioned: My guess is, this could be because a lot of them aren't entered correctly (not validated terms), for example: (https://bugsigdb.org/Study_426) 'Autoimmune Hepatitis' entered as lower case 'autoimmune hepatitis', might have resulted in empty export cells for both condition and EFO.
  2. It looks like this is also the case for the "matched on" column: 'sex' is missing from the CSV, and it isn't listed as a validated term on the wiki.

image

lgeistlinger commented 3 years ago

Two points here:

  1. @tosfos: where are the mappings term name (eg "obesity") -> term ID (eg "EFO:00...") coming from? They seem to be incomplete.
  2. @ftzohra22 @rimsha1 @lwaldron : we need reviewers to continuously screen for invalid entries entered by curators in the condition, body site, matched on, and confounders controlled for fields. The clean-up pages are supposed to assist with that.

tosfos commented 3 years ago

They come from the glossary. Right now, we'd need to create a new page for each condition (although we could import a CSV). However, this also depends on #89. If we go ahead with that, we'll be able to query their API and remove the need for the Glossary.

lwaldron commented 2 years ago

I think this is higher priority than #89 and needed for basic functionality, whereas #89 is more of a wishlist. For now, the following would suffice for basic functionality:

  1. a page for each condition
  2. the ontology columns being complete in the export
  3. ideally, the complete EFO and Uberon ontologies in the system for autocomplete etc, but we could live longer with having to enter manually on an as-needed basis like we have been.

lwaldron commented 2 years ago

Adding the bug label specifically for the incomplete ontology columns in the export. That is the final remaining item for our first "official" export of bugsigdb, see https://github.com/waldronlab/BugSigDBExports/issues/4#issuecomment-917712225. This is a high priority item.

lgeistlinger commented 2 years ago

Hi @tosfos: how do you think we should best proceed with this? Should I provide a CSV for importing, or will you go ahead with pulling the mappings directly from EFO and UBERON?

tosfos commented 2 years ago

Sorry about the delay! We'll send over a CSV shortly with a header row and an example row. That will help you fill it out.

tosfos commented 2 years ago

Glossary Import Format.csv

Please use the attached CSV. Here are some notes on the fields:

  1. Title will be used as the page name
  2. Term should be lowercase (or an abbreviation), so that we can match it up with the current Body site and Condition fields.
  3. Definition should be brief and in plain text.
  4. Alias can accept multiple values, delimited with a semicolon ;
  5. Link must be a valid URL pointing to UBERON or EFO (for the Body site or Condition terms respectively - see the Oral gland example).
  6. Latin is a single Latin equivalent for the term.

lgeistlinger commented 2 years ago

Sounds good. Which field is used to store the term ID (eg "EFO:0001075") for a given term name (eg "ovarian carcinoma")?

tosfos commented 2 years ago

We should be able to derive this automatically from the URL. I don't think we need to worry about that field.
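A minimal Python sketch of how such a derivation could work, assuming the link ends in a PREFIX_LOCAL segment like the two example URLs quoted earlier in this thread. The function name `term_id_from_url` is hypothetical, and this is not necessarily how the wiki implements it.

```python
from urllib.parse import urlparse


def term_id_from_url(url):
    """Turn '.../EFO_0001075' into 'EFO:0001075'; return None if no match."""
    # Take the last path segment, ignoring any trailing slash.
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    # Split on the first underscore: ontology prefix, then local identifier.
    prefix, sep, local = last_segment.partition("_")
    if not (sep and prefix and local):
        return None
    return f"{prefix}:{local}"


print(term_id_from_url("http://www.ebi.ac.uk/efo/EFO_0001075"))
print(term_id_from_url("http://purl.obolibrary.org/obo/UBERON_0000341"))
```

Links that encode the identifier differently (for example in a URL fragment) would return None here and would need separate handling.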

lgeistlinger commented 2 years ago

Alright. Here is a look at the information that is typically available for a term in the official EFO OBO release, using the term "ovarian carcinoma" as an example:

[Term]
id: EFO:0001075
name: ovarian carcinoma
def: "A malignant neoplasm originating from the surface ovarian epithelium. It accounts for the greatest number of deaths from malignancies of the female genital tract and is the fifth leading cause of cancer fatalities in women. It is predominantly a disease of older white women of northern European extraction, but it is seen in all ages and ethnic groups. Adenocarcinomas constitute the vast majority of ovarian carcinomas. The pattern of metastatic spread in ovarian carcinoma is similar regardless of the microscopic type. The most common sites of involvement are the contralateral ovary, peritoneal cavity, para-aortic and pelvic lymph nodes, and liver. Lung and pleura are the most common sites of extra-abdominal spread. The primary form of therapy is surgical. The overall prognosis of ovarian carcinoma remains poor, a direct result of its rapid growth rate and the lack of early symptoms. --2002" [NCIT:C4908]
comment: Editor note: unclear why this is distinct from malignant ovarian epithelial tumor in NCIT.
synonym: "carcinoma of ovary" EXACT [MONDO:patterns/carcinoma, NCIT:C4908]
synonym: "carcinoma of the ovary" EXACT [NCIT:C4908]
synonym: "epithelial ovarian cancer" EXACT [NCIT:C4908]
synonym: "ovarian cancer" EXACT [NCIT:C4908]
synonym: "ovarian carcinoma" EXACT [] {comment="Mondo preferred label 15.08.2021."}
synonym: "ovarian carcinoma" EXACT [DOID:4001, NCIT:C4908]
synonym: "ovarian carcinoma" EXACT [] {comment="preferred label from MONDO"}
synonym: "ovarian epithelial cancer" EXACT [NCIT:C4908]
synonym: "ovary carcinoma" EXACT [MONDO:patterns/location]
xref: DOID:4001 {source="EFO:0001075", source="MONDO:equivalentTo"}
xref: EFO:0001075 {source="DOID:4001", source="MONDO:equivalentTo"}
xref: MONDO:0005140
xref: NCIT:C4908 {source="DOID:4001", source="EFO:0001075", source="MONDO:equivalentTo"}
is_a: Orphanet:398934 {source="DOID:4001", source="MONDO:Redundant", source="MONDOLEX:0005140", source="NCIT:C4908"} ! Malignant epithelial tumor of ovary
relationship: EFO:0000784 UBERON:0000992 ! has_disease_location ovary
property_value: closeMatch http://linkedlifedata.com/resource/umls/id/C0677886
property_value: closeMatch http://linkedlifedata.com/resource/umls/id/C0677886
property_value: exactMatch DOID:4001
property_value: exactMatch DOID:4001
property_value: exactMatch NCIT:C4908
property_value: exactMatch NCIT:C4908
property_value: gwas:trait "true" xsd:string

That means I will be able to deliver fields 1-5 of your csv with

  1. Title -> name
  2. Term -> name
  3. Definition -> def
  4. Alias -> synonym

(lhs: fields in your csv; rhs: fields in the EFO)

  5. Links will be of the form:

(this also provides for how term IDs should be exported in the ontology columns that we request in the thread above).

  6. EFO/UBERON do not provide Latin equivalents for the terms, and I don't see a way of easily obtaining them otherwise. I will thus leave this field blank unless otherwise indicated.
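The field mapping above can be sketched in Python as follows. This is a simplified illustration, not a full OBO parser: it assumes each tag sits on one line, that `name` precedes the `synonym` lines, and that the text of interest is the first double-quoted string. The stanza is an abridged copy of the one shown above, and the function name `stanza_to_row` is hypothetical.

```python
import re

# Abridged version of the "ovarian carcinoma" stanza (definition shortened).
STANZA = '''[Term]
id: EFO:0001075
name: ovarian carcinoma
def: "A malignant neoplasm originating from the surface ovarian epithelium." [NCIT:C4908]
synonym: "carcinoma of ovary" EXACT [NCIT:C4908]
synonym: "ovarian cancer" EXACT [NCIT:C4908]
synonym: "ovarian carcinoma" EXACT [DOID:4001, NCIT:C4908]
'''

# First double-quoted string on a line, allowing backslash-escaped characters.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def stanza_to_row(stanza):
    """Map one [Term] stanza to the Title / Definition / Alias import fields."""
    row = {"ID": "", "Title": "", "Definition": "", "Alias": []}
    for line in stanza.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # e.g. the "[Term]" header line
        if tag == "id":
            row["ID"] = value.strip()
        elif tag == "name":
            row["Title"] = value.strip()
        elif tag in ("def", "synonym"):
            match = QUOTED.search(value)
            if not match:
                continue
            text = match.group(1)
            if tag == "def":
                row["Definition"] = text
            elif text != row["Title"] and text not in row["Alias"]:
                # Skip synonyms that repeat the title or an earlier synonym.
                row["Alias"].append(text)
    return row


row = stanza_to_row(STANZA)
print(row["Title"], row["ID"], row["Alias"])
```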
tosfos commented 2 years ago

We added the Latin column since we assumed it would be a nice feature given the site's medical nature. If it won't be used, we'll remove it. Please advise.

On second thought, you can ignore the Term column. It looks like it will always have the same data as the Page Name, so we can just calculate the lowercase of that field instead.

lgeistlinger commented 2 years ago

The Latin column does sound like a nice feature, but as EFO/UBERON do not provide Latin equivalents for the terms, and I don't see a way of easily obtaining them otherwise, I suggest just dropping that column.

I have accordingly prepared the csvs for:

I expect some hiccups and some manual work when matching those against the terms that are already on bugsigdb.org. The question is whether it's better to clean up offending terms before or after the import (ie those terms that are already on bugsigdb.org but don't have an entry in the import files above).

And just to keep in the back of our heads: EFO and UBERON update on a regular basis, ie we should start thinking at some point about how to automatically keep bugsigdb.org's glossary in sync with EFO/UBERON - but it is my understanding that this will be part of the postponed activities planned in #89.

tosfos commented 2 years ago

Thanks. We'll drop the latin column.

Does it make sense to first clean up some of the strange formatting like: 293 cell$;A-293 cell$;A293 cell$;HEK cell$;HEK293 cell$;human embryonal kidney cell$;human embryonic kidney cell

Or is that something we would do after the import?

> some manual work when matching those against the terms

If they are valid terms, it makes sense to just add these as aliases, and these can be discovered and fixed after the import. If they are not valid (like a misspelling), they should probably be cleaned up before the import.

> it is my understanding that this will be part of the postponed activities planned in #89.

Correct.

lgeistlinger commented 2 years ago

As described previously, the synonyms for a specific term can contain all kinds of special characters, so some improvisation was needed as to what to use for separating synonyms. This is technically a comma-separated list of synonyms for each term, but as commas and semicolons are partly contained in the synonyms, I used $; to separate individual synonyms within the list. Would this be a separator that you can process, or would you need a different separator (and if so, which one)?

lgeistlinger commented 2 years ago

I could potentially also just replace any existing commas in the synonyms, and provide a comma-separated list of synonyms for each term if that makes processing easier.

lgeistlinger commented 2 years ago

> If they are valid terms, it makes sense to just add these as aliases, and these can be discovered and fixed after the import. If they are not valid (like a misspelling), they should probably be cleaned up before the import.

@tosfos I have accordingly updated the csvs for:

This required action on the following items on bugsigdb.org for "condition", which were not a term title in the EFO but either:

(i) represented existing synonyms for an EFO term (eg. "race" for EFO term "ethnic group"),
(ii) were added as synonyms to appropriate EFO terms where indicated (eg. "sessile serrated adenoma" for EFO term "Colon Sessile Serrated Adenoma/Polyp"),
(iii) were added as independent condition terms not in EFO but in another ontology such as CHEBI (eg "medroxyprogesterone acetate"), or
(iv) were cleaned up on bugsigdb.org itself (eg. "irritable bowel sydrome" -> "irritable bowel syndrome").

"sessile serrated adenoma" "oral halitosis"
"sjogren's syndrome" "substance related disorder"
"titanium dioxide nanoparticles" "air pollution"
"crohn disease" "medroxyprogesterone acetate"
"cervical cerclage" "equol"
"irritable bowel sydrome" "urinary track infection"
"kidney stone"
"female reproductive organ cancer" "end stage renal disease"
"hiv/aids pre-exposure prophylaxis" "race"
"antibiotic"

No such actions were needed for body site / UBERON.
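The matching step described above can be sketched in Python as below. It is an illustration under assumptions, not the procedure actually used: matching is case-insensitive, the ontology is reduced to a hypothetical dictionary of main term name to synonyms, and the first main term that claims a synonym wins. The "race" / "ethnic group" pair and the misspelled "irritable bowel sydrome" are taken from this comment.

```python
def classify_terms(site_terms, ontology):
    """Label each site term as an ontology name, a synonym, or unmatched.

    ontology: dict mapping main term name -> list of synonyms.
    """
    names = {name.lower(): name for name in ontology}
    synonyms = {}
    for name, syns in ontology.items():
        for syn in syns:
            # Keep the first main term seen for a synonym.
            synonyms.setdefault(syn.lower(), name)

    result = {}
    for term in site_terms:
        key = term.strip().lower()
        if key in names:
            result[term] = ("name", names[key])
        elif key in synonyms:
            result[term] = ("synonym", synonyms[key])
        else:
            result[term] = ("unmatched", None)
    return result


# Tiny stand-in for the ontology; real input would come from the OBO release.
ontology = {"ethnic group": ["race"], "irritable bowel syndrome": []}
site_terms = ["race", "Irritable bowel syndrome", "irritable bowel sydrome"]

for term, status in classify_terms(site_terms, ontology).items():
    print(term, status)
```

Terms that come back as "unmatched" are the ones that need a manual decision: add as a synonym, add from another ontology, or correct on the site.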

> I could potentially also just replace any existing commas in the synonyms, and provide a comma-separated list of synonyms for each term if that makes processing easier.

I discarded this idea, as a quick check of synonyms in EFO revealed a total of 14,344 synonyms containing a comma. This includes numerous instances where replacing a comma would obstruct meaning as eg for chemical notations such as "(3S)-3-(4-hydroxyphenyl)-3,4-dihydro-2H-1-benzopyran-7-ol". I thus suggest going ahead with "$;" as separator in the alias/synonym field if that can be processed on your end.
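A short Python sketch of the "$;" convention, showing that synonyms containing commas round-trip unchanged. The helper names are hypothetical, and the guard against a synonym that itself contains "$;" is an added assumption rather than something verified against EFO.

```python
SEP = "$;"


def join_synonyms(synonyms):
    """Join synonyms into one alias field, refusing ambiguous input."""
    for synonym in synonyms:
        if SEP in synonym:
            raise ValueError(f"synonym contains the separator: {synonym!r}")
    return SEP.join(synonyms)


def split_synonyms(field):
    """Split an alias field back into its individual synonyms."""
    return field.split(SEP) if field else []


# Example values taken from this thread; grouping them together is illustrative.
synonyms = [
    "HEK293 cell",
    "human embryonic kidney cell",
    "(3S)-3-(4-hydroxyphenyl)-3,4-dihydro-2H-1-benzopyran-7-ol",
]

field = join_synonyms(synonyms)
print(field)
print(split_synonyms(field) == synonyms)
```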

Both files would thus be ready for import on my end.

lwaldron commented 2 years ago

Great, @lgeistlinger. From my perspective, #73 getting the dynamic home page onto the home page, followed closely by this issue, are our top priorities.

tosfos commented 2 years ago

We tested out the "$;" as a delimiter and it seems to work fine. We did a test import of some of these terms, and you can see 17 new terms as part of the Glossary.

Questions about the aliases.

  1. When using the form, it will only autocomplete based on the Condition list that is enumerated on the Condition property page. It will not offer any aliases. Is that an issue?
  2. Similarly, if an alias has been added as the value for one of these fields, the page will warn that it is not an allowed value for this property. Is that an issue?
  3. Is there any particular way that you're expecting the wiki to treat aliases? For example, I'm thinking that if a Condition is set to an alias, instead of semantically storing that alias as this Experiment's Condition/Body site, we could store the "main" glossary term that corresponds to that alias. We could also display the "main" term on the Experiment page, though that might be confusing when editing the page. (This would solve the issue mentioned in item 2.)

lwaldron commented 2 years ago

  1. That seems fine to me.
  2. That seems OK, although it would be nice to have a method for cleanup of the Condition property page. i.e. I noticed we currently have:
    • nicotine dependence
    • Nicotine dependence
    • non-alcoholic fatty liver
    • non-alcoholic fatty liver disease

     I noticed these by eye but it would be great to see any unknown Conditions here and be able to tell administrators / reviewers how to avoid entering or at least identify non-standard, ie non-ontology main terms.
  3. That sounds nice. In an ideal world it would not matter whether someone entered or searched for a term or its alias, because they would be equivalent but the term's page would belong to the main term. However I'm not even sure whether the aliases of distinct terms are always disjoint, if not, that may not be possible anyways. This is not critical but some redirecting of aliases to their "main" glossary term would be very nice.
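The redirect idea, including the open question of whether aliases of distinct terms are disjoint, can be sketched in Python like this. It is a hypothetical illustration: the glossary is a plain dictionary of main term to aliases, lookups are case-insensitive (which also folds "Nicotine dependence" into "nicotine dependence"), and "term A" / "term B" / "shared alias" are invented to show the ambiguous case.

```python
from collections import defaultdict


def build_alias_index(glossary):
    """Build a lookup from lowercased term or alias to its main term.

    Returns (index, ambiguous); ambiguous lists aliases claimed by more
    than one main term, which cannot be redirected automatically.
    """
    owners = defaultdict(set)
    for main, aliases in glossary.items():
        owners[main.lower()].add(main)
        for alias in aliases:
            owners[alias.lower()].add(main)
    index = {key: next(iter(mains)) for key, mains in owners.items() if len(mains) == 1}
    ambiguous = {key: sorted(mains) for key, mains in owners.items() if len(mains) > 1}
    return index, ambiguous


def resolve(value, index):
    """Return the main glossary term for a value, or None if unknown/ambiguous."""
    return index.get(value.strip().lower())


glossary = {
    "ethnic group": ["race"],
    "nicotine dependence": [],
    "term A": ["shared alias"],  # invented entries to show a non-disjoint alias
    "term B": ["shared alias"],
}

index, ambiguous = build_alias_index(glossary)
print(resolve("Race", index))
print(resolve("Nicotine dependence", index))
print(ambiguous)
```

Values that resolve to None are candidates for the clean-up pages: they are either unknown to the glossary or map to more than one main term.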
tosfos commented 2 years ago

> I noticed these by eye but it would be great to see any unknown Conditions here and be able to tell administrators / reviewers how to avoid entering or at least identify non-standard, ie non-ontology main terms.

Are you referring to terms that are neither "main" terms nor aliases? If so, we already track fields that are set improperly.

lgeistlinger commented 2 years ago

Hi @lwaldron any more comments on this from your side? Other than that, I think we can go ahead with the full import @tosfos. We are trying to make the Oct 25 deadline of the Bioconductor release schedule to have ontology columns properly exported and freeze a first release of BugSigDB. Thanks!

lgeistlinger commented 2 years ago

Just wanted to quickly check in here one more time as to whether the Oct 25 deadline is realistic for this. It's no problem at all if we don't make it; we would then introduce the changes somewhere along the way. But it would make for a nice clean introduction into the Bioc release if we were able to meet that date. Thanks!

lwaldron commented 2 years ago

I think we're ready to go.

tosfos commented 2 years ago

We should be able to make that date. I just need to know how to implement it. I'm not sure I understand your comment in item 2 above. See my question above.

> Are you referring to terms that are neither "main" terms nor aliases? If so, we already track fields that are set improperly.

lwaldron commented 2 years ago

I guess terms that we display and allow curators to enter should only be main terms? Aliases seem more useful for user searches.