Collecting here the preparations needed from our side (assigning to me):
Also @lgeistlinger:
Actually, maybe it would not be so hard to do after the fact - it seems too good to be true, and I guess this broad search & replace risks making replacements in description fields too, but look! https://bugsigdb.org/Special:ReplaceText. Still, it would also be reasonably straightforward to do in the spreadsheet or the munging script. By the way, I notice there are other relevant ontologies that include statistical methods, e.g. https://www.ebi.ac.uk/ols/ontologies/obcs and https://www.ebi.ac.uk/ols/ontologies/ncit. I'm unsure how to choose between these ontologies, so I will reach out to some authors to find out who would be willing to make updates based on the methods we see in the literature. If you're wondering why I care, it's because I imagine being able to weigh in on some of the methodological disputes in the field (e.g. compositional vs non-compositional-aware tests, parametric vs non-parametric testing) based on patterns we observe in signatures resulting from those testing procedures. Presumably, artifacts arising from non-compositional tests on compositional data should leave some kind of "signature" if the effect is substantial enough.
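For the spreadsheet/munging-script route, the replacement can be confined to the statistical-methods column so that description fields are never touched. A minimal sketch in Python/pandas; the file name, column name, and term mapping are hypothetical placeholders, not the actual BugSigDB schema:

```python
# Minimal sketch of a column-restricted search & replace, assuming the
# spreadsheet export is a CSV readable with pandas. "experiment.csv" and
# the "Statistical test" column are hypothetical placeholders.
import pandas as pd

# Map free-text method names to preferred (e.g. ontology) labels.
REPLACEMENTS = {
    "wilcoxon": "Wilcoxon signed-rank test",
    "kruskal wallis": "Kruskal-Wallis test",
}

df = pd.read_csv("experiment.csv")

col = "Statistical test"
# Normalize, look up a replacement, and fall back to the original value;
# only this one column is ever modified.
normalized = df[col].str.strip().str.lower().map(REPLACEMENTS)
df[col] = normalized.fillna(df[col])

df.to_csv("experiment_normalized.csv", index=False)
```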
Perhaps "Host Species" should be using ontology.
That is most likely the NCBI Taxonomy again (also available in ontology formats: http://www.obofoundry.org/ontology/ncbitaxon.html)
Perhaps statistical methods and study design should be using an ontology, e.g. https://www.ebi.ac.uk/ols/ontologies/stato, https://www.ebi.ac.uk/ols/ontologies/obcs, or https://www.ebi.ac.uk/ols/ontologies/ncit
The NCIT is the best-curated and most comprehensive among them, and it is well suited for representing study design.
For statistical tests, we have two issues:
1) We would first need to resolve the conflation of statistical tests (t-test, Wilcoxon test, ...) and computational tools (limma, edgeR, DESeq2, LEfSe, ...) that implement these tests. While all three ontologies represent statistical tests quite decently, none of them represents computational tools (nor is that their objective), although I found limma in the OBCS:
http://purl.obolibrary.org/obo/OBCS_0000168
2) For computational tools, this ontology seems to be a promising starting point:
https://www.ebi.ac.uk/ols/ontologies/swo (software ontology)
see e.g. limma: http://www.ebi.ac.uk/swo/SWO_0000593 and edgeR: http://www.ebi.ac.uk/swo/SWO_0000527,
but it does not cover the full range that we need (e.g. it misses LEfSe, DESeq2, ...).
It seems that, to use these ontologies effectively, there could be a couple of approaches:
What do you think? Since we don't have a lot of terms, we could certainly just go our own way without an ontology, but if there are collaborative ontology developers, it could be useful and bring in additional curation expertise from a different angle than we have. It could be especially good to cooperate with the NCI Thesaurus people, since this grant is funded by the NCI, so they may have an added interest in collaborating.
Going with the NCIT sounds promising and worth a try.
After some more consideration and revisiting your thoughts above:
If you're wondering why I care, it's because I imagine being able to weigh in on some of the methodological disputes in the field (e.g. compositional vs non-compositional-aware tests, parametric vs non-parametric testing) based on patterns we observe in signatures resulting from those testing procedures. Presumably, artifacts arising from non-compositional tests on compositional data should leave some kind of "signature" if the effect is substantial enough.
I think what we are really interested in is the statistical test conducted (e.g. Kruskal-Wallis for LEfSe), and the information that we get from the SWO (is an R package, is a Bioconductor package, ...) is thus not really useful.
NCIT or STATO seem better suited for that purpose, where I could imagine LEfSe being a child of
http://purl.obolibrary.org/obo/STATO_0000094
or
http://purl.obolibrary.org/obo/NCIT_C53248
allowing further summarization (non-parametric vs parametric) as you envisioned.
One approach that would thus work out of the box is to annotate the NCIT/STATO term that most closely matches the statistical test conducted, while I still see the value of stating exactly which tool was used, i.e. xx% of curated studies used DESeq2, xx% LEfSe, etc.
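Candidate terms for such annotations can be retrieved programmatically from the EBI OLS search API. A minimal Python sketch, assuming the requests library; the tool-to-test mapping here is illustrative only, not a settled annotation:

```python
# Minimal sketch: query the EBI OLS search API for statistical-test terms
# matching the test behind each tool. TOOL_TO_TEST is illustrative only.
import requests

TOOL_TO_TEST = {
    "lefse": "Kruskal-Wallis test",
    "limma": "moderated t-test",
}

def search_terms(query, ontology):
    """Return (obo_id, label) pairs for the top OLS search hits."""
    resp = requests.get(
        "https://www.ebi.ac.uk/ols/api/search",
        params={"q": query, "ontology": ontology},
        timeout=30,
    )
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [(d.get("obo_id"), d.get("label")) for d in docs[:3]]

for tool, test in TOOL_TO_TEST.items():
    print(tool, "->", search_terms(test, "stato"))
```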
@lwaldron With the start of the new semester and the new cohort of curators arriving, I think it's a good time to push for the transition from curation via Google Sheets to bugsigdb.org. If you want, we could meet at the beginning of next week and lay out the concrete steps needed to make this happen. Is there anything specific that would prevent us from switching right now?
The mapping between study and associated experiments is currently off, see e.g. https://bugsigdb.org/Study_99.
That's a study carried out in China on gastric adenocarcinoma, but we are seeing Experiment 217 annotated to it (which reports results for a study in the US on chronic kidney disease).
I guess that'll be fixed upon re-import?
I agree that it would be good to switch over to curating on bugsigdb.org. There is still potentially a fair bit of work to be done on the taxa and physiology data model(s), but those seem fairly tangential to our current curation activities. I think it could still take a while to finalize those aspects, and it would be good to start gaining experience in real-life use of the wiki for signature curation while, in the meantime, creating documentation and editing the front page for public release.
@tosfos what do you think of creating a "development" version of the wiki for significant developments like these, while we go ahead and start using the current version as the "release" version for signature curation?
I'd be ready for this, meaning I cleaned up the data on our side and re-created the study.csv, experiment.csv, and signature.csv files - this time also resetting the Experiment and Signature counters as discussed before (https://github.com/waldronlab/BugSigDB/issues/3#issue-585276017).
The files for the re-import are here. These now comprise 413 studies, 950 experiments, and 1604 signatures.
If possible, can you send the cleaned-up files next week (see https://github.com/waldronlab/BugSigDB/issues/38)? This will be the "dress rehearsal": once Ike confirms there are no issues with these files, we will schedule the permanent switch-over ASAP. Ike has another site release coming in late December, so it is better if we do ours sooner rather than later.
@tosfos @lwaldron:
We are now ready for re-importing the extended dataset and have told our curators to stop working on the spreadsheet and resume curation on the wiki once the data has been re-imported.
I cleaned up the data on our side and re-created the study.csv, experiment.csv, and signature.csv files - this time also resetting the Experiment and Signature counters as discussed before (https://github.com/waldronlab/BugSigDB/issues/3#issue-585276017).
The files for the re-import are here. These now comprise 425 studies, 972 experiments, and 1652 signatures.
@tosfos: please check whether these files look good or whether they require any modifications prior to re-import.
Please also note: if the revision column in the signature table is blank, the corresponding signature, experiment, and study should be imported as "needs review" (if the revision column is not blank, the triple signature-experiment-study should be considered reviewed).
@lwaldron: please check whether I am missing something.
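In code, this import rule could look roughly like the following (a sketch assuming pandas; the column names follow the discussion here and may differ in the actual files):

```python
# Sketch of the review-status rule for the bulk import, assuming pandas.
import pandas as pd

sigs = pd.read_csv("signature.csv")

def review_status(revision):
    # A blank "revision" cell means no second curator has signed off yet,
    # so the whole study-experiment-signature triple needs review.
    if isinstance(revision, str) and revision.strip():
        return "reviewed"
    return "needs review"

sigs["review status"] = sigs["revision"].apply(review_status)
```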
Thanks Ludwig! The only other thing I can think of is that the experiment columns "matched on" and "confounders controlled for" should now become type-ahead autocomplete fields.
@tosfos, note that the host species have changed, and are now NCBI species names.
@tosfos if you can test importing this data into the Wiki, @lgeistlinger will check the re-export, then we'll be ready to do the final switch-over.
Perfect :-)
And yes: https://github.com/waldronlab/BugSigDB/issues/38#issuecomment-738346450 re: "matched on" and "confounders controlled for"
Experiments:
1. For the Alpha Diversity fields, many are set to "unknown". Can you make them blank?
2. Can you change all YES to Yes and NO to No?
Signatures:
1. Can you change all "NA" fields to blank? (May not be relevant due to #2.)
2. Can you concatenate all the NCBI fields into one comma-separated field like "186801, k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Mogibacteriaceae, 990719, 31979"?
If this is any trouble at all, we can take care of it.
I guess that'll be fixed upon re-import?
We messed around with lots of the data as tests. After re-import everything should look better.
If this is any trouble at all, we can take care of it.
No problem, we will rework the files accordingly.
Should we assume that all pages in this CSV can be marked Complete?
Yes, but I'll check over them once more while making the requested small modifications and will report back.
what do you think of creating a "development" version of the wiki for significant development like these, while we go ahead and start using the current version as the "release" version for signatures curation?
That makes sense. Can you clarify what you mean by "current version"? Is that the version we're about to re-import?
Can you clarify what you mean by "current version"? Is that the version we're about to re-import?
@lwaldron can say more, but I think the idea here is to have a release version and a development version of the wiki. The release version is public and is used by the curators to add new signatures via the wiki after the re-import. The devel version is private and allows us to continue development and add new features to the wiki without being disruptive to the curators. Synchronization/merging of the release and development versions then only happens at defined intervals, e.g. every half a year, or once significant new features are introduced. Basically like two different branches on GitHub, such as a main branch and a new feature branch.
How will you handle the merge? Manually? Will any content be added to devel?
The signature has some curation columns (date, curation, revision), but I thought we got rid of manual entry for these in https://github.com/waldronlab/BugSigDB/issues/24 and switched to automatically calculating these columns.
if the revision column in the signature table is blank, the corresponding signature, experiment, and study should be imported as "needs review" (if the revision column is not blank, the triple signature-experiment-study should be considered reviewed)
Does "needs review" mean that it is reviewed as "incorrect" or just that there is no review yet?
For the pages that should be considered "reviewed" we need more data than just the fact that it was reviewed. When was it reviewed and by whom? See: https://bugsigdb.org/w/index.php?title=Review:Study_1/Experiment_1/Signature_1&action=formedit
Does "needs review" mean that it is reviewed as "incorrect" or just that there is no review yet?
The latter: there is no review yet.
When was it reviewed and by whom?
The column "revision" in the signatures.csv
file lists the name of the reviewer ( = by whom).
We didn't record the date of review so that could be either the date of curation (= column "date" in the signatures.csv
file) or the date of re-import (today?) by default.
The signature has some curation columns (date, curation, revision), but I thought we got rid of manual entry for these in #24 and switched to automatically calculating these columns.
I think we have to distinguish here between (i) how to fill in this information when bulk-importing signatures, as we are doing now, and (ii) how to handle it when signatures are entered manually/individually on the wiki, as referred to in #24.
Thanks for taking care of this Ludwig, I agree with all your answers. A devel wiki would make sense in the future if there's a need for us to test potentially disruptive developments in a sandbox-like environment before merging code changes (not curation) into the public site.
Experiments:
1. For the Alpha Diversity fields, many are set to "unknown". Can you make them blank?
2. Can you change all YES to Yes and NO to No?
Signatures:
1. Can you change all "NA" fields to blank? (May not be relevant due to #2.)
2. Can you concatenate all the NCBI fields into one comma-separated field like "186801, k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Mogibacteriaceae, 990719, 31979"?
If this is any trouble at all, we can take care of it.
@tosfos: I updated the files here as requested. Please check.
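The requested clean-up amounts to a few value replacements over the two tables. A minimal sketch in pandas; the alpha-diversity column matching is a guess at the real column names:

```python
# Sketch of the requested clean-up, assuming pandas. File names are from
# the thread; the alpha-diversity column matching is an assumption.
# keep_default_na=False keeps literal "NA" strings instead of parsing
# them as NaN on read.
import pandas as pd

exps = pd.read_csv("experiment.csv", keep_default_na=False)
sigs = pd.read_csv("signature.csv", keep_default_na=False)

# Experiments: blank out "unknown" alpha-diversity values...
alpha_cols = [c for c in exps.columns if "diversity" in c.lower()]
exps[alpha_cols] = exps[alpha_cols].replace("unknown", "")

# ...and normalize the all-caps YES/NO values.
exps = exps.replace({"YES": "Yes", "NO": "No"})

# Signatures: literal "NA" strings become blanks.
sigs = sigs.replace("NA", "")

exps.to_csv("experiment.csv", index=False)
sigs.to_csv("signature.csv", index=False)
```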
Should we assume that all pages in this CSV can be marked Complete?
Yes. I double-checked that.
Will do.
Allowing the "Revision editor" to be manually set for a bulk import like this is a bit tricky, but I think it should be OK.
Right now we are querying for who actually created or modified the page and displaying their user name. I think we can add in the manual entry too and then try to make sure we don't display the same person twice. For example, if the manually entered "Revision editor" then actually edits the page in the wiki, we'll make sure that they are only displayed once. We're already doing something similar to make sure that if the page creator then edits the page, they're not displayed twice.
Creating the "automatic review" pages for bulk-imported pages is also tricky but doable.
I'm a little confused about the "revision editor" issue and the "automatic review" pages for bulk-imported pages. But since this bulk import is a one-time thing, it does not seem critical how the editing history of the previous spreadsheet is handled. Even if "Revision editor" were initialized as pristine and all pages were set as "reviewed", it wouldn't seem like a terrible loss of information. So any partway step that is more straightforward, like just recording whether there is evidence that the signature was reviewed by a second curator or not, should also be adequate, I think.
We did the import (feel free to review) but there's a major issue that I didn't see until after the import. We're going to nuke everything again so any changes will be purged.
The issue is that the Experiment and Signature CSVs use a different numbering system for Experiments. On the Experiments CSV, the Experiment numbering resets when the Study number is incremented. On the Signature CSV, the Experiment numbering keeps incrementing even after the Study has changed.
I also found a few more issues with the CSV that we can correct easily, but it might as well be done on your side if convenient.
Experiments:
The two “16s variable region” columns: we need to remove the “V”; there should only be an integer. (Note that there are errors in this data - some of the rows have text in them. We'll import them as-is.)
Signatures:
The “Increased abundance…” column is titled internally as “Abundance in Group 1” and we need all the “Yes” values to be changed to “increased” and all the “No” values to be changed to “decreased”.
Also, can you convert the signatures that have pipes in them so that the pipes become semi-colons? I'm referring to data like: k__Bacteria|p__Firmicutes|c__Clostridia|...
Sure, we'll apply the requested modifications and will report back.
@tosfos: I updated the files here as requested. This includes resetting the experiment counter in the signature CSV as well, plus the small format modifications that you listed. Please check.
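The counter reset is essentially a per-study renumbering of the signature table. A sketch in pandas; the "Study", "Experiment", "Abundance in Group 1", and "NCBI" column names are assumptions about the actual files:

```python
# Sketch of resetting the experiment counter per study in signature.csv,
# plus the format fixes requested above. Column names are assumed.
import pandas as pd

sigs = pd.read_csv("signature.csv")

# Restart the experiment counter at 1 within each study, matching the
# numbering scheme already used in experiment.csv.
sigs["Experiment"] = (
    sigs.groupby("Study")["Experiment"]
    .transform(lambda e: e.rank(method="dense").astype(int))
)

# "Yes"/"No" -> "increased"/"decreased"; other values pass through.
sigs["Abundance in Group 1"] = sigs["Abundance in Group 1"].replace(
    {"Yes": "increased", "No": "decreased"}
)

# Pipe-delimited lineage strings become semicolon-delimited.
sigs["NCBI"] = sigs["NCBI"].str.replace("|", ";", regex=False)

sigs.to_csv("signature.csv", index=False)
```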
Should be all done!
We have to work on the Curator and Revision editor fields as I mentioned. Everything else looks OK. Please review and let me know.
If everything looks OK, feel free to start editing. The work we need to do on the editor fields won't be affected.
Looks great @tosfos! I'll do some more manual screening of the pages today and will follow up with some more systematic import-export checks in #41.
There seems to be a small hiccup with the import, as there are apparently a handful of experiments without a signature, which shouldn't be the case - I am looking into how we produce the CSV files to clarify.
I confirmed that the error is in the CSV files. The experiment CSV contains certain experiments that have no corresponding signatures in signature.csv.
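A quick way to catch such orphans before re-import is an anti-join between the two tables; a sketch assuming pandas and shared "Study"/"Experiment" key columns:

```python
# Sketch of a consistency check for experiments without any signature,
# assuming both tables share "Study" and "Experiment" key columns.
import pandas as pd

exps = pd.read_csv("experiment.csv")
sigs = pd.read_csv("signature.csv")

keys = ["Study", "Experiment"]
merged = exps.merge(
    sigs[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
orphans = merged[merged["_merge"] == "left_only"]

print(f"{len(orphans)} experiment(s) without a signature")
print(orphans[keys].to_string(index=False))
```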
Yes, I have fixed this and will provide updated CSV files in a bit. I'm currently doing some more checks on them. We could potentially fix this manually directly on the site for this handful of experiments - but maybe it's easiest to just re-import? What do you think?
@tosfos the updated CSV files are here, and I think it's easiest (at least from our side) if we just re-import those files. There was a misalignment between experiments and their corresponding signatures for the experiments listed above, as those had small inconsistencies in the free-form columns.
Done. Looks like there are 2 missing Experiments, but the Studies and Signatures are in there: https://bugsigdb.org/Help:Cleanup
Thanks! Looks like another corner case. I'll look into that.
Closing this and continuing the discussion under #41
After essential data entry implementation and testing, we will freeze data entry in the current spreadsheet, import all current data to bugsigdb.org, then resume all data entry there. We need to plan the process to let curators know when the downtime will be.
I expect more minor data entry issues to turn up once curators are using the site full-time, so we should either make sure there is some Y1 budget remaining or that we are ready to invoice Y2 after the switch-over.