Collecting here the preparations needed from our side (assigning to me):
Also @lgeistlinger:
Actually, maybe it would not be so hard to do after the fact - it seems too good to be true, and I guess this broad search & replace risks making replacements in description fields too, but look! https://bugsigdb.org/Special:ReplaceText. Still, it would also be reasonably straightforward to do in the spreadsheet or the munging script. By the way, I notice there are other relevant ontologies that include statistical methods, e.g. https://www.ebi.ac.uk/ols/ontologies/obcs and https://www.ebi.ac.uk/ols/ontologies/ncit. I'm unsure how to choose between these ontologies, so I will reach out to some authors to find out who would be willing to make updates based on the methods we see in the literature. If you're wondering why I care, it's because I imagine being able to weigh in on some of the methodological disputes in the field (e.g. compositional vs non-compositional-aware tests, parametric vs non-parametric testing) based on patterns we observe in signatures resulting from those testing procedures. Presumably, artifacts arising from non-compositional tests on compositional data should leave some kind of "signature" if the effect is substantial enough.
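For the spreadsheet/munging-script route, the replacement can be confined to the statistical-methods column so that description fields are never touched. A minimal sketch in Python/pandas; the file name, column name, and term mapping are hypothetical placeholders, not the actual BugSigDB schema:

```python
# Minimal sketch of a column-restricted search & replace, assuming the
# spreadsheet export is a CSV readable with pandas. "experiment.csv" and
# the "Statistical test" column are hypothetical placeholders.
import pandas as pd

# Map free-text method names to preferred (e.g. ontology) labels.
REPLACEMENTS = {
    "wilcoxon": "Wilcoxon signed-rank test",
    "kruskal wallis": "Kruskal-Wallis test",
}

df = pd.read_csv("experiment.csv")

col = "Statistical test"
# Normalize, look up a replacement, and fall back to the original value;
# only this one column is ever modified.
normalized = df[col].str.strip().str.lower().map(REPLACEMENTS)
df[col] = normalized.fillna(df[col])

df.to_csv("experiment_normalized.csv", index=False)
```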
Perhaps "Host Species" should be using ontology.
That is most likely the NCBI Taxonomy again (also available in ontology formats: http://www.obofoundry.org/ontology/ncbitaxon.html)
Perhaps statistical methods and study design should be using an ontology, e.g. https://www.ebi.ac.uk/ols/ontologies/stato, https://www.ebi.ac.uk/ols/ontologies/obcs, or https://www.ebi.ac.uk/ols/ontologies/ncit
The NCIT is the best-curated and most comprehensive among them, and it is well suited for representing study design.
For statistical tests, we have two issues:
1) We would first need to resolve the conflation of statistical tests (t-test, Wilcoxon test, ...) and computational tools (limma, edgeR, DESeq2, LEfSe, ...) that implement these tests. While all three ontologies represent statistical tests quite decently, none of them represents computational tools (nor is that their objective), although I found limma in the OBCS:
http://purl.obolibrary.org/obo/OBCS_0000168
2) For computational tools, this ontology seems to be a promising starting point:
https://www.ebi.ac.uk/ols/ontologies/swo (software ontology)
see e.g. limma: http://www.ebi.ac.uk/swo/SWO_0000593 and edgeR: http://www.ebi.ac.uk/swo/SWO_0000527,
but it does not cover the full range that we need (e.g. it misses LEfSe, DESeq2, ...).
It seems that, to use these ontologies effectively, there could be a couple of approaches:
What do you think? Since we don't have a lot of terms, we could certainly just go our own way without an ontology, but if there are collaborative ontology developers, it could be useful and bring in additional curation expertise from a different angle than we have. It could be especially good to cooperate with the NCI Thesaurus people, since this grant is funded by the NCI, so they may have an added interest in collaborating.
Going with the NCIT sounds promising and worth a try.
After some more consideration and revisiting your thoughts above:
If you're wondering why I care, it's because I imagine being able to weigh in on some of the methodological disputes in the field (e.g. compositional vs non-compositional-aware tests, parametric vs non-parametric testing) based on patterns we observe in signatures resulting from those testing procedures. Presumably, artifacts arising from non-compositional tests on compositional data should leave some kind of "signature" if the effect is substantial enough.
I think what we are really interested in is the statistical test conducted (e.g. Kruskal-Wallis for LEfSe), and the information that we get from the SWO (is an R package, is a Bioconductor package, ...) is thus not really useful.
NCIT or STATO seem better suited for that purpose, where I could imagine LEfSe being a child of
http://purl.obolibrary.org/obo/STATO_0000094
or
http://purl.obolibrary.org/obo/NCIT_C53248
allowing further summarization (non-parametric vs parametric) as you envisioned.
One approach that would thus work out of the box is to annotate the NCIT/STATO term that most closely matches the statistical test conducted, while I still see the value of stating exactly which tool was used, i.e. xx% of curated studies used DESeq2, xx% LEfSe, etc.
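Candidate terms for such annotations can be retrieved programmatically from the EBI OLS search API. A minimal Python sketch, assuming the requests library; the tool-to-test mapping here is illustrative only, not a settled annotation:

```python
# Minimal sketch: query the EBI OLS search API for statistical-test terms
# matching the test behind each tool. TOOL_TO_TEST is illustrative only.
import requests

TOOL_TO_TEST = {
    "lefse": "Kruskal-Wallis test",
    "limma": "moderated t-test",
}

def search_terms(query, ontology):
    """Return (obo_id, label) pairs for the top OLS search hits."""
    resp = requests.get(
        "https://www.ebi.ac.uk/ols/api/search",
        params={"q": query, "ontology": ontology},
        timeout=30,
    )
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [(d.get("obo_id"), d.get("label")) for d in docs[:3]]

for tool, test in TOOL_TO_TEST.items():
    print(tool, "->", search_terms(test, "stato"))
```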
@lwaldron With the start of the new semester and the new cohort of curators arriving, I think it's a good time to push for the transition from curation via Google Sheets to bugsigdb.org. If you want, we could meet at the beginning of next week and lay out the concrete steps needed to make this happen. Is there anything specific that would prevent us from switching right now?
The mapping between study and associated experiments is currently off, see e.g. https://bugsigdb.org/Study_99.
That's a study carried out in China on gastric adenocarcinoma, but we are seeing Experiment 217 annotated to it (which reports results for a study in the US on chronic kidney disease).
I guess that'll be fixed upon re-import?
I agree that it would be good to switch over to curating on bugsigdb.org. There is still potentially a fair bit of work to be done on the taxa and physiology data model(s), but those seem fairly tangential to our current curation activities. I think it could still take a while to finalize those aspects, and it would be good to start gaining experience in real-life use of the wiki for signature curation while, in the meantime, creating documentation and editing the front page for public release.
@tosfos what do you think of creating a "development" version of the wiki for significant developments like these, while we go ahead and start using the current version as the "release" version for signature curation?
I'd be ready for this, meaning I cleaned up the data on our side and re-created the study.csv, experiment.csv, and signature.csv files - this time also resetting the Experiment and Signature counters as discussed before (https://github.com/waldronlab/BugSigDB/issues/3#issue-585276017).
The files for the re-import are here. These now comprise 413 studies, 950 experiments, and 1604 signatures.
If possible, can you send the cleaned-up files next week (see https://github.com/waldronlab/BugSigDB/issues/38)? This will be the "dress rehearsal": once Ike confirms there are no issues with these files, we will schedule the permanent switch-over ASAP. Ike has another site release coming in late December, so it is better if we do ours sooner rather than later.
@tosfos @lwaldron:
We are now ready for re-importing the extended dataset and have told our curators to stop working on the spreadsheet and resume curation on the wiki once the data has been re-imported.
I cleaned up the data on our side and re-created the study.csv, experiment.csv, and signature.csv files - this time also resetting the Experiment and Signature counters as discussed before (https://github.com/waldronlab/BugSigDB/issues/3#issue-585276017).
The files for the re-import are here. These now comprise 425 studies, 972 experiments, and 1652 signatures.
@tosfos: please check whether these files look good or whether they require any modifications prior to re-import.
Please also note: if the revision column in the signature table is blank, the corresponding signature, experiment, and study should be imported as "needs review" (if the revision column is not blank, the triple signature-experiment-study should be considered reviewed).
@lwaldron: please check whether I am missing something.
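In code, this import rule could look roughly like the following (a sketch assuming pandas; the column names follow the discussion here and may differ in the actual files):

```python
# Sketch of the review-status rule for the bulk import, assuming pandas.
import pandas as pd

sigs = pd.read_csv("signature.csv")

def review_status(revision):
    # A blank "revision" cell means no second curator has signed off yet,
    # so the whole study-experiment-signature triple needs review.
    if isinstance(revision, str) and revision.strip():
        return "reviewed"
    return "needs review"

sigs["review status"] = sigs["revision"].apply(review_status)
```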
Thanks Ludwig! The only other thing I can think of is that the experiment columns "matched on" and "confounders controlled for" should now become type-ahead autocomplete fields.
@tosfos, note that the host species have changed, and are now NCBI species names.
@tosfos if you can test importing this data into the Wiki, @lgeistlinger will check the re-export, then we'll be ready to do the final switch-over.
Perfect :-)
And yes: https://github.com/waldronlab/BugSigDB/issues/38#issuecomment-738346450 re: "matched on" and "confounders controlled for"
Experiments:
1. For the Alpha Diversity fields, many are set to "unknown". Can you make them blank?
2. Can you change all YES to Yes and NO to No?
Signatures:
1. Can you change all "NA" fields to blank? (May not be relevant due to #2.)
2. Can you concatenate all the NCBI fields into one comma-separated field like "186801, k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Mogibacteriaceae, 990719, 31979"?
If this is any trouble at all, we can take care of it.
I guess that'll be fixed upon re-import?
We messed around with lots of the data as tests. After re-import everything should look better.
If this is any trouble at all, we can take care of it.
No problem, we will rework the files accordingly.
Should we assume that all pages in this CSV can be marked Complete?
Yes, but I'll check over them once more while making the requested small modifications and will report back.
what do you think of creating a "development" version of the wiki for significant development like these, while we go ahead and start using the current version as the "release" version for signatures curation?
That makes sense. Can you clarify what you mean by "current version"? Is that the version we're about to re-import?
Can you clarify what you mean by "current version"? Is that the version we're about to re-import?
@lwaldron can say more, but I think the idea here is to have a release version and a development version of the wiki. The release version is public and is used by the curators to add new signatures via the wiki after the re-import. The devel version is private and allows us to continue development and add new features to the wiki without being disruptive to the curators. Synchronization/merging of the release and development versions then only happens at defined intervals, e.g. every half a year, or once significant new features are introduced. Basically like two different branches on GitHub, such as a main branch and a new feature branch.
How will you handle the merge? Manually? Will any content be added to devel?
The signature has some curation columns (date, curation, revision), but I thought we got rid of manual entry for these in https://github.com/waldronlab/BugSigDB/issues/24 and switched to automatically calculating these columns.
if the revision column in the signature table is blank, the corresponding signature, experiment, and study should be imported as "needs review" (if the revision column is not blank, the triple signature-experiment-study should be considered reviewed)
Does "needs review" mean that it is reviewed as "incorrect" or just that there is no review yet?
For the pages that should be considered "reviewed" we need more data than just the fact that it was reviewed. When was it reviewed and by whom? See: https://bugsigdb.org/w/index.php?title=Review:Study_1/Experiment_1/Signature_1&action=formedit
Does "needs review" mean that it is reviewed as "incorrect" or just that there is no review yet?
The latter: there is no review yet.
When was it reviewed and by whom?
The column "revision" in the signatures.csv
file lists the name of the reviewer ( = by whom).
We didn't record the date of review so that could be either the date of curation (= column "date" in the signatures.csv
file) or the date of re-import (today?) by default.
The signature has some curation columns (date, curation, revision), but I thought we got rid of manual entry for these in #24 and switched to automatically calculating these columns.
I think we have to distinguish here between (i) how to fill in this information when bulk-importing signatures, as we are doing now, and (ii) how to handle it when signatures are entered manually/individually on the wiki, as referred to in #24.
Thanks for taking care of this Ludwig, I agree with all your answers. A devel wiki would make sense in the future if there's a need for us to test potentially disruptive developments in a sandbox-like environment before merging code changes (not curation) into the public site.
Experiments:
1. For the Alpha Diversity fields, many are set to "unknown". Can you make them blank?
2. Can you change all YES to Yes and NO to No?
Signatures:
1. Can you change all "NA" fields to blank? (May not be relevant due to #2.)
2. Can you concatenate all the NCBI fields into one comma-separated field like "186801, k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Mogibacteriaceae, 990719, 31979"?
If this is any trouble at all, we can take care of it.
@tosfos: I updated the files here as requested. Please check.
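The requested clean-up amounts to a few value replacements over the two tables. A minimal sketch in pandas; the alpha-diversity column matching is a guess at the real column names:

```python
# Sketch of the requested clean-up, assuming pandas. File names are from
# the thread; the alpha-diversity column matching is an assumption.
# keep_default_na=False keeps literal "NA" strings instead of parsing
# them as NaN on read.
import pandas as pd

exps = pd.read_csv("experiment.csv", keep_default_na=False)
sigs = pd.read_csv("signature.csv", keep_default_na=False)

# Experiments: blank out "unknown" alpha-diversity values...
alpha_cols = [c for c in exps.columns if "diversity" in c.lower()]
exps[alpha_cols] = exps[alpha_cols].replace("unknown", "")

# ...and normalize the all-caps YES/NO values.
exps = exps.replace({"YES": "Yes", "NO": "No"})

# Signatures: literal "NA" strings become blanks.
sigs = sigs.replace("NA", "")

exps.to_csv("experiment.csv", index=False)
sigs.to_csv("signature.csv", index=False)
```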
Should we assume that all pages in this CSV can be marked Complete?
Yes. I double-checked that.
Will do.
Allowing the "Revision editor" to be manually set for a bulk import like this is a bit tricky, but I think it should be OK.
Right now we are querying for who actually created or modified the page and displaying their user name. I think we can add in the manual entry too and then try to make sure we don't display the same person twice. For example, if the manually entered "Revision editor" then actually edits the page in the wiki, we'll make sure that they are only displayed once. We're already doing something similar to make sure that if the page creator then edits the page, they're not displayed twice.
Creating the "automatic review" pages for bulk-imported pages is also tricky but doable.
I'm a little confused about the "revision editor" issue and the "automatic review" pages for bulk-imported pages. But since this bulk import is a one-time thing, it does not seem critical how the editing history of the previous spreadsheet is handled. Even if "Revision editor" were initialized as pristine and all pages were set as "reviewed", it wouldn't seem like a terrible loss of information. So any partway step that is more straightforward, like just recording whether there is evidence that the signature was reviewed by a second curator or not, should also be adequate, I think.
We did the import (feel free to review) but there's a major issue that I didn't see until after the import. We're going to nuke everything again so any changes will be purged.
The issue is that the Experiment and Signature CSVs use a different numbering system for Experiments. On the Experiments CSV, the Experiment numbering resets when the Study number is incremented. On the Signature CSV, the Experiment numbering keeps incrementing even after the Study has changed.
I also found a few more issues with the CSV that we can correct easily, but it might as well be done on your side if convenient.
Experiments:
The two “16s variable region” columns: we need to remove the “V”; there should only be an integer. (Note that there are errors in this data - some of the rows have text in them. We'll import them as-is.)
Signatures:
The “Increased abundance…” column is titled internally as “Abundance in Group 1” and we need all the “Yes” values to be changed to “increased” and all the “No” values to be changed to “decreased”.
Also, can you convert the signatures that have pipes in them so that the pipes become semi-colons? I'm referring to data like: k__Bacteria|p__Firmicutes|c__Clostridia|...
Sure, we'll apply the requested modifications and will report back.
@tosfos: I updated the files here as requested. This includes resetting the experiment counter in the signature CSV as well, plus the small format modifications that you listed. Please check.
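The counter reset is essentially a per-study renumbering of the signature table. A sketch in pandas; the "Study", "Experiment", "Abundance in Group 1", and "NCBI" column names are assumptions about the actual files:

```python
# Sketch of resetting the experiment counter per study in signature.csv,
# plus the format fixes requested above. Column names are assumed.
import pandas as pd

sigs = pd.read_csv("signature.csv")

# Restart the experiment counter at 1 within each study, matching the
# numbering scheme already used in experiment.csv.
sigs["Experiment"] = (
    sigs.groupby("Study")["Experiment"]
    .transform(lambda e: e.rank(method="dense").astype(int))
)

# "Yes"/"No" -> "increased"/"decreased"; other values pass through.
sigs["Abundance in Group 1"] = sigs["Abundance in Group 1"].replace(
    {"Yes": "increased", "No": "decreased"}
)

# Pipe-delimited lineage strings become semicolon-delimited.
sigs["NCBI"] = sigs["NCBI"].str.replace("|", ";", regex=False)

sigs.to_csv("signature.csv", index=False)
```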
Should be all done!
We have to work on the Curator and Revision editor fields as I mentioned. Everything else looks OK. Please review and let me know.
If everything looks OK, feel free to start editing. The work we need to do on the editor fields won't be affected.
Looks great @tosfos! I'll do some more manual screening of the pages today and will follow up with some more systematic import-export checks in #41.
There seems to be a small hiccup with the import, as there are apparently a handful of experiments without a signature, which shouldn't be the case - I am looking into how we produce the CSV files to clarify.
I confirmed that the error is in the CSV files. The experiment CSV contains certain experiments that have no corresponding signatures in signature.csv.
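A quick way to catch such orphans before re-import is an anti-join between the two tables; a sketch assuming pandas and shared "Study"/"Experiment" key columns:

```python
# Sketch of a consistency check for experiments without any signature,
# assuming both tables share "Study" and "Experiment" key columns.
import pandas as pd

exps = pd.read_csv("experiment.csv")
sigs = pd.read_csv("signature.csv")

keys = ["Study", "Experiment"]
merged = exps.merge(
    sigs[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
orphans = merged[merged["_merge"] == "left_only"]

print(f"{len(orphans)} experiment(s) without a signature")
print(orphans[keys].to_string(index=False))
```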
Yes, I have fixed this and will provide updated CSV files in a bit. I'm currently doing some more checks on them. We could potentially fix this manually directly on the site for this handful of experiments - but maybe it's easiest to just re-import? What do you think?
@tosfos the updated CSV files are here, and I think it's easiest (at least from our side) if we just re-import those files. There was a misalignment between experiments and their corresponding signatures for the experiments listed above, as those had small inconsistencies in the free-form columns.
Done. Looks like there are 2 missing Experiments, but the Studies and Signatures are in there: https://bugsigdb.org/Help:Cleanup
Thanks! Looks like another corner case. I'll look into that.
Closing this and continuing the discussion under #41
After essential data entry implementation and testing, we will freeze data entry in the current spreadsheet, import all current data to bugsigdb.org, then resume all data entry there. We need to plan the process to let curators know when the downtime will be.
I expect more minor data entry issues to turn up once curators are using the site full-time, so we should either make sure there is some Y1 budget remaining or that we are ready to invoice Y2 after the switch-over.