waldronlab / BugSigDB

A microbial signatures database
https://bugsigdb.org

Never re-use Study IDs (was: Wrong experiment associated with study_567) #143

Open lwaldron opened 2 years ago

lwaldron commented 2 years ago

https://bugsigdb.org/Study_567 presents a Chinese study of colorectal adenoma, created recently by user Jeshudy on May 30, 2022. But Experiment 1 is from a Russian study on Parkinson's disease created by user Fcuevas3 on Nov 29, 2021. Experiment 2 seems to actually belong to Study 567, created on May 31 by Jeshudy.

lwaldron commented 2 years ago

This misplaced experiment, https://bugsigdb.org/Study_567/Experiment_1, seems to belong to the Russian study https://bugsigdb.org/Study_22. That study already has a reviewed Experiment 1 (https://bugsigdb.org/Study_22/Experiment_1) that seems correct and appears to describe the same real-life experiment as the misplaced one, only with some curation inconsistencies in the group naming and statistical test.

The misplaced https://bugsigdb.org/Study_567/Experiment_1 is marked as "incomplete" - I'll look for some other incomplete pages to see if that is related to being misplaced.

tosfos commented 2 years ago

Copying from my email:

It looks like the Study and Experiment were created together. Then someone deleted the Study (see here) without deleting the Experiment page, which is very easy to do. When a new Study page was then created, the Experiment 1 page was still in the wiki, which caused all the confusion.

We implemented a system to help find orphaned pages as happened here. You can see it here.
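As an illustration of what an orphan check involves, here is a minimal Python sketch. It is not the system mentioned above: `ancestors` and `find_orphans` are hypothetical helpers, and the sketch assumes only that subpage titles follow the `Study_N/Experiment_N/Signature_N` pattern visible in the URLs in this thread.

```python
def ancestors(title):
    """Return every ancestor title of a subpage, nearest the root first."""
    parts = title.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def find_orphans(titles):
    """Return the sorted titles that have at least one missing ancestor page."""
    existing = set(titles)
    return sorted(
        title
        for title in existing
        if any(parent not in existing for parent in ancestors(title))
    )


# Example: Study_567 was deleted but its Experiment_1 subtree was left behind.
pages = [
    "Study_22",
    "Study_22/Experiment_1",
    "Study_567/Experiment_1",
    "Study_567/Experiment_1/Signature_1",
]
print(find_orphans(pages))
```

In this example both leftover pages under `Study_567` are reported, because the top-level study page they hang from no longer exists.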

tosfos commented 2 years ago

Please let me know if we should clean up this particular Study.

lwaldron commented 2 years ago

Yes please, if you can. In other cases that don't mix correct and incorrect experiments, like #145, I think we could fix it on our own just by changing the PMID of the study.

tosfos commented 1 year ago

This can be closed

lwaldron commented 1 year ago

It looks like we can do the cleanup, but I’m not sure I understand how this signature can be listed as orphaned yet link back to what appears to be its parent. Does it have to do with it being marked as Incomplete?

https://bugsigdb.org/Study_438/Experiment_4/Signature_1

lwaldron commented 1 year ago

This seems to have happened again here for https://bugsigdb.org/Study_340. I feel like we need to lock down deleting studies because this is really hard to spot except by chance and I can't tell which user deleted this study. A new user created the new study, which is now associated with the wrong experiments. See https://github.com/waldronlab/BugSigDBcuration/issues/23

ftzohra22 commented 1 year ago

@lwaldron just to clarify I deleted the study associated with the experiment in https://bugsigdb.org/Study_340 to remove duplicates after double checking. We can perhaps find another way to remove duplicate studies since this is not ideal.

lwaldron commented 1 year ago

My understanding is that if you delete a study you need to either delete or move its experiments, or else they will be orphaned and then adopted by the next study anyone creates. Should these experiments be moved to another study or deleted @ftzohra22 ?

@tosfos what do you think about ensuring that new studies never reuse freed URLs? I.e., a new study number is always Nmax + 1, where Nmax is the highest study number previously used? Then visiting a deleted study URL could somehow indicate that this study has been deleted. This display of orphaned experiments with an unrelated study is bad.
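The Nmax + 1 rule proposed above can be sketched as follows. This is only an illustration in Python, not BugSigDB or MediaWiki code: `StudyRegistry`, its methods, and the PMID strings are hypothetical names and placeholder values.

```python
class StudyRegistry:
    """Toy registry that issues study numbers once and never reuses them."""

    def __init__(self):
        self.high_water = 0   # highest study number ever issued (Nmax)
        self.active = {}      # study number -> PMID (or None)
        self.deleted = set()  # tombstones for deleted study numbers

    def create(self, pmid=None):
        """Issue Nmax + 1, regardless of any gaps left by deletions."""
        self.high_water += 1
        self.active[self.high_water] = pmid
        return self.high_water

    def delete(self, number):
        """Remove a study but remember that its number was used."""
        del self.active[number]
        self.deleted.add(number)

    def status(self, number):
        """Report what a visitor to Study_<number> should be shown."""
        if number in self.active:
            return "active"
        if number in self.deleted:
            return "deleted"
        return "never created"


registry = StudyRegistry()
first = registry.create("placeholder-pmid-a")
second = registry.create("placeholder-pmid-b")
registry.delete(second)
third = registry.create("placeholder-pmid-c")
print(third, registry.status(second))
```

Because the counter only moves forward, the freed number 2 stays a tombstone and the next study becomes number 3, so leftover subpages of the deleted study cannot be picked up by a new one.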

lwaldron commented 1 year ago

Now there are no experiments associated with https://bugsigdb.org/Study_340. @ftzohra22 did you delete or move them? Can you provide a list of relevant edit, delete, or move actions you've done? I don't recognize any clues in the Study 340 history so it's really bewildering to me what has been happening.

ftzohra22 commented 1 year ago

These experiments needed to be completely deleted as they were just duplicates of another study @lwaldron (https://bugsigdb.org/Study_95). For some reason, I'm guessing the experiments originally didn't get cleared when I was removing duplicates. I deleted all the experiments in https://bugsigdb.org/Study_340 (deletion log: https://bugsigdb.org/Special:Log?type=delete&user=Fatima&page=&wpdate=&tagfilter=&subtype=) to avoid confusion for the new curator. They hadn't added any experiments yet.

lwaldron commented 1 year ago

I'm renaming this to "Never re-use Study IDs" which I think is a sufficient solution to ensure that this kind of mismatch never happens - can you confirm @tosfos ? This is from my above question, but in any case I'd like to do whatever we can to ensure that orphaned experiments are never displayed under an unrelated study.

what do you think about ensuring that new studies never reuse freed URLs? I.e., a new study number is always Nmax + 1, where Nmax is the highest study number previously used? Then visiting a deleted study URL could somehow indicate that this study has been deleted. This display of orphaned experiments with an unrelated study is bad.

lwaldron commented 1 year ago

I've confirmed that as soon as a study is deleted, its URL enters the front of the queue to be reused the next time a new study is created. I think this is bad behavior: the URL of a deleted study should never be reused during new study creation; it should just continue showing the "deleted study" message.

tosfos commented 1 year ago

Reserving Study numbers is not so simple. What do you think about using the PMID as a Study's unique ID? The URL of the study would be something like https://bugsigdb.org/30405010, but the Study's text title would be shown in most places (just like now). That would serve to ensure that nobody creates a study with the same PMID as an existing study. And it would also solve the issue you identified. Can we force all studies to have a PMID? If not, there might be a workaround for non-PMID studies.

Another thing we can think about is combining all Experiments and Signature pages directly onto the Study page instead of using subpages. There are advantages to having them separated out into subpages, but with the performance improvements, it's at least something we can think about. This would neatly keep all the Study/Experiment/Signature units together. Again, not a simple switch, and it can have significant ramifications that we need to think about, but if you have thoughts on it, let me know.

lwaldron commented 1 year ago

What do you think about using the PMID as a Study's unique ID?

It's so close to being a good solution - we have only a couple instances of desired non-PMID studies (https://bugsigdb.org/Study_562 and https://bugsigdb.org/Study_608), the rest have been mistakes by curators. Would it be possible to use DOI (although these contain a /), or the old-fashioned Study_N as a backup for non-PMID studies? If not, I would still be tempted by the PMID as unique key option. The other valid use, two different studies sharing a PMID because they have different study designs, is very rare and far outnumbered by mistaken duplication of a study instead of editing the existing one, so I think eliminating that use case would be a net benefit.

Another thing we can think about is combining all Experiments and Signature pages directly onto the Study page instead of using subpages.

This sounds maybe too drastic for the scope of the problem right now at least, but good to keep in mind as a possibility. BTW we've had a bunch of new curators start in the past week as part of an internship program, which is helping to identify issues, but it will probably calm down again in a couple weeks.

tosfos commented 1 year ago

I'm thinking a good compromise is to promote adding Studies by PMID only. We'll include instructions for how to add Studies by the old-fashioned Study_N (probably) for Studies that don't have a PMID. The IDs of these non-PMID Studies will still be reusable if they are deleted, but I assume that will be rare enough to be acceptable.

We'll create a new form for adding studies by PMID. Once that looks good, we'll write a script to rename all the existing Studies that include a PMID to the new PMID naming system.
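A rename script of this kind would first need a plan of which pages move where. The Python sketch below is only an assumption about how such a plan could be built, not the actual script: `plan_renames` is a hypothetical helper, the new title is assumed to be the bare PMID (as in the example URL earlier in this thread), and the study titles and PMIDs in the example are made up.

```python
from collections import defaultdict


def plan_renames(studies):
    """Split (title, pmid) pairs into renames, unchanged pages, and conflicts.

    renames:   old title -> new PMID-based title (PMID used by one study only)
    unchanged: titles of studies without a PMID, which keep Study_N naming
    conflicts: PMID -> titles, when several studies share the same PMID
    """
    by_pmid = defaultdict(list)
    unchanged = []
    for title, pmid in studies:
        if pmid:
            by_pmid[pmid].append(title)
        else:
            unchanged.append(title)

    renames, conflicts = {}, {}
    for pmid, titles in by_pmid.items():
        if len(titles) == 1:
            renames[titles[0]] = pmid
        else:
            # Needs a curator decision before anything is renamed.
            conflicts[pmid] = sorted(titles)
    return renames, unchanged, conflicts


# Made-up example data: titles and PMIDs are placeholders, not real records.
studies = [
    ("Study_A", "10000001"),
    ("Study_B", "10000002"),
    ("Study_C", "10000002"),  # accidental duplicate of Study_B
    ("Study_D", None),        # study without a PMID
]
renames, unchanged, conflicts = plan_renames(studies)
print(renames, unchanged, conflicts)
```

Separating out the conflicts matters because a PMID-based title can only belong to one page, so studies that share a PMID have to be merged or resolved by hand before the rename runs.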

lwaldron commented 1 year ago

That sounds like an excellent compromise - normal users should have to intentionally go out of their way to create a study without a PMID. Except in the two unusual examples I gave, every instance so far of a user adding a non-PMID study has been an error (and there have been a couple dozen of those errors). When a study has a PMID, URI and DOI are unnecessary and a waste of time for users who think they need to provide them.

lwaldron commented 1 year ago

Reserving Study numbers is not so simple.

Just wondering if, rather than reserving study numbers, it would be straightforward to add a rule that creating a new study always uses N+1, where N is the largest existing study number? The smaller study IDs are kind of handy and easy to remember, but not worth a lot of effort and I do want to eliminate reusing study IDs for different studies. Otherwise if we switch to PMID study numbers let's not change existing pages because there are already a lot of links to them.

tosfos commented 1 year ago

The smaller study IDs are kind of handy and easy to remember, but not worth a lot of effort and I do want to eliminate reusing study IDs for different studies.

It's a bit complicated, because we have to check for Studies that previously existed and then were deleted later so that we don't reuse them. We'll have to check how much effort is involved.

Otherwise if we switch to PMID study numbers let's not change existing pages because there are already a lot of links to them.

I don't think this is a major concern. When we rename pages, it automatically leaves a redirect behind.

lwaldron commented 1 year ago

It's a bit complicated, because we have to check for Studies that previously existed and then were deleted later so that we don't reuse them. We'll have to check how much effort is involved.

Is this only to protect against the corner case of a Study being created and then deleted before another Study with a larger number is added? That doesn't seem like an important corner case to me. What I don't like are 1) the problem of orphaned sub-pages getting assigned to a new Study and 2) having a study page that used to refer to one publication later refer to a different publication instead of just showing a "deleted page" message. Using PMID, with a slightly hidden option for non-PMID studies, does seem like a completely feasible solution, at the cost of only a very slight loss of the convenience of the smaller, more memorable study numbers we have now.

tosfos commented 1 year ago

Is this only to protect against the corner case of a Study being created and then deleted before another Study with a larger number is added?

Not just that case. I think it will happen if any Study is deleted (not just the most recent one). It will become the next number used when someone tries to add a new Study and that will cause the problems you mentioned:

the problem of orphaned sub-pages getting assigned to a new Study

having a study page that used to refer to one publication later refer to a different publication instead of just showing a "deleted page" message.

We'll have to figure out the best option.
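To make the difference between the numbering rules discussed in this thread concrete, here is a minimal Python sketch. All three function names are hypothetical, and `first_free` is only a rough model of the reuse behavior described above, not the wiki's actual implementation.

```python
def first_free(existing):
    """Lowest positive number not in use: a deleted number is reused at once."""
    n = 1
    while n in existing:
        n += 1
    return n


def max_existing_plus_one(existing):
    """N + 1 over existing studies only: reuses a deleted highest number."""
    return max(existing, default=0) + 1


def max_ever_plus_one(existing, deleted):
    """N + 1 over existing and deleted studies: never reuses a number."""
    return max(existing | deleted, default=0) + 1


# Case 1: a study in the middle of the range was deleted.
existing, deleted = {1, 2, 3, 5}, {4}
print(first_free(existing),
      max_existing_plus_one(existing),
      max_ever_plus_one(existing, deleted))

# Case 2: the most recently created study was deleted.
existing, deleted = {1, 2, 3, 4}, {5}
print(first_free(existing),
      max_existing_plus_one(existing),
      max_ever_plus_one(existing, deleted))
```

With studies 1 to 4 existing and study 5 deleted, only the third rule avoids reusing 5, which is why the deletion history has to be consulted to rule out reuse completely.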