Have you heard of RRIDs, perhaps having unique persistent ids would help?

open-science-promoters / reagentsio_website

making all scientific reagents easily and persistently identifiable, and described in a computer-readable way in the published research literature

http://reagents.io

MIT License


Have you heard of RRIDs, perhaps having unique persistent ids would help? #12

Closed bandrow closed 6 years ago

bandrow commented 6 years ago

RRIDs should be added to each reagent.

Example reagents:

Antibodies: authority = antibodyregistry.org; RRID pattern = RRID:AB_###; example = RRID:AB_90755; resolution service = n2t.net/RRID:AB_90755; resolution in XML = http://scicrunch.org/resolver/RRID:AB_90755.xml

Organisms: authority = varied; RRID pattern = varied; examples = RRID:FlyBase_FBst1014563, RRID:WB-STRAIN:LS3506, RRID:ZIRC_ZL11074.05; resolution service = same as above

Cell lines: authority = Cellosaurus; RRID pattern = RRID:CVCL_###; example = RRID:CVCL_QZ16; resolution service = same as above

Tools: authority = SciCrunch registry; RRID pattern = RRID:SCR_####; example = RRID:SCR_001905; resolution service = same as above
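As a rough illustration of the patterns listed above, here is a minimal Python sketch that guesses the resource type from an RRID and builds the two resolver URLs mentioned for the antibody example. The regular expressions are inferred only from the examples in this comment, the function names are hypothetical, and the `https://` scheme on the n2t.net URL is an assumption; none of this is an official RRID API.

```python
import re

# Patterns inferred from the examples in this thread only (an assumption);
# organism RRIDs vary by authority, so anything unmatched falls through.
PATTERNS = {
    "antibody": re.compile(r"^AB_\d+$"),
    "cell line": re.compile(r"^CVCL_[A-Z0-9]+$"),
    "tool": re.compile(r"^SCR_\d+$"),
}


def classify_rrid(rrid):
    """Guess the resource type of an RRID from its prefix pattern."""
    local_id = rrid[len("RRID:"):] if rrid.startswith("RRID:") else rrid
    for kind, pattern in PATTERNS.items():
        if pattern.match(local_id):
            return kind
    return "organism/other"


def resolver_urls(rrid):
    """Build the resolver URLs following the forms quoted in the thread."""
    if not rrid.startswith("RRID:"):
        raise ValueError("expected an identifier starting with 'RRID:'")
    return {
        "n2t": "https://n2t.net/" + rrid,  # scheme is assumed
        "xml": "http://scicrunch.org/resolver/" + rrid + ".xml",
    }


for example in ["RRID:AB_90755", "RRID:CVCL_QZ16", "RRID:SCR_001905",
                "RRID:FlyBase_FBst1014563"]:
    print(example, "->", classify_rrid(example))
```

The sketch makes no network calls; whether a given URL actually resolves has to be checked against the live services.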

jcolomb commented 6 years ago

Thanks a lot for your participation and welcome!

RRID has to be implemented, that is for sure. I am hoping to get an RRID specialist on the team; are you one? Is the RRID pattern sufficient to get all the stored information? Can we crawl the database to get more information automatically?

We need to be careful, though. For example, what data is linked to "RRID:FlyBase_FBst1014563"? There is no hit on SciCrunch: https://scicrunch.org/browse/search?query=FlyBase_FBst1014563&l=FlyBase_FBst1014563

On top of that, FBst numbers are not persistent identifiers (they disappear when the flies are discarded from stock centers): we need to get other more persistent information in the table...

bandrow commented 6 years ago

Did you try https://scicrunch.org/resolver/FlyBase_FBst1014563 or https://scicrunch.org/resolver/FlyBase_FBst1014563.xml?
I would not try the root scicrunch site, as this will usually give you issues.
The resource portal should resolve all identifiers, though we have trouble with some constructs, usually with ":" but in this case with "_" as well. I can fix that; in the meantime, the fragment or space version works: https://scicrunch.org/resources/Any/search?q=FBst1014563&l=FBst1014563
RRIDs are PUIDs. We are funded to work with the stock centers and back their catalogs; when they have stuff that is discontinued, it looks like this: https://scicrunch.org/resources/Any/search?q=discontinued&l=discontinued or https://scicrunch.org/resources/Any/search?q=%22not%20available%22&l=%22not%20available%22 I think that we need to unify these tags, as they are different across all types of resources... always more to do.

I envision this working as follows:

1. Someone comes in to register some resource with you.
2. You query our data to see if it exists:
   a. we bring back data with the RRID for your user;
   b. we give your user the option to register, for things we take (we can give you a list of things we will accept).
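The workflow above could be sketched roughly as follows in Python. Everything here is hypothetical: `known_resources` is a plain dictionary standing in for a query against the RRID data, `accepted_types` stands in for the "list of things we will accept", and the third outcome (`not_accepted`) is an assumption added so the function always returns something.

```python
def handle_registration(name, resource_type, known_resources, accepted_types):
    """Decide what to show a user who wants to register a resource."""
    # Step 2a: the resource already exists, so return its RRID to the user.
    record = known_resources.get(name.lower())
    if record is not None:
        return {"status": "found", "rrid": record["rrid"]}
    # Step 2b: unknown resource; offer registration only for accepted types.
    if resource_type in accepted_types:
        return {"status": "offer_registration", "rrid": None}
    # Assumed fallback: the type is not on the accepted list.
    return {"status": "not_accepted", "rrid": None}


# Stand-in data; the resource name is made up, the RRID is the thread's example.
known = {"example antibody": {"rrid": "RRID:AB_90755"}}
accepted = {"antibody", "cell line", "tool"}

print(handle_registration("Example antibody", "antibody", known, accepted))
print(handle_registration("New antibody", "antibody", known, accepted))
```

In a real integration the dictionary lookup would be replaced by a call to whatever search interface the RRID side exposes, which this thread does not specify.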

Thoughts?

-- All key biological entities deserve an #RRID! orcid.org/0000-0002-5497-0243

jcolomb commented 6 years ago

Happy to see someone with insider knowledge about RRID helping!

The RRID initiative is great; I have not yet had time to dig deep into what is saved in the database. However, I think that it can be as great as it wants, but as long as RRID numbers do not show up in publications, it will not reach 1% of its potential. That is the problem I would like to tackle with this project: my vision is that the tools we create will give scientists an easy way to use the RRIDs that already exist, not create new ones (but of course the goal of the project may change as the community grows behind it).

"they are different across all types of resources": that is also a problem I want to deal with. The strategy is to build one standard for each resource type, with its own vocabulary, and then transform these by machine into one standard available for all categories (probably using ontologies in the long term?).

I would also be happy to discuss use case scenarios. So far, I was only envisaging that people would come to transform their inventory into the new formats, getting reagent tables from the data they have. It would be interesting to get some additional data automatically by calling the RRID (or other open) databases.

From the links you wrote, it seems RRID is only saving fly stock name and FBst number, right?

bandrow commented 6 years ago

"but as long as RRID numbers do not show up in publications, it will not reach 1% of its potential." Agreed. Check out https://www.cell.com/star-methods and http://www.jneurosci.org/content/preparing-manuscript; also https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=RRID&btnG=

"From the links you wrote, it seems RRID is only saving fly stock name and FBst number, right?"

Not exactly sure what you mean; the RRID contains the metadata from FlyBase for each fly.

kyook commented 6 years ago

May I suggest RRIDs NOT be used for core Model Organism resources and reagents? Or at least include a MOD representative on your team that can mitigate instances where RRIDs might be useful for MOD communities?

RRIDs are of great service for antibodies, cell lines and any other resource reagent where there does not exist a body with authoritative nomenclature oversight.

However, for Model Organism communities supported by core Model Organism Databases (WormBase, FlyBase, ZFIN (zebrafish), SGD (yeast), MGI (mouse), and RGD (rat)), those databases should be the authoritative sources that get linked to and whose identifiers are used. They are the definitive source of the IDs that have been established over decades with the support of their communities, and hence are the end redirect for any identifiers; MODs take care of their own PUIs.

Further, with the alliance of these databases happening now (i.e. the Alliance of Genome Resources), nomenclature across the MO communities is receiving globally unique IDs, which renders moot any reason for the RRID effort to establish IDs for MO resources.

That said, there are a number of ways we could work in harmony with RRIDs and help each other out.

jcolomb commented 6 years ago

"include a MOD representative on your team": can you help with that? I would love to, and I have tried to contact some of the FlyBase people I know (without success so far).

I know enough of FlyBase to know that FBst numbers cannot be used (not persistent, the same line can have multiple FBst numbers, ...). From what I have seen, the RRID counterpart of FBst numbers cannot be used either (only little data available and not linked to other PUIs). By contrast, using only the list of FBxx numbers of each genetic element does tell what it is in a persistent way, but it gives no information about where it was bought (and we would like both pieces of information to end up in the reagent table, right?).

A combination of different identifiers might be the solution in the long term?

kyook commented 6 years ago

I think there is a lot of room to come up with a solution that satisfies everyone here, and this is a great time to start pointing out where things could be improved on the side of the MODs, especially in terms of globally unique identifiers and persistent IDs.

I think we all agree that IDs should not be reused and that they need to persist. I also agree that there should be a minimum amount of information about a reagent that satisfies experimental reproducibility. Having more than one identifier in a table is not an issue for me or, I would suspect, for my colleagues; in fact, I was planning to do that with my own article processing pipeline. However, it would be pretty important to have some quality control when assigning IDs and links.

Also, the MODs would welcome anything that helps them to identify objects in a paper that gets curated in their databases. Forgive me if you already know about this, but FlyBase and other MODs have come up with a reagent table in collaboration with Cell Journal: the STAR methods table. https://www.cell.com/pb-assets/journals/research/cell/methods/Methods%20Guide.pdf?code=cell-site

jcolomb commented 6 years ago

I have seen both the STAR "KEY RESOURCES TABLE" you mention and the ART (https://wiki.flybase.org/wiki/FlyBase:Author_Reagent_Table_(ART)) and was unhappy with both solutions.

Especially with the STAR one: the template is a .doc file (!!) and the table is published behind the paywall. One of my motivations for starting this project is to create a better standard before that kind of behaviour can spread to other publishers. I want that information to be openly available and computer readable (FAIR and open)!

I liked the ART approach (even if there are basic problems with the computer readability of the table), and I think it will work better and faster if developed in the open. In a perfect world, the table could indeed allow MODs to import most of the data they need automatically, and be very easy for researchers to produce.

bandrow commented 6 years ago

Hi Julien,

I know that the world is full of perfect ideas and imperfect implementations that exist, but before you create the perfect format for citing all reagents, please consider the following:

Is the information recoverable/findable with RRIDs or STAR? Would it be recoverable without them? (Vasilevsky et al. 2013 and we (Bandrowski et al. 2015) showed that ~50% of resources are findable without RRIDs and ~90% are findable with RRIDs, so a system that improves the situation by 80% is pretty good in my book.)
Is it possible to parse STAR? (~12 lines of Python should do it; if you want our code... happy to share.)
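To make the parsing question concrete, here is a minimal Python sketch that pulls RRIDs out of a tab-separated resources table. This is not the code offered in the comment above; the regular expression, the function name `extract_rrids`, the three column headers and the sample rows are all assumptions for illustration (the row names are made up, the RRIDs are the ones quoted in this thread).

```python
import csv
import io
import re

# Assumed pattern: "RRID:" followed by an identifier that starts with a letter
# and ends with a letter or digit, so trailing punctuation is not captured.
RRID_RE = re.compile(r"RRID:\s*([A-Za-z][A-Za-z0-9_:.\-]*[A-Za-z0-9])")


def extract_rrids(table_text, delimiter="\t"):
    """Return the unique RRIDs found in any cell, in order of appearance."""
    found = []
    for row in csv.reader(io.StringIO(table_text), delimiter=delimiter):
        for cell in row:
            for local_id in RRID_RE.findall(cell):
                rrid = "RRID:" + local_id
                if rrid not in found:
                    found.append(rrid)
    return found


# Made-up table rows for illustration only.
sample = (
    "REAGENT or RESOURCE\tSOURCE\tIDENTIFIER\n"
    "Example antibody\tExample vendor\tRRID:AB_90755\n"
    "Example fly stock\tBDSC\tRRID:BDSC_2740; FBst0002740\n"
    "Example worm strain\tExample source\tRRID:WB-STRAIN:LS3506.\n"
)

print(extract_rrids(sample))
```

A table embedded in a PDF or .doc file would first need to be converted to text, which is the harder part and is not covered here.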

Just my thoughts. anita

PS: The RRID system aggregates and backstops authorities, if needed, and does not create competing identifiers. The ID-assigning authorities are specific to each resource type and are the most authoritative for that resource. If there is no authority for a resource, we do not have RRIDs. Yes, we do need authorities, including MODs, on the advisory board! Would love to have you, Karen.

jcolomb commented 6 years ago

Hi Anita, very happy that you signed with your name, and very happy to see you discussing this thanks to this project.

About STAR: parsing badly designed tables that one has to extract from a PDF which is behind a paywall, and whose structure may change over time, is not ideal. If you consider that Elsevier is implementing it, you stop wondering why it is so badly designed and realise that it was designed to prevent machine readability: this will just create a way for that company to gather data they will be able to sell or use for their own profit. "When you have the choice, always go for quality raw data instead of trying to clean it on a later phase."

About RRID: it is clear that we will use RRIDs, but we need to clearly define, for each reagent, which PUI makes most sense. In some cases an RRID may be enough, in other cases it might be insufficient, and sometimes it may not even be necessary. Is that what you meant?

If we take the fruit fly example, let's take RRID:BDSC_2740: it seems the only information linked there is the name, l(3)84Fb1 red1 e4/TM3, Sb1 (actually the name uses superscripts instead of []), and the database (here BDSC). What I would need in order to answer the question "which other studies use this?" is a way to get the single genetic elements (for example red[1]: FBal0014517), which are actually in the FlyBase database, linked to a different identifier (which is, by the way, not persistent: FBst0002740). This gets even more complex when the same flies can be obtained from different stock centers and therefore get different RRIDs.

So do we agree that we should first develop standards that make sure we have all the necessary information, and then look at ways to reduce that to a few PUIs, and/or ways to get all PUIs without manual work?

bandrow commented 6 years ago

Yes, sure: when RRIDs make sense for your use case, they should be used.

The BDSC and all fly stock centers actually work closely with FlyBase, registering all of their data and making the names and identifiers consistent. So here, when someone uses a stock center number that is already mapped to an FBst ID, the information is recoverable; perhaps not totally simple, but recoverable. When the author uses a name, the fly information is less recoverable. So here we have won, because the author uses a number they are familiar with, which they can easily find in their lab, and the informatics expert has to do a little work but ultimately is not guessing about which fly this is. BDSC numbers are in FlyBase, the authority for flies. See http://flybase.org/reports/FBst0002740 for the BDSC fly that you are talking about.

The philosophy here is that, to fix our collective problems with reproducibility, we ask informatics experts to do real work and we ask authors to do work that makes a piece of information more recoverable. Together we can get there, but pushing off too much work onto one set of people is not good. The papers took years of work to complete; if the authors can report this work in such a way that it makes it possible for informatics experts to recover identifiers, that is wonderful. If we ask them to become informatics experts, we will fail.

Regards, anita

jcolomb commented 6 years ago

Yep, hence the idea that RRIDs (and/or all other important PUIs and information) get saved automatically upon purchase and can be re-exported when writing the paper, in a table that can be published independently (and curated automatically). Ideally, authors only have to update information in their stock list (new nickname, new crosses, ...) as they are doing now, and link the inventory to the experiments and protocols (via an ELN or manually). Then the table can be produced automatically.
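A rough Python sketch of that idea, under stated assumptions: the `InventoryEntry` structure, its field names, the `reagent_table` function and the CSV output format are all hypothetical, and the nicknames and experiment labels are made up. Only the genotype name and the identifiers of the first entry come from the BDSC example discussed in this thread.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    """One line of a lab stock list, with any identifiers saved at purchase."""
    nickname: str
    genotype: str
    identifiers: dict = field(default_factory=dict)  # e.g. {"RRID": "..."}
    used_in: list = field(default_factory=list)      # experiment labels


def reagent_table(entries, experiment):
    """Export the entries linked to one experiment as CSV text."""
    selected = [e for e in entries if experiment in e.used_in]
    # One column per identifier type present among the selected entries.
    id_keys = sorted({key for e in selected for key in e.identifiers})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["nickname", "genotype"] + id_keys)
    for e in selected:
        writer.writerow([e.nickname, e.genotype]
                        + [e.identifiers.get(key, "") for key in id_keys])
    return out.getvalue()


inventory = [
    InventoryEntry("balancer stock", "l(3)84Fb1 red1 e4/TM3, Sb1",
                   {"RRID": "RRID:BDSC_2740", "FBst": "FBst0002740"},
                   ["experiment-1"]),
    # A lab-made line with no stock-center identifier (made-up entry).
    InventoryEntry("new cross", "made in the lab", {}, ["experiment-2"]),
]

print(reagent_table(inventory, "experiment-1"))
```

Keeping identifiers in a free-form dictionary lets one entry carry several PUIs side by side, and lets lab-made lines have none.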

How can we get there?

PS: FBst numbers are not PUIs; they are erased when the flies are culled from stock centers.

jcolomb commented 6 years ago

And do not be too optimistic here; I have seen fly lab inventories:

BDSC numbers are not saved in all lab inventories (not often, actually).
About 30% of flies are BDSC flies; 70% are flies absent from stock centers (new mutants or new mutation combinations).
If the flies were culled from stock centers, the BDSC number can no longer be traced to the fly genotype.

bandrow commented 6 years ago

Yes that would be the absolutely ideal / amazing situation!

bandrow commented 6 years ago

I don't know about flies, but for antibodies the same is true.

One of the things I heard recently that was really encouraging: I was speaking to a group of researchers, and one of them said that their society journal now enforces RRIDs. Their lab had a lot of work to do when they were first asked to get this information together, but she said, "now we keep much better notes." This is really amazing to hear!

jcolomb commented 6 years ago

I am closing this discussion at this point, with the conclusion that we need to:

get RRID information into the table
get additional information if needed