glygener / glygen.cfde.generator

Java program for the generation of CFDE metadata files from GlyGen data.
GNU General Public License v3.0

GlyGen FAIRness Assessment #23

Closed jeet-vora closed 2 years ago

jeet-vora commented 2 years ago

Hi Avi, We recently performed GlyGen's datapackage FAIRness assessment using the CFDE-C2M2-FAIR-Datapackage-Assessment appyter. Here is the result -

From the FAIRshake insignia results, we noticed that we have some red squares, and upon inspection we found that the criteria behind the red squares do not apply to us, because we are a knowledgebase that does not hold any biomedical data. The 'ratio of files associated with a subject' will always be zero because we do not have patient data. Having no information for such criteria will always produce red squares and lower the FAIRness of the datasets. To avoid this issue, we propose: 1) Grey out the squares for criteria that have no information; these criteria would be optional. 2) Add a knowledgebase option on the landing page, so that when it is selected a few of the criteria become optional and are greyed out in the insignia. Let us know what you think of the issue and the potential solutions. https://appyters.maayanlab.cloud/CFDE-C2M2-FAIR-Datapackage-Assessment/66d8628ed844a8f8e1f7330262fbc652a480ad9b/

jeet-vora commented 2 years ago

Dear Jeet,

Thank you for your feedback about this assessment appyter. As the metrics were originally devised to assert maximal coverage of the C2M2, the score reflects the fact that Interoperability at the subject level is not possible with your data; in this way it is not necessarily "wrong," just an ill-posed metric in the context of your DCC.

In light of this, we patched the appyter to better catch non-applicability by reporting NaN when, for instance, there are no subjects/biosamples at all for a relational assertion. The downside is that a DCC that has biosamples but has not associated any of them will now also be shown as not applicable.
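A minimal sketch of the patched behavior, assuming the metric is a simple associated/total ratio (the function names and the grey/red/blue mapping here are illustrative, not the appyter's actual code):

```python
import math

def subject_file_ratio(n_files_with_subject: int, n_subjects: int, n_files: int) -> float:
    """Fraction of files associated with a subject.

    If the datapackage declares no subjects (or no files) at all, the
    metric is not applicable: return NaN instead of 0.0 so the insignia
    square can be greyed out rather than shown as a failing red.
    """
    if n_subjects == 0 or n_files == 0:
        return float("nan")  # not applicable, not a failing score
    return n_files_with_subject / n_files

def insignia_color(score: float) -> str:
    """Map a metric score to an insignia square color (illustrative)."""
    if math.isnan(score):
        return "grey"  # no information: treated as optional
    return "red" if score == 0.0 else "blue"
```

Under this scheme a knowledgebase with zero declared subjects gets a grey square instead of a red one, which is the behavior the patch aims for.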

A "knowledgebase" option does not seem appropriate to us since knowledge derived from experiments on biological entities should be able to associate the original subject (albeit likely with great difficulty) even if it's a knowledge base. Perhaps a way to address this is for you to construct your own rubric which contains a subset of the CFDE metrics and perhaps additional ones that are more suited to the data you collect.

We ran your datapackage against the new patch and got this:

[screenshot: FAIRshake insignia from the patched assessment]

We were confused as to why you had taxonomy and anatomy defined if indeed your data can have no association with biosamples.

jeet-vora commented 2 years ago

Dear Daniel,

Thank you for your quick response. I am Rene, and I am working with Jeet on the GlyGen CFDE project. As you said, the current metric with the red spots gives the impression that we could improve our FAIRness when in fact we cannot (due to the type of data/database we have). I also understand that relaxing all the corresponding constraints to make them fit knowledgebases would be counterproductive.

I agree that the best option might be to construct our own rubric. I wonder how much support your API would offer for this and for the evaluation. Or would we have to implement our own version of the appyter to do an automated FAIRness evaluation?

> We were confused as to why you had taxonomy and anatomy defined if indeed your data can have no association with biosamples.

Let's use UniProt as an example for knowledgebases (GlyGen is pretty much the same). UniProt has information about proteins (human proteins), and let's say about PTMs in different tissues. But this information comes from dozens of papers and dozens of groups. How many samples were analyzed is unknown at this stage. Maybe all the data from one group/paper come from the same sample, or maybe not. All we know is that it's a human protein with PTMs coming from, e.g., liver cells. As with the sample information, the subject information is lost to us: all the data might be coming from the same subject, or maybe not.

One solution to this problem could be to create "THE human liver" biosample and associate all data with it (and similarly "THE human" subject). But this is misleading at best, since it gives the impression that all the data came from a single sample from a single subject. We could also create a separate artificial biosample and subject for each PTM, but that is most likely incorrect as well. When we started and talked with the coordination center people about this, they did not like these ideas either, but rather recommended making use of collections, which allow us to associate proteins/glycans with taxonomy and anatomy.
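The collection-based route the coordination center recommended can be sketched as C2M2 association tables. The column names below follow the usual C2M2 pattern for collection association tables, and the collection id and ontology terms are made up for illustration:

```python
import csv
import io

# Hypothetical GlyGen collection linking human-liver PTM data to
# taxonomy (human, NCBI:txid9606) and anatomy (liver, UBERON:0002107)
# without inventing per-paper biosamples or subjects.
collection_taxonomy = [
    {"collection_id_namespace": "glygen",
     "collection_local_id": "human_liver_ptms",
     "taxon": "NCBI:txid9606"},
]
collection_anatomy = [
    {"collection_id_namespace": "glygen",
     "collection_local_id": "human_liver_ptms",
     "anatomy": "UBERON:0002107"},
]

def to_tsv(rows: list) -> str:
    """Serialize association rows as a C2M2-style TSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_tsv(collection_taxonomy))
print(to_tsv(collection_anatomy))
```

The point of the design is visible here: taxonomy and anatomy hang off the collection, so no fabricated biosample or subject rows are needed.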

I hope this helps.

Best, René

jeet-vora commented 2 years ago

Based on a discussion with Avi, he proposed three things: 1) Use the appyter and submit the assessment report to NIH as is. 2) Create a GlyGen-specific appyter for the submission package and also for GlyGen datasets. 3) Use the FAIRshake API for GlyGen dataset assessments and create a report.
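For the FAIRshake API option, a rough sketch of assembling one assessment record per dataset. The payload fields, rubric id, and dataset URL below are placeholders, not the actual FAIRshake API schema, which should be checked before submitting anything:

```python
import json

def build_assessment(target_url: str, rubric_id: int, answers: dict) -> dict:
    """Assemble one assessment record for a GlyGen dataset.

    `answers` maps a metric id to a score in [0, 1]. The exact payload
    shape FAIRshake expects may differ; this only shows the bookkeeping
    needed to drive assessments programmatically.
    """
    return {
        "target": target_url,
        "rubric": rubric_id,
        "answers": [{"metric": m, "answer": s} for m, s in sorted(answers.items())],
    }

payload = build_assessment(
    "https://data.glygen.org/GLY_000001",  # example dataset landing page
    rubric_id=25,                          # placeholder rubric id
    answers={1: 1.0, 2: 0.5},              # placeholder metric scores
)
print(json.dumps(payload, indent=2))
```

A batch report would then just be this function mapped over every GlyGen dataset and the results POSTed to the service.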

Discussion with Rene

Short term - use the appyter and submit the report to NIH. Long term - create a GlyGen-specific appyter using GlyGen-defined metrics. Additional - each glycan and protein detail page can have its own metric to denote completeness or missing information in a section, e.g. one blue block for glycosylation information.
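The per-page completeness idea can be prototyped as a tiny block renderer; the section names and color scheme here are examples, not a fixed GlyGen schema:

```python
def section_blocks(sections: dict) -> dict:
    """Map each detail-page section to an insignia-style block color.

    A section with any data gets a blue block; an empty section gets a
    grey one, signalling missing rather than failing information.
    """
    return {name: ("blue" if rows else "grey") for name, rows in sections.items()}

# Illustrative protein detail page: one populated section, one empty.
protein_page = {
    "glycosylation": [{"site": "N123", "glycan": "G17689DH"}],
    "phosphorylation": [],  # no data recorded for this protein
}
# → {'glycosylation': 'blue', 'phosphorylation': 'grey'}
```

This mirrors the grey-square convention discussed above: an empty section is shown as "no information" rather than as a defect.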