Metadata standardization

khoroshevskyi commented 2 years ago

At the moment, geofetch can download, filter, and save metadata for specific accessions in GEO. But metadata in GEO is stored in different, messy ways. Some of the information is redundant, and the same information can be stored in different places.

For example, a sample's genome information may be stored under 3 (or more) different dictionary keys:

"Sample_description": ["assembly: hg19", ...]
"Sample_characteristics_ch1": ["genome build: hg19", ...]
"Sample_data_processing": ["Genome_build: hg19", ...]

To create a good, standardized PEP .csv metadata file, all of this information has to be carefully preprocessed. This would be especially useful for creating a new endpoint in pephub.

In my opinion, we have to create a new class, or a set of functions, that is separate from geofetch and standardizes all GEO metadata.
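
As a rough illustration of the idea, here is a minimal sketch for the genome example above. It assumes each sample is a dict mapping GEO keys to lists of "label: value" strings. The names `GENOME_SOURCE_KEYS`, `GENOME_ALIASES`, and `extract_genome` are hypothetical and not part of geofetch, and the alias list is only a guess at what a real one would need to cover.

```python
import re

# Hypothetical: GEO sample keys that may carry genome information.
GENOME_SOURCE_KEYS = (
    "Sample_characteristics_ch1",
    "Sample_data_processing",
    "Sample_description",
)

# Hypothetical: labels treated as "genome", compared after normalization.
GENOME_ALIASES = {"assembly", "genome build", "genome"}


def _normalize(label):
    """Lowercase a label and collapse underscores, dashes, and spaces."""
    return re.sub(r"[\s_\-]+", " ", label).strip().lower()


def extract_genome(sample):
    """Return the first genome build found in a sample dict, or None."""
    for key in GENOME_SOURCE_KEYS:
        for entry in sample.get(key, []):
            # Split "Genome_build: hg19" into label and value.
            label, sep, value = entry.partition(":")
            if sep and _normalize(label) in GENOME_ALIASES:
                return value.strip()
    return None


sample = {
    "Sample_title": ["example sample"],
    "Sample_data_processing": ["Genome_build: hg19"],
}
print(extract_genome(sample))  # hg19
```

The same pattern (a list of source keys plus a set of aliases per target attribute) could be repeated for other attributes, but free-text fields would likely still need manual review.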

khoroshevskyi commented 2 years ago

@nsheff @nleroy917

nsheff commented 2 years ago

Yes. I think you are right that this is outside the scope of geofetch, at the moment. The first goal needs to be just to get the data as it exists. The next goal could be to sanitize and unify it.

The first step can be completely automated and that's what geofetch should do.

This second step is a much larger project and will require a human to be involved. It could also be an application area for some techniques from natural language processing.

I think we should start thinking about this but it is not going to be solved right at the beginning, so don't let it hold up finishing the first goal.

nleroy917 commented 2 years ago

> But metadata in GEO is stored in different, messy ways.

This seems like the motivation behind the NIH’s Big Data to Knowledge initiative. It sounds like standardizing messy metadata was the goal of things like DataMed and the DATS model.

pepkit / geofetch

Metadata standardization #47